Welcome to OpenInfra Live, the Open Infrastructure Foundation's weekly show covering open source infrastructure. I am Jonathan Bryce. I am going to be your host today, and I am really excited about the show that we have in store for you. We have some great guests, and we are going to be talking about what underlies everything that we do in the tech industry. A lot of times we talk about the new exciting technologies like serverless, and Kubernetes, and containers, and virtual machines, and interplanetary file systems, and all of these kinds of things. But ultimately, all of those technologies are running on hardware. They're running on servers, and disk drives, and data centers. So how is that hardware managed in this day and age? What should we be thinking about when we want to run data centers full of hardware to enable all of those great use cases? Well, we have some excellent guests who are experts in this. They are building the software that makes this possible and running it in production, and we are going to be talking about bare metal and hardware, and Ironic, and all of the topics related to that. So, great show today. Really excited to have everybody here. We're going to start off with a presentation from Julia, who is going to walk us through some Ironic background and information. So Julia Kreger from Red Hat is going to get us started. Welcome, Julia.

Thank you, Jonathan. So when we talk about bare metal and what Ironic is: at the highest level, it is a project. It's also a service within the project. It provides a framework for managing bare metal machines and orchestrating those machines through their life cycle, and allows us to securely redeploy, reuse, rebuild, reprovision, and provide physical machines, and to remotely do things with those machines without a trip to the data center. And I know a lot of people who have been in the industry a long time might remember going into data centers at 3 a.m. For me, that's my passion; the reason I'm involved with Ironic is that I've had to go into that data center at 3 a.m. far too many times. So we can go to the next slide.

So Ironic is all about making it easier to work with hardware. We manage the machine life cycle from entry into production to removal of the hardware from production. Hardware, as Jonathan kind of hit upon, is always at the lowest level of the infrastructure. And we work to do this in a consistent, fault-tolerant, and security-aware manner while keeping the interaction easy. Going back to our original days many years ago, the graphic on the right is actually something someone drew for one of our t-shirts that we actually made, showing how it really works. We always wanted it to be perceived as magical and easy to use, at least where it needs to be, and then also to provide the flexibility and capability, where it needs to be available, to do all those special things that you sometimes need to do with hardware. So if we can go to the next slide.

When we talk about typical usage, most users think about, oh, I have a physical machine, I need to put an operating system on it, and then it's deployed and I'm done. Ironic does this, and it does it very well, actually, at scale. It drives the machines through the process to deploy them, and it supports multiple different methodologies.
We call these drivers, and you can use things like IPMI, PXE, Redfish, and virtual media, and there's nothing really stopping anyone from using something like a thumb drive with the agent installed on it, sticking it into the machine, and allowing that agent to auto-discover into Ironic. That's one of the features we actually added a while back, to be able to do things like this, because operators needed to be able to do things like that. The one thing that most users don't typically think about is that there's a much larger picture than just deployment. And we go to the next slide.

Whenever you actually have a physical machine, at some point you have to tear it down. You have to actively go, I need to undeploy this, I need to reclaim it, I need to make sure that it's clean or sanitary or ready to be reused, and then I have to update my inventory to go, oh, it's time to rebuild, I'm able to reuse this piece of hardware. And at times you might also have something in production where something is broken and you need to actually log in and rescue the machine or fix it. We call rescue a feature in the OpenStack community in terms of, oh, something's really broken, let me try and recover the machine somehow. It's a reality of software at times. Not every piece of software is 100% fault tolerant, and sometimes we do need to get into a machine. Especially at the physical hardware level, hardware breaks: disks can no longer allocate new free blocks, or the controller shorts out, or, my favorite, lightning struck next to the data center and the EM field actually caused half the switches to crash. That's a little harder to recover from, short of a power cycle.

But if we go to the next slide, the entire picture is much larger, and this is actually a very simplified view. When we go to actually deploy hardware, typically we either need to discover the hardware, or we know about the hardware and we need to enroll it, learn about it, and get more information about it, because maybe we need to get inventory information or learn about what the disks are to be able to plan our next step, or we want to use that same data to do validation of the hardware or the configuration. All of these things are possible. But before we actually send the machine into being used, we want to make sure it's clean and ready to use, and then we can deploy on it; that's what our cleaning step allows to happen. From there, the machine moves into the available state and one can deploy it. Once it's in use, someone can go, oh, I need to redeploy this machine exactly as it was; that's a whole other feature called rebuild. And again, I already mentioned rescue. A lot of times, especially with legacy systems and physical data centers, you already have hardware that's been deployed via bespoke patterns or tools or whatever, and we actually have a feature called adoption which allows you to take hardware that's already been deployed and adopt it into the active cloud, so that you can say, I have this machine, I'm going to use it with Ironic, and when I'm done with the machine, Ironic will take care of the rest.
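To make the lifecycle being described here concrete, below is a minimal sketch of driving a single node through enrollment, deployment, and teardown using openstacksdk. It assumes a reachable Ironic API, a clouds.yaml entry named "baremetal", and a Redfish-capable BMC; the node name, BMC address, credentials, and image URL are illustrative placeholders rather than anything from the show.

```python
# Minimal sketch: driving the Ironic node lifecycle with openstacksdk.
# The cloud name, BMC details, MAC address, and image URL are illustrative.
import openstack

conn = openstack.connect(cloud="baremetal")

# Enroll the node with its management (BMC) details.
node = conn.baremetal.create_node(
    name="example-node-01",
    driver="redfish",
    driver_info={
        "redfish_address": "https://10.0.0.10",
        "redfish_username": "admin",
        "redfish_password": "secret",
    },
)
# Register the NIC the node will boot from.
conn.baremetal.create_port(node_id=node.id, address="aa:bb:cc:dd:ee:01")

# enroll -> manageable (credentials verified), then -> available (cleaning runs here).
conn.baremetal.set_node_provision_state(node, "manage", wait=True)
conn.baremetal.set_node_provision_state(node, "provide", wait=True)

# Deploy an image; later, "rebuild" would redeploy the node exactly as it was.
conn.baremetal.update_node(
    node,
    instance_info={
        "image_source": "http://images.example.com/ubuntu.qcow2",
        "image_checksum": "d41d8cd98f00b204e9800998ecf8427e",
    },
)
conn.baremetal.set_node_provision_state(node, "active", wait=True)

# Tear down when finished; Ironic sends the node back through cleaning.
conn.baremetal.set_node_provision_state(node, "deleted", wait=True)
```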
So when someone undeploys it, it goes back into cleaning, and at some point, instead of sending machines back into being available, you need to actively consider, oh, I need to retire this machine. We actually have the capability to allow an operator to go, I'm going to retire this machine, I'm not going to do it right now, I want it to be done automatically, I want it to be cleaned, and no one will be able to use it again. That feature, CERN was amazing enough to add to our feature set. So we can go to the next slide.

So when we start talking about Ironic, it sits in the middle of many different things, and this is just a small snapshot of what's out there. Largely, the lower half of the screen is actually the items inside the Ironic project itself. We have tooling for talking to Redfish BMCs, we have tooling to emulate Redfish BMCs and emulate IPMI BMCs, and to manage generic networking switches. We have interaction to update Neutron and actually learn about the physical network mapping so that we can do the appropriate things. We have the ability to export metrics via Prometheus, and we have the ability to collect that metric data from the machines to begin with. We have an agent to deploy the actual operating system or configuration you desire. We have drivers, and we have Ironic at its core. We have many projects, some of which are actually used by commercial products, to deploy operating systems onto physical machines or to manage them. These projects include TripleO; Nova, which is another OpenStack project and is actually our primary bare metal service use case; Bifrost; and Metal3. Metal3 is actually used inside the Kubernetes community and is used by Airship and Red Hat OpenShift to deploy the operating system images to machines to run the workloads.

So if we go to the next slide: we've been busy. Our most recent major release was Wallaby, and we added a number of enhancements for Redfish capabilities, including the ability to perform out-of-band RAID management and to change whether secure boot is enabled. Secure boot is one of those things you generally want to have enabled, because then you're actually checking, oh, is that kernel signed? Granted, that introduces some more complexity depending on what you're doing; you can't necessarily build your own kernels at that point unless you're managing your keys. That might be a feature soon. We also added a new role-based access control model and, long story short, multi-tenancy. I'll talk about that in a moment. We added support for NVMe secure erase, which I'm sure a lot of people will be very happy about because previously it was not ideal. We added support for Anaconda-based deployments. And also, we've come to terms with the fact that when you're deploying workloads, sometimes you just need to boot an operating system; that ISO image or whatever may understand how to handle all the additional steps. So we now support the ability to say, I have an explicit ISO image, I need it to boot, I don't care about anything else, just boot it. And we support that through virtual media or network booting. One of the most powerful features is actually the ability for someone using the Ironic API directly to say, hi, I would like to tell you exactly what I need to have happen to this machine. And in all of this, we found some performance issues that we needed to start addressing. So if we go to the next slide: I promised multi-tenancy.
Ironic has a unique model. It has a model where you have the system itself, you have what are called owners, and you have what are called lessees. Owners and lessees are represented by projects in OpenStack. These are fields that are manageable by an admin in the system, and what this allows Ironic to do is provide delegated levels of access to the machines. So if you have a system user, they may have an admin role, a member role, or just a reader role, if someone needs to be able to see and do accounting, or a script needs to be able to run and do checking or inventory or whatever. People's use cases do differ. You might have owners, and that might be a trusted organization: say I have Boston University as one tenant and I have MIT as another tenant, and I need to provide metal to them and they can manage their own metal. But sometimes those organizations managing those same clusters through the API need to go, hey, researchers over at this other university, I can lease you this machine or lend it to you, and you can still do some basic things. It doesn't allow you to do everything, but it allows you to do the essentials, which is power on, power off, and, I think, actually redeploy the machine if you absolutely need to. But again, it's the most restrictive level of access.

So if we go to the next slide: this has actually driven some of our current-cycle development, which is improving our list performance. The downside of adding such major functionality is that we realized our API list performance was really not great. So we spent a lot of time working on this, and we've actually backported it into our Wallaby release because it is such a massive improvement for users. We're also working on the ability to retrieve BIOS registry setting data; this sort of information is actually the thing that tells you what a setting does. I know a lot of people, when they see a setting in a computer, may not necessarily know what it means, or it might mean different things based on the manufacturer. Another addition that we're working on is being able to explicitly say, I need this machine in BIOS boot mode or UEFI boot mode. Probably an important thing here is that UEFI boot is becoming the standard, but a lot of people still need to use BIOS mode, or a lot of people still prefer BIOS mode. On top of this, we're working on the ability to explicitly say, I always need this machine in secure boot mode. We're also working on adding the ability to have more insight into errors via the API, the whole idea here being that it's a bad idea to have to go look at the actual backend log files to figure out what happened when bad things happen. We do provide the most recent error, but it's just not enough sometimes. And some contributors are working on the ability to support attestation with Keylime. Keylime is an attestation software package that allows measurements of the machine to be taken and compared, and the whole idea with integrating it into Ironic is that as a machine works through the lifecycle, we're able to actually measure the machine and make sure it's still in the state we expect. So we can go to the next slide.

Going back to the performance, I can't really overstate how much of an improvement this is. We went from 1,100 seconds to retrieve the node list on a large test data set to about 210 or 220 seconds.
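Before the conversation goes further into performance, here is a rough illustration of the owner and lessee delegation just described, using openstacksdk. The project IDs are illustrative placeholders, and it assumes a deployment and SDK recent enough to expose the lessee field; what each role may actually do is governed by the RBAC policy configuration.

```python
# Sketch: delegating access to a node with the owner and lessee fields.
# Project IDs below are illustrative; RBAC policy decides what each role may do.
import openstack

conn = openstack.connect(cloud="baremetal")
node = conn.baremetal.find_node("example-node-01", ignore_missing=False)

conn.baremetal.update_node(
    node,
    owner="0b1e493b6d1c4db8b1d3b8f2a1f3c001",   # e.g. the project that owns the hardware
    lessee="7aa2fd5e9c3a4f39a3f1f7e2d4b5a002",  # e.g. a project borrowing the machine
)

# A lessee-scoped user can now see the node and do the basics (power on/off,
# and depending on policy, redeploy), without full administrative control.
```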
That means our retrieval rate was really bad, and we were able to make it roughly five times faster, which I'm still impressed by. We do have some additional things we could probably do. We don't believe there's a huge gain there, but if people find such issues, it would really help for us to understand and quantify them. That's the biggest problem we've always had: quantifying the actual performance. So we can go to the next slide.

In the future, we are discussing bulk operations, and possibly supporting a distinct GRUB PXE interface. Certificate management for Redfish, which specifically would be useful if you want to do your own kernel signing. Role-based access control with non-Keystone usage is something that people seem to be interested in. And we're also talking about doing automatic traiting of machines, which should allow people to go, oh, this machine supports this major capability, I can use it this way. So hopefully that will be in the future. And if we want to go to the next slide: if you want to find out more, you can do so at ironicbaremetal.org. On that website we have some blog posts, some videos, a link to the bare metal white paper, which was created by the Bare Metal SIG, and more. If you can go to the next slide: if you're interested in joining, please feel free to reach out. We have a link to our community page via a bit.ly short link for the bare metal community. And I believe that is it.

Great, well, thank you for running through the overview of Ironic. It sounds very capable. And some people who are watching may go, wow, you know, that's great, Julia, it seems like it does a lot of stuff, but is anybody actually using it? And this is where we have additional guests who are joining us today. So let's welcome Chris and Mohammed and Mark onto the show here. Welcome, fellows. So maybe you guys can just take a minute to talk a little bit about your use cases for Ironic and how you're implementing those capabilities that Julia just told us about. Why don't we start with you, Chris?

Absolutely, absolutely. Thank you very much. And just as a note, I am wearing the official t-shirt from the slide. Julia will get a kick out of that one. Good choice. At NVIDIA, we use OpenStack and Ironic quite heavily in our test environments and our proving grounds for hardware. I'm not really at liberty to share many details about what we do, but that is our main focus: to be able to provide environments for our developers and our engineers to be able to test their devices on multiple pieces of hardware and multiple operating systems. So one of our users will lease a machine and cycle through four or five operating systems in a day testing it.

That's awesome. It's something that is enabling the development teams there to work with not just a single flavor of server and operating system, but actually enabling this testing on a diverse set of hardware and software combinations in a really rapid way. Absolutely. Yeah, that's great, okay. And Mark, what about what you're doing in some of the environments you're working in?

Thank you, Jonathan. So we're at StackHPC, working in the HPC industry, high performance computing. And I suppose there are two main use cases we have, really. So perhaps if you could bring up one of the two slides that I've got, Erin. So I think it's, yep, this one, thank you. So the first use case really is based around Kayobe, which is the tool we use to deploy OpenStack clouds. And you see this seed host on the bottom left.
It's a machine we set up that runs Bifrost, which deploys Ironic in a standalone environment. So it's quite a simple setup for Ironic, relatively: just a database and a web server, no message queue or anything like that, just a single host. But it provides all the things we need to be able to provision the control plane, the hypervisors, possibly storage nodes, all of those things that are required to build a cloud upon. Obviously there's a lot more to Kayobe and Kolla Ansible and all the other tools that we use to deploy OpenStack on top of that, but the relevant part in this conversation is that Ironic provides that provisioning layer for our clouds. Could we see the next slide, please?

So the second use case we have for Ironic is for actually providing bare metal compute to users. And typically we would do that using a second instance of Ironic, not the Bifrost one running on the seed node. We have an Ironic service running on the control plane, so it's generally multi-node, highly available, integrated with OpenStack and Nova and all the other things that give it a more cloud-like, user-friendly interface. And this SMS lab that we're describing here is a good example of that. It's hosted by a company called Verne Global in Iceland. It's got 136 compute nodes, and it's a mixed hypervisor and bare metal environment. We set this up quite recently; it seems recent, but it's been going for a little while now, actually. And it provides us with test and development resources both for StackHPC as a company and also for the community. It's not freely available and open to anyone to use, but for particular use cases and people who are interested in using it, who are working in the interests of the community, it can be made available to them. And Sovereign Cloud Stack, which is part of the Gaia-X project, is currently making use of it to develop their cloud stack, so that's where it is. There's no SLA, but it is available under the URL below once you have an account. There are only four hypervisors at the moment, so it's actually mostly bare metal compute, and users get direct access to the bare metal. And that's quite nice. We need to do things like test the entire deploy process of a cloud, from bare metal through to a running system, and Ironic allows us to do that. We also have a lot of use cases with high performance computing, and often they require the highest possible performance, running on bare metal. So it's a pretty important resource for us. So I think that's all I've got to say for now. Perhaps we can move on.

Great, thank you for coming prepared with those slides as well. And on that first one, I think it's very interesting to see how you have the seed host and then you also have kind of tenant hosts. So you're using Ironic to manage the cloud infrastructure itself, as well as it being something that is then giving bare metal resources to your users as they consume the cloud from that customer perspective. So that's really cool how you're using it at multiple layers there. And Mohammed, why don't you give us a quick overview of your usage with Ironic?

Sure. So, Mohammed from VEXXHOST, and we have a public cloud running in three regions around the world right now on OpenStack. One of the main things that we've done a lot of work on and are driving towards is offering bare metal in that cloud.
And so we've been kind of pushing a bit of the limits of Ironic as we were doing that, using features such as link bonding, software RAID, multiple RAID partitions, and UEFI boot. We've definitely kept Julia and the rest of the Ironic community entertained with the amount of tiny little corners that we keep bringing up. But it's been great to work with the community. So one of the things, like I said, is we want to have a very standardized platform in order to deliver bare metal systems on the cloud as if each is just another virtual machine, purely via Nova, and provided in a way that gives the customer full redundancy across everything.

We also manage Ironic for a couple of private cloud deployments for some of our customers. Some of our customers use it for running workloads that can't really run on a virtual machine. They require physical hardware because what they're doing is things like running virtual machines to test some software. So it's not really possible; well, you can do nested virtualization, but it tends to not be so stable. So they're using this Ironic solution to get systems that they can do full virtualization inside of. We also have other customers that are doing things similar to what Chris mentioned, especially companies that specialize in building physical hardware and giving access to their QA teams and their test teams: someone is writing a driver and they need to get access to a Linux box to test that driver. It's not, I've got to make a call and get the system and put it under my desk; it's just an API command and you get the system. That introduced a whole bunch of really interesting challenges, because you've generally got things like a vendor ID and a product ID for devices, but it turns out that when you're working with companies that are building hardware, those are not necessarily accurate. There might be different steppings, or there might be different things to identify one or the other. So we worked on some really cool stuff expanding the Ironic Inspector service so that it can capture those extra small things, so that customers of the cloud can request a board that is the same model but stepping A or stepping B or something like that. The open source nature and the flexibility of Ironic allowed us to really bring in a lot of automation for this sort of platform. And what's really exciting is that because a lot of this sits behind an OpenStack API, well, if someone wants to run Jenkins against those systems, they can just plug it in, use the Jenkins JClouds plugin for OpenStack, and they're good to go. So that's kind of some of the ways that we use Ironic, and there are also a lot more benefits, but I think we can chat as we're going through this together.

Yeah, that's a great overview from everyone. I also noticed that in addition to the users that we have here on the show, Sean from Blizzard commented on LinkedIn and mentioned that they are also using it for an automated pipeline and to do basic maintenance within their cloud and their servers. So it's great to see other users out there who are chiming in and sharing their use cases. I used to manage and operate big data centers, and I still love data centers and servers, so I love hearing about all of the ways people are finding to keep working with them. So one thing that I wanted to mention, looking again at some of the comments: you made a comment about this to Mohammed, and then James on YouTube commented about the auto-traiting feature.
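To picture what "just another virtual machine, purely via Nova" means for an end user, here is a small sketch of booting a bare metal instance through the normal compute API with openstacksdk. The cloud, flavor, image, network, and keypair names are placeholders; in a real deployment the flavor would be one that maps to an Ironic resource class.

```python
# Sketch: requesting a bare metal server through the ordinary Nova/compute API.
# All names below (cloud, flavor, image, network, keypair) are placeholders.
import openstack

conn = openstack.connect(cloud="public-cloud")

image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("bm.large")      # flavor backed by an Ironic resource class
network = conn.network.find_network("tenant-net")

server = conn.compute.create_server(
    name="baremetal-worker-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
    key_name="my-keypair",
)

# Bare metal deploys take longer than VMs, so allow a generous timeout.
server = conn.compute.wait_for_server(server, wait=1800)
print(server.status)
```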
It seemed like this was something that people were interested in. So Julia, maybe you could elaborate on that a little bit more. Maybe we could just take a couple of minutes to dive into what that does and why that's something that could be very helpful as it comes out.

So in a sense, we already kind of have some of this capability in the inspector to collect a lot of the data about the machine. What we're finding is that as the scheduling mechanisms have shifted, people need the ability to create traits, and we didn't add an automatic traiting ability because one of the other conclusions we came to at the Sydney OpenStack Summit was: no one's going to agree. That was probably the best operator session I've ever been in, where all the operators agreed they could never agree. But what we need to provide is some ability to say, I consider this to have trait X and I need to be able to use it this way. And this is something that's under very early discussion in the community. So we would love to hear what you need to do, so that can help guide our development.

Okay. When you mention the community, what is the right way to interact with the community? What's the best way for them to provide that feedback or talk about issues or other requests that they might have?

Probably the best way is via IRC. And I know it seems old and antiquated, but we are spread across the world and we are a very asynchronous community. So it really helps us to be able to have this stream of consciousness that occurs in IRC, and it gives people the ability to ask questions, and people from all around the world can see the question and try to provide feedback or ask additional questions to help clarify. One of the biggest hurdles that we run into is finding the same words, and using the IRC channel really helps us keep the same language and keep finding the same meaning for words.

And I think that the community link that we shared earlier has info on connecting to IRC as well as the mailing list and all of the places where they can reach out and talk to you directly. So that's a good place to come if they've got ideas on auto-traiting and how that could be helpful and where they might like to see that go, or if they want to get involved in other ways as well. Yuri in chat also just noted that we are having our PTG and midcycles, which are coming up soon. You can find information on those via the OpenStack mailing list and via the community links on the ironicbaremetal.org website.

That's great. So I also noticed there's a question here; it's kind of, you know, what's the starter, the hello world of Ironic as a user? Resky's asking this. I feel like I should ask everyone on the call: does anyone want to take a guess? Because we actually haven't linked that off of ironicbaremetal.org as well. For me, it's DevStack with the Ironic drivers; it's a simple playground which we can spin up. It is using nested virtualization, so we do have some stability issues as the machines get older and older, but it's easy to deploy and it allows for a perfect environment to test things on. I would say that's a very integrated environment with the rest of OpenStack; for users that don't need that integration, there is Bifrost, and that actually has a very simple command: bifrost-cli install. Yes, and it looks like Gary was mentioning that one as well.
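Once one of those playgrounds is up, a first "hello world" against the Ironic API might look like the sketch below, assuming openstacksdk is installed. The no-auth endpoint shown is how a standalone Bifrost-style install typically exposes Ironic; with DevStack you would connect with normal clouds.yaml credentials instead, and the URL is only an example.

```python
# Sketch: a first look at a standalone (Bifrost-style, no-auth) Ironic API.
# The endpoint URL is an example; point it at wherever your ironic-api listens.
import openstack

conn = openstack.connect(
    auth_type="none",
    baremetal_endpoint_override="http://127.0.0.1:6385",
)

# List every node Ironic knows about, with its current states.
for node in conn.baremetal.nodes(details=True):
    print(node.name, node.power_state, node.provision_state)
```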
So, a couple of options: if you want kind of an entire cloud environment, DevStack with Ironic; if you're looking for just Ironic, just the hardware management, then Bifrost. Great. Keep those questions coming in the chat; we definitely want to tease out anything that people want to learn. I have a couple of questions of my own, though. I'm curious, from Mark or Chris or Mohammed: how do you handle the process of recycling hardware and taking it in and out of deployment and usage, cleaning it and sanitizing it and making sure that it's ready for somebody else to put a workload on it?

I can start with this. It's a really interesting question, and the interesting answer goes back to what Julia mentioned: I don't think we'll ever be able to agree on one right way of doing this. It always depends. I think it really depends on the workload. Some of our customers are running bare metal in a trusted environment, as in they're not worried about the user being malicious for the most part, and they want to speed up the way that they deploy systems. So, for example, they skip some of the cleaning steps that are part of Ironic. They won't do a full disk erase, because that takes a long time and they'd rather recycle the systems quicker, so they're not too worried about that. But, for example, for the work that we're doing to prepare for a public cloud, this is one of those things where it absolutely needs to be 100% sanitized. And so we've taken a lot of steps in starting to do this, beginning with removing access to the IPMI interface from the physical system. A lot of times the BMCs allow you to disallow access from the operating system to the IPMI, which means that no one can start changing credentials and making those systems no longer manageable. Then you've got things like firmware. Depending on the hardware vendor that you use, you actually have the ability to flash a specific firmware. And so, as slow as it might sound, we've been trying to integrate more of those cleaning steps to include things like flashing firmware so that everything's always standardized. So if somebody decided that they didn't like the firmware that was on their system and wanted to replace it, whether they had good intentions or bad intentions, the next person in line that gets that system will get a fresh system with the same exact specifications, both in the firmware and the hardware itself. That's also helpful from a support side, because you don't want to spend three days troubleshooting why a single customer's machine is having a problem, only to find out that the firmware was swapped out or something like that. And then it goes to the last step, which is wiping the disks and making sure they don't have any leftover data. That's the one that is generally handled by secure erase, which is usually done for SATA and SAS drives, or, what's really nice now, secure erase for NVMe, which is great because, like I said in the chat, if you have four-terabyte NVMe disks and you've got many of them, overwriting them can take a while and you're just eating up the life of your drives for no reason. So that's roughly it: going from the lowest layer of not allowing people to make changes, up to the software and disk-level things. Okay. Yeah, I'd certainly agree with the things you're saying there.
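As a rough illustration of how those trade-offs get expressed in practice, here is a sketch of triggering manual cleaning on one node with an explicit list of clean steps via openstacksdk. The metadata-erase step is a standard deploy-interface step, while the firmware step shown is a hypothetical placeholder; which steps actually exist depends on the hardware type in use.

```python
# Sketch: manual cleaning with an explicit list of clean steps.
# The firmware step and its arguments are hypothetical placeholders.
import openstack

conn = openstack.connect(cloud="baremetal")
node = conn.baremetal.find_node("example-node-01", ignore_missing=False)

clean_steps = [
    # Quick metadata-only erase instead of a full (slow) disk overwrite.
    {"interface": "deploy", "step": "erase_devices_metadata"},
    # Hypothetical vendor step to bring firmware back to a known baseline.
    {"interface": "management", "step": "update_firmware",
     "args": {"firmware_images": [{"url": "http://repo.example.com/bmc.bin"}]}},
]

# The node must be in the manageable state before manual cleaning can start.
conn.baremetal.set_node_provision_state(node, "manage", wait=True)
conn.baremetal.set_node_provision_state(node, "clean",
                                        clean_steps=clean_steps, wait=True)
conn.baremetal.set_node_provision_state(node, "provide", wait=True)
```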
I think there's probably sort of a semi-trusted use case in the middle, where it's not quite a public cloud but it's also not quite your mate using your server. It's someone who has perhaps signed up to a service and is known to the business, but is definitely a client of theirs rather than an internal user. So in that case, we built a cleaning step that would look at the firmware versions and verify that they matched expectations. We weren't necessarily going in and overwriting the firmware every time, because that's quite a time-consuming and possibly lifespan-reducing activity, but it does make sure that it catches the case where someone has, perhaps not even maliciously, updated firmware to try something out. If it caught an issue with that through the cleaning cycle, it would just fail the cleaning and put the node to one side until the operator could resolve the issue. So it's not an automated procedure to fix it, but it did at least mean that we'd catch those problems. And for us, due to the complexity and the varying amounts of firmware that we have to flash, we actually do most of that as a secondary process through Ansible.

Okay, so we had a question earlier about getting started and kind of the right way to try it out. Let's say that you are deciding to start implementing it. You probably have hardware already, or probably have servers already. What's the best way to start to integrate that, to onboard that, and to discover racks of hardware that are already existing or that you're building into your data center?

So I guess there are a couple of ways to approach this. One thing that we do see as a trend is people actually having complete inventories, and generally we see this when people order entire racks at a time. The downside is that generally a human was involved, and sometimes there are typos, like E instead of F and F instead of E for a MAC address, or the wrong IP address for a BMC, or whatever. So one of the major features of Ironic is support for auto-discovery. If you're running an additional service called Ironic Inspector, you're actually able to boot the machine, it gets discovered, and it gets added into Ironic. The only thing you have to supply at that point is credentials, and that's a very easy thing to automate. I actually think some people on this call have done that themselves. Have any of the people on this call done that? Maybe?

Actually, I have. A previous employer I worked for actually did buy hardware by the rack; it would come in, be delivered, be racked, and be powered on. And what we did was slightly different: we weren't able to get access to all of the DHCP services to allow us to just directly PXE boot into the inspector to discover a machine. So what we did is tap into bind: for every new DHCP lease that was assigned, we would drop the MAC address onto a RabbitMQ queue. We would have a worker, or a pool of workers, pick up that MAC address, see if it was registered in Ironic, and if not, register the node and initiate cleaning and a provide step. It worked out quite well for us, because we couldn't have full control over the DHCP environment to begin with; this was a good solution that allowed us to automate the onboarding of entire racks. And we made our lives a little easier by setting all the BMCs to DHCP addressing.
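A rough sketch of that lease-driven onboarding flow is below: a worker consumes MAC addresses from a RabbitMQ queue, checks whether Ironic already has a port with that address, and enrolls a new node if not. It assumes pika and openstacksdk; the queue name, driver choice, and BMC details are illustrative placeholders for whatever a real environment would use.

```python
# Sketch: auto-enrolling nodes from DHCP lease events published to RabbitMQ.
# Queue name, driver choice, and BMC credentials are illustrative only.
import pika
import openstack

conn = openstack.connect(cloud="baremetal")


def on_lease(channel, method, properties, body):
    mac = body.decode().strip().lower()
    # Skip MACs Ironic already knows about (any node with a matching port).
    if any(True for _ in conn.baremetal.ports(address=mac)):
        channel.basic_ack(method.delivery_tag)
        return
    node = conn.baremetal.create_node(
        driver="ipmi",
        driver_info={
            # Placeholder BMC details; a real worker would look these up,
            # e.g. from a vendor-supplied inventory sheet.
            "ipmi_address": "10.0.0.10",
            "ipmi_username": "admin",
            "ipmi_password": "secret",
        },
        extra={"discovered_from": "dhcp-lease"},
    )
    conn.baremetal.create_port(node_id=node.id, address=mac)
    # Verify management access, then run cleaning so the node ends up available.
    conn.baremetal.set_node_provision_state(node, "manage", wait=True)
    conn.baremetal.set_node_provision_state(node, "provide")
    channel.basic_ack(method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="dhcp-leases", durable=True)
channel.basic_consume(queue="dhcp-leases", on_message_callback=on_lease)
channel.start_consuming()
```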
Yeah, I have a little bit to add to that, and I think it's really interesting how we're talking a lot about the technical problems, but sometimes I feel like there are logistics problems that you end up with, because at the end of the day we are dealing with physical things. For example, something that used to make things easy: previously, most vendors would ship you a BMC with a default IPMI username and password, and they would allow you to change it later. A California law was introduced not long ago that enforces secure-by-default, which means that every single time you receive a system it has a randomized password. That creates a whole bunch of logistical challenges, because now you can't really just automate that. What's happened is that now, when you're buying at a large volume, vendors will usually just send you an Excel sheet with serial number, username, and password, so you add another step to your process. Another thing we recently had to deal with, which was a little frustrating: the machines all shipped with PXE boot enabled, but the firmware on the Mellanox NICs didn't have PXE enabled, and the only way to enable it was to actually be in an operating system, install the Mellanox tools, turn on the PXE boot feature in the firmware, and then restart, which at scale is just really hell. So we ended up writing a really small PXE-booting thing that just PXE boots into Ubuntu, installs the tools, enables the feature, and reboots again. But it's those sorts of things where you're focused here and all of a sudden you've got another problem that's completely unrelated to what you're trying to solve. A lot of times I feel those are the hardest parts: just getting the machine to the point where it can be picked up by Ironic, because everything after that can usually be automated, but the steps before can be the really tough ones, in my opinion. The in-real-life steps.

So, I mentioned at the top of the show that we're talking about kind of the lowest layer of the infrastructure stack here, the physical hardware, but ultimately that exists to run other applications, and Julia, you mentioned Metal3 and some of the integration there. Containers are obviously a growing software packaging and deployment mechanism, and Kubernetes is huge and very popular. Can you talk a little bit about how Ironic can enable container environments and Kubernetes environments and help people deploy those kinds of systems at scale?

So there's a project called Metal3 which actually takes Ironic and containerizes it, and provides a custom resource definition where one can define their machines and tell it to deploy them, and it will tell Ironic to orchestrate the machines to deploy them. That is a possibility; at the same time, someone has to define what that machine is and provide enough information. Typically, these are operating system images that someone has to either craft or have available already. And this is one of those common operational aspects when you're dealing with physical machines, just like having to go install Mellanox tools temporarily, for example, or getting shipped hardware from a vendor and all of a sudden one of the chips has changed because of supply chain issues. That's an aspect that I'm sure pretty much everyone has seen in their careers.
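For a sense of what defining a machine through that custom resource looks like, here is a rough sketch that uses the standard Kubernetes Python client to create a metal3.io BareMetalHost. The namespace, secret name, BMC address, and image URL are illustrative, and the referenced BMC credentials Secret is assumed to already exist in the cluster.

```python
# Sketch: registering a machine with Metal3 by creating a BareMetalHost resource.
# Namespace, secret name, BMC address, and image URL are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

host = {
    "apiVersion": "metal3.io/v1alpha1",
    "kind": "BareMetalHost",
    "metadata": {"name": "worker-0", "namespace": "metal3"},
    "spec": {
        "online": True,
        "bootMACAddress": "aa:bb:cc:dd:ee:01",
        "bmc": {
            "address": "redfish://10.0.0.10/redfish/v1/Systems/1",
            "credentialsName": "worker-0-bmc-secret",  # existing Secret with user/password
        },
        "image": {
            "url": "http://images.example.com/ubuntu.qcow2",
            "checksum": "http://images.example.com/ubuntu.qcow2.md5sum",
        },
    },
}

# The Metal3 operator picks this up and drives Ironic to inspect and deploy the host.
api.create_namespaced_custom_object(
    group="metal3.io", version="v1alpha1",
    namespace="metal3", plural="baremetalhosts", body=host,
)
```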
So with Metal3, basically through Kubernetes APIs, you're able to actually manage physical machines via this integration between Kubernetes and Ironic. Yes. That's very cool. Is this something that any of you have used, Mohammed, Chris, Mark?

Personally, I think the Cluster API is really, really great, because one of the things that we do when we deploy OpenStack is deploy it on top of Kubernetes. So we've got a whole sandwich of different stacks of technologies that we're putting one on top of the other. Metal3 has been really interesting, and we've been following it very closely. There are a few shortcomings, and the shortcomings seem to come not from Ironic but from Metal3 itself and its integration of Ironic. Last I checked, it was things like software RAID, which might be added by now; this was a couple of months ago, but some of the features that we really wanted, such as software RAID, were not integrated into Metal3 even though they're there in Ironic. I think it was probably really close, at the time when I looked at it, to making the cut. And so that's something I'm really, really interested in, because essentially we have kind of two use cases for bare metal: we have the customer-facing bare metal and we have the internal-facing bare metal. And for the internal one, using something like Metal3 would be perfect, because the only thing we need it for is getting Kubernetes up and running.

Mark or Chris, are you doing any work with containers or Kubernetes on top of Ironic? Not here at NVIDIA, but at a previous job I took kind of the opposite approach from Mohammed there. We went with Magnum and leveraged containers via OpenStack, running Magnum and deploying on Ironic to get a cluster. So it was a Kubernetes cluster running on OpenStack, on the physical hardware inside OpenStack, as opposed to an external cluster. That's the approach we took. Yeah, we've often used a similar approach to Chris, running Magnum with Ironic nodes as the workers. We also use Kolla to containerize Ironic. Kolla is the project we use to deploy OpenStack, and part of that, when we require it, is Ironic. So we certainly are containerizing Ironic itself and deploying it in that way.

Yes, all of these layers are interesting to watch evolve and integrate. At the end of the day, these pieces of software are all just tools, so you use them in the ways that get things done. I want to talk about multi-tenancy, because we've talked about this a couple of times, and we've talked about different levels of trust and how that affects your multi-tenancy strategies. What about at the networking layer? Is anyone running multi-tenant networks inside of these bare metal environments? Yeah, I could talk about that. Yeah, so how do you do that? What's the equipment and the drivers and everything that you're using to make that happen?

So typically, if we're running Nova and Ironic together, then we'll have Neutron involved and use that to handle the multi-tenant networking. And the driver that we tend to go for is called networking-generic-switch. This driver has got a bit of a reputation; it's had a kind of non-production aura around it for a long time. Quite frankly, if you want a driver that supports many varieties of switch, it's the one to go for. It's more feature-complete than others, including networking-ansible.
I have tried using some of the vendor-specific drivers and not been particularly happy with them. So that's the one we've gone for, and we've made a number of improvements to it over the years. It's fairly basic, but it will do the trick, and we use it in quite a number of environments now. We've used it with Juniper switches, Arista, Dell, and probably others. Okay, great. Mohammed or Chris, any multi-tenant networking experiences?

You can go ahead, Chris, I'll go after you. I was gonna say, not necessarily, but I have some interesting things coming on the horizon for some of that. I know Julia hasn't covered any of the SmartNIC support in Ironic, but it is there, or in the process of getting there. Yeah. Okay, cool.

On our side, yeah, we use networking-generic-switch a lot when we do deployments for customers. I know it seems very iffy, but overall we know that it works and it's pretty simple and straightforward. The biggest reason why I personally don't really like to go with any of the vendor-specific ones is that we always like to stay up to date with OpenStack releases, and vendors are not always keeping their drivers up to date when new releases come out. So that's one of the reasons why we like to just stick to upstream stuff: even though it might not be as featureful, at least we know that it'll cover all the necessary needs. And another thing is that a lot of times there will be some sort of lock-in when you go and use those drivers, because if you just want to change switches, you've got this subscription, and this depends on this API, and all sorts of other fun business-y stuff. But with multi-tenant networking, I think an interesting conversation around it is this, I wouldn't call it an issue, but Mark kind of hinted at a similar thing: we've got these two instances of Ironic. It would be nice to be able to have one instance of Ironic that multiple clouds interact with, so that you have one big pool of compute that you can just tap into whenever you want from different, distinct clouds. But just the way that things are built these days, with Neutron needing its own stuff, you would end up with five different Neutrons, or one Neutron that's shared but that you don't want to share. There are quite a few things that probably need to be ironed out before it's there, but I think that's going to be an interesting thing to start looking into.

I think you should join the call next week, where John Garbutt will be talking about similar things that we've done, where we've got bare metal machines and hypervisors moving between roles. So it's not quite the same thing, but it's kind of going in that direction. So, worth a listen. That sounds really interesting.

And I think, kind of going back to one of the themes regarding lock-in: one of the things I'm seeing a lot of in the industry is that companies are coming to the point where they are very resistant to lock-in, and they fundamentally recognize they need to be able to go, this vendor's not supplying what we need, we need to pivot completely, or we need to be able to pivot for all new orders, or these machines are going to sit on the loading dock for six months because the vendor hasn't fixed the firmware, but they still need to actually provide the services. So they'll pivot manufacturers in a heartbeat at this point.
So we talked a little bit about some networking drivers there. Ironic itself has a driver model too that enables these different use cases. So what does it take to create a driver for Ironic? Largely, it's just Python code. We have a driver model that is available, we have a number of examples that are out there, and we even have an unsupported repository of examples that people can install. One of them actually had the ability to do power control over SSH, and I think SSH might still be there; actually, no, SSH is gone. But basically it's a very distinct, very simple interface. We also have the ability to orchestrate using steps, which is: I have this explicit command, it's not enabled by default, and I need it to run for this machine. Those are all code at the end of the day, and it's all articulable via code. So it's actually a pretty simple way of building drivers. You just have to understand Python.

Okay, so Python drivers, and you can write them for all kinds of hardware components. And I know at an OpenStack Summit a few years back we actually had a demo with some Lego Mindstorms robots that were hitting power buttons on servers, and cool things like that are enabled when you bridge the cloud software world to these physical devices. I know someone on this call actually created a coffee driver at one point in time to drive the coffee machine. Sorry, Chris, you've been outed on this one. Yeah, yes; there is, somewhere, I believe in a Git repo, maybe under my name, an old coffee driver that would use X10 technology to be able to schedule a coffee pot brewing. And of course, you know, with Chef and Salt and things like that, you could automate a whole meal.

That's very cool. So, you know, the only limit to what you can connect with Ironic is your own imagination, it sounds like. Absolutely. What's the best driver to start learning with if somebody wants to play around with that? That's a really good question, and one I hadn't thought of in a long time. I think it really depends upon what someone's interacting with and what their context is. And that's one of the big problems that all communities tend to have, bridging that context gap, because you might be thinking one thing while the hardware's incredibly complex at times and in places. So sometimes things are actually very complex, and we've done quite a bit to hide a lot of that complexity from users so they don't have to think about it. So I don't have a good answer, unfortunately.

Okay, so I think that we are coming up on time here. I'm just scrolling back through the comments to see if there's anything that we missed. Maybe we can just put that bit.ly link up one more time to give people a pointer to where they can go connect with the Ironic community and start to learn more and see what's possible. We've scratched the surface here, but much more is possible out there, many, many more use cases, and that's a great place to get started. And speaking of use cases, we are currently running our OpenInfra Foundation user survey, and this is one of the ways that we collect information on use cases. It helps us to understand how people are using Ironic and all of these different software components, so that we can provide that feedback to the developer communities and also just get a good feel for what's happening in the user world. So if you are running these systems, please go to openstack.org slash user survey and fill out the user survey.
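Going back to the point that an Ironic driver is mostly just Python, here is a heavily simplified sketch of what a custom power interface could look like. It follows the general shape of Ironic's hardware interface base classes, but the "acmepdu" vendor client is imaginary, error handling is omitted, and a real driver would also need packaging plus an entry point so Ironic can load it.

```python
# Heavily simplified sketch of a custom Ironic power interface.
# The "acmepdu" module is an imaginary vendor SDK used only for illustration.
from ironic.common import states
from ironic.drivers import base

import acmepdu  # hypothetical client library for a smart PDU


class AcmePDUPower(base.PowerInterface):

    def get_properties(self):
        # Driver-specific fields operators must put in the node's driver_info.
        return {"acme_address": "PDU address. Required.",
                "acme_outlet": "Outlet number for this node. Required."}

    def validate(self, task):
        # Fail early if the node is missing required configuration.
        info = task.node.driver_info
        if not info.get("acme_address") or not info.get("acme_outlet"):
            raise ValueError("acme_address and acme_outlet are required")

    def _client(self, task):
        return acmepdu.Client(task.node.driver_info["acme_address"])

    def get_power_state(self, task):
        outlet = task.node.driver_info["acme_outlet"]
        is_on = self._client(task).outlet_is_on(outlet)
        return states.POWER_ON if is_on else states.POWER_OFF

    def set_power_state(self, task, power_state, timeout=None):
        outlet = task.node.driver_info["acme_outlet"]
        if power_state == states.POWER_ON:
            self._client(task).power_on(outlet)
        else:
            self._client(task).power_off(outlet)

    def reboot(self, task, timeout=None):
        self._client(task).power_cycle(task.node.driver_info["acme_outlet"])
```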
And I think we've shared that user survey link in the chat as well. So I want to thank all of the guests who joined today. This was a really, really fun episode. I love data centers, I love hardware, I love software, so this brings it all together, and we had some awesome experts that made it a great discussion. We will be back next week, and we have another good episode lined up that we're excited about. We will be having Ben Silverman, who's going to join us and give us a presentation on OpenStack Basics. This is actually an updated and refreshed version of a talk that he gave at one of our past summits, which was one of the most popular talks and one of the most popular videos that we have ever posted. So we asked Ben to come back and do a reprise and kind of bring it up to date. Really excited to have him join us next week and do that. That'll be the same time next Thursday, 1400 UTC, on YouTube and LinkedIn. So find us again next week. And I want to thank Julia and Mohammed and Mark and Chris for joining today and providing a great discussion. So with that, we will say goodbye. Great to see everyone, and we'll see you next week on OpenInfra Live. Thank you. Thank you, bye. Thank you.