Hello, this session is on updating firmware using Ironic and Redfish in the data center and at the edge. My name is Chris Dearborn. I'm a senior principal software engineer and I work for Dell, and with me is David Patterson.

Hi, I'm David Patterson. I'm a senior principal engineer with Dell. I've been working with OpenStack for the past six or seven years, and I'm primarily focused on the edge right now, so I will touch on points where firmware update relates to the edge during this presentation.

Okay, so in Victoria, firmware update now has a generic implementation in Ironic. The implementation uses Redfish to communicate with the BMCs, which means it should work with any BMC that supports the Redfish SimpleUpdate method. We tested this against a variety of Dell server models: the PowerEdge R640, XE2420, R6515, and also the R630 XD. We've tested updating firmware for the iDRAC, the BIOS, Intel NICs, the PERC H740P RAID controller, and power supplies. The firmware update feature should work for all firmware updates that are available on support.dell.com.

Currently the only supported protocols for image transfer by the BMC are HTTP and HTTPS. Firmware images need to be placed manually on a web server that's accessible to the BMC over the network. This could be the Ironic conductor web server.

We have a few different deployment scenarios. This first one is for the data center. It shows Ironic with an independent web server.
The firmware images are hosted on that web server. Ironic can communicate with the BMCs over the LAN in the data center, and again the protocol is HTTP or HTTPS.

In the next deployment scenario, the firmware images are hosted directly on the Ironic web server, which is pretty much the only difference. As noted on the slide, this assumes that the Ironic web server is listening on the BMC management network.

On to the firmware update functionality. The node must be in the manageable state in order to run the cleaning step. Multiple firmware updates may be executed either as one cleaning step or as separate cleaning steps. The updates are applied to the server sequentially, in the order given, and the server is rebooted once per update. All devices that a firmware update applies to are updated at once, at least on Dell servers. What this means is that if you have a server with multiple identical NICs in it, the firmware on those NICs will all be updated at the same time when you do one firmware update.

A wait time in seconds may be specified for each firmware update. This is primarily used when you're updating the firmware on the BMC. It causes the cleaning step to wait for the period of time that you specify before declaring success and then either completing or proceeding with the next update. It allows the BMC time to come back up before Ironic continues trying to use it.

A couple more points. If you place a node into maintenance mode during a series of firmware updates, it will cause those updates to pause. You can put the node into maintenance mode using the command shown on the slide. Any firmware update that's currently in progress will continue to completion, and then the remaining firmware updates will be paused. You can also take the node out of maintenance mode, and that will cause any pending firmware updates to resume. If a firmware update in a series of firmware updates fails, it will cause the cleaning step to fail immediately, which means that the remaining firmware updates will not be applied to the node.

The first step in applying firmware updates is to download the firmware images and stage them on the web server that's accessible to the BMC. The next thing you need to do is create a JSON file containing the update firmware cleaning step that you want executed. The format of the cleaning step is shown on the slide. Some things to note: the interface will always be set to management, and the step will always be set to update_firmware. Then in the args section you specify the list of firmware images that you want applied. For each firmware image, you are required to specify the URL of the firmware image hosted on the web server, and you can optionally specify the wait time that we talked about before, which causes the cleaning step to wait for a period of time before declaring success.

When you're doing firmware updates, you want to be very careful, because removing power from a server that is in the process of doing firmware updates could result in devices in the server, or the server itself, becoming inoperable. Basically, you can brick things by doing firmware updates. So you may want to check the weather and make sure there are no big storms coming your way before you kick off firmware updates.

The next step is to start that cleaning step. You do that using the openstack baremetal node clean command, passing it --clean-steps and then the name of your JSON file. Optionally, instead of specifying the name of the file on the command line, you can specify the blob of JSON directly.

To monitor the progress of the firmware update, you can run openstack baremetal node list. The node will initially transition to the cleaning state and then into the clean wait state. It will sit in the clean wait state for a while, while the firmware updates are running. For maximum detail you can always look in the Ironic conductor log, where you'll see Redfish task GET responses from the BMC that contain all of the detail that the BMC is reporting back. You can also log in to the BMC GUI and view the job queue. And finally, you can always bring up the node's virtual console and watch the firmware updates as they progress. If a failure does occur, you can get the error by running openstack baremetal node show with the UUID and grepping for last_error.

So now we're up to edge considerations. You want to take this, Dave?

Sure, I'd be happy to. The firmware update Chris is illustrating can be done to edge nodes, so you can run the update right over the WAN. The major caveat is that we have to have a DHCP relay in place at the edge site in order for the IPA image to boot up and make a DHCP request to the core OpenStack deployment. The relay forwards the DHCP request to the DHCP instance running on the core OpenStack, which then gives an IP back to the node at the edge, and things proceed as normal.

Another consideration when you're doing firmware updates for edge nodes: I would highly recommend that you put your firmware images on a local HTTP server. You could pull the images all the way over from the core OpenStack, but you're going to have a lot of backhaul, it's going to take longer, and if you have any kind of network outage you won't be able to get to the images. So I highly recommend you have some kind of web server inside your edge site to host the images.

I would like to revisit the topology that we skimmed over for the edge, just to illustrate what I talked about in these two slides. Chris, could you bring up the other graph?
So here it illustrates where the DHCP relay would sit at your edge site. You talk to the BMC to get the server to start and tell it to boot from the IPA image. But when the server actually boots the IPA image, it's going to fire a DHCP request, and this is where the relay comes into play. The request goes to the relay, and the relay forwards it to OpenStack. In a default OpenStack deployment, the dnsmasq instance managed by Neutron is the actual DHCP server. It will marshal the response back to the relay, the response will get to the server, and everything else will work just as Chris illustrated in the other two deployment scenarios.

So, some caveats. There is no specific implementation for rolling back a firmware image. If you want to roll back a firmware image, you basically update to whatever image it is that you want. Naturally, rolling back then requires that the old firmware image be available on the web server. One thing to note is that the ability to skip firmware versions while you're updating or rolling back is highly dependent on the capabilities of the firmware itself. Some firmware allows you to jump straight from one major version to another; other firmware requires you to walk through minor versions stepwise. With Dell servers, if you try to do a firmware update to a version of the firmware that's already installed, the firmware will be reflashed, that is, reinstalled on the server. How this behaves with other vendors may be different.

So, troubleshooting. To find further error details beyond the last error that we talked about before, you can look in the Ironic conductor log and search for "error" from the bottom up. Look at the Redfish message that occurs just prior to the error and, of course, always look for a stack trace. It's worth noting that the location of the Ironic conductor log may vary based on your OpenStack distribution. When you're updating the firmware on the BMC, loss of connectivity to the BMC is normal and expected, so you'll see in the log that Ironic is unable to communicate with the BMC for a period of time. Again, it's perfectly normal for this to happen. Another thing you can do to troubleshoot is to log in to the BMC GUI and take a look at the job queue or the event log. You can also bring up the virtual console for the node and see what's going on. Another thing you can do is use a browser, curl, wget, or some other tool like that to validate the URLs for the firmware images that you're passing in the JSON and make sure they are correct.

So, addressing clean step timeouts. The timeout for one cleaning step is limited. If you look in ironic.conf, in the conductor stanza there's a property called clean callback timeout, and it defaults to 30 minutes. This limits you to a total of 30 minutes for executing a whole series of firmware updates in a single cleaning step. To get around the 30-minute limitation, you can split your firmware updates into separate cleaning steps. The example on the slide shows what the contents of your JSON file could be with two cleaning steps instead of one. The first one updates the firmware on the BMC and does that 300-second wait, and then the second updates the firmware on the NICs.

So finally, we have a demo for you. It demonstrates doing a firmware update on a Dell server, updating the iDRAC firmware and the Intel NIC firmware.

Okay, so we're going to show you basically how the whole process works. We start with the iDRAC firmware at version 4.20, as you can see here. We have a JSON file that we're passing in with the two firmware update steps, one for the NIC and one for the iDRAC. As you can see, the node is in the manageable state.
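
For readers following along without the slides, a clean-steps file of the kind passed in here could be generated with a short script. This is only a sketch under stated assumptions: the key names (args, firmware_images, url, wait) follow the format described earlier in the talk as we understand it from the upstream Ironic documentation, the helper name firmware_clean_step is hypothetical, and the image URLs are placeholders, not the ones used in the demo.

```python
import json
from urllib.parse import urlparse


def firmware_clean_step(images):
    """Build one update_firmware clean step from (url, wait_seconds) pairs.

    Hypothetical helper for illustration; pass wait_seconds=None to omit
    the optional wait.
    """
    firmware_images = []
    for url, wait in images:
        # Per the talk, only HTTP and HTTPS image transfers are supported.
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"unsupported firmware URL scheme: {url}")
        entry = {"url": url}
        if wait is not None:
            # Seconds to wait after the update before declaring success.
            entry["wait"] = wait
        firmware_images.append(entry)
    return {
        "interface": "management",   # always "management" for this step
        "step": "update_firmware",   # always "update_firmware"
        "args": {"firmware_images": firmware_images},
    }


# Placeholder URLs on a web server reachable from the BMC (assumptions).
BMC_IMAGE = "http://192.0.2.10/firmware/bmc_firmware.exe"
NIC_IMAGE = "http://192.0.2.10/firmware/nic_firmware.exe"

# Two separate clean steps, so each one gets its own clean step timeout:
# the BMC update with a 300-second wait, then the NIC update.
clean_steps = [
    firmware_clean_step([(BMC_IMAGE, 300)]),
    firmware_clean_step([(NIC_IMAGE, None)]),
]

clean_steps_json = json.dumps(clean_steps, indent=2)
print(clean_steps_json)
```

The printed JSON can be saved to a file and given to openstack baremetal node clean with --clean-steps. Putting both images in one call to the helper would instead produce a single clean step that shares one timeout.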
So that means it's ready for a firmware update. Now we fire the openstack baremetal node clean command, passing in the clean steps, and the node immediately goes into clean wait. As you can see in the iDRAC, the node is starting up, and it's going to make a PXE boot request. It will get the IPA image, boot the IPA image, and get an IP address. The node is powered off, and you can see now that the iDRAC is downloading the firmware image and a new job was created. The node's rebooting. Again, it will PXE boot from the IPA image. And now the iDRAC is completely shut down.

This is some time later, and the iDRAC has come back up. You can see that the firmware has been updated to 4.22.00.00. So now it's going to do another update, for the network adapter, the Intel NIC. Currently you can see we're at 19.0.12. You can see the old job for updating the iDRAC is complete, and it's downloading the firmware for the network adapter. The node's rebooting again, and again you can see it booting into the IPA image. If this were an edge site, what would be happening is the DHCP request would go out to the DHCP relay, which would then forward the request to the DHCP server instance running in the core OpenStack environment. You can see the job running in the console in the background there, and you can see it in the iDRAC UI as well.

Fast forwarding, the job is 100% done and it turns off the machine. Now we can look at the firmware for the NIC adapter and see that it's updated. As we already pointed out, 4.22 is on the iDRAC, and the NIC adapter is now up to 19.5.12. If you do another baremetal node show on the clean step, you can see that the JSON is now empty, as we executed all of the cleaning steps. And that's it.

So to finish up, we will talk a little bit about settings that affect firmware update in the ironic.conf file. The first one is in the redfish stanza.
It's called firmware update status interval, and this is the interval in seconds between polls of the BMC for firmware update status. It defaults to 60 seconds. If you set it to zero, it will completely disable this polling. The next one, also in the redfish stanza, is firmware update fail interval. This is the interval in seconds between checks for failed firmware updates. This also defaults to 60 seconds, and again you can set it to zero, which will completely disable it. This task cleans up temporary state that's on the node in Ironic and prepares the node for further use by Ironic. And then finally, in the conductor stanza, there's the clean callback timeout, which we talked a little bit about before. This is the amount of time that is allowed for a single cleaning step to run before timing out. It defaults to 1800 seconds, or 30 minutes. With this particular setting, if you change it to zero, this equates to no timeout.

For further information, you can take a look at the upstream documentation for both the redfish and idrac hardware types. In addition, you can look at the node cleaning documentation for Ironic. And finally, you can look at the two DMTF documents, the Redfish Specification and the Redfish Schema Index.

And that concludes our presentation. Thank you very much.

Thanks very much.
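
As a reference for the three settings covered at the end of the talk, the sketch below parses an ironic.conf-style fragment and reports what each value means. The underscore-separated option names are our assumption about how the spoken names are spelled in the file, the defaults are the ones stated in the talk, and summarize_settings is a hypothetical helper, not part of Ironic.

```python
import configparser

# Fragment in ironic.conf style. Option names are assumed spellings of the
# settings named in the talk; values are the defaults the speakers gave.
SAMPLE_CONF = """
[redfish]
firmware_update_status_interval = 60
firmware_update_fail_interval = 60

[conductor]
clean_callback_timeout = 1800
"""


def summarize_settings(conf_text):
    """Describe the firmware-update-related settings in conf_text.

    Missing options fall back to the defaults stated in the talk.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    status = parser.getint(
        "redfish", "firmware_update_status_interval", fallback=60)
    fail = parser.getint(
        "redfish", "firmware_update_fail_interval", fallback=60)
    timeout = parser.getint(
        "conductor", "clean_callback_timeout", fallback=1800)
    return {
        # Zero disables the periodic status polling entirely.
        "status_polling": "disabled" if status == 0 else f"every {status}s",
        # Zero disables the periodic check for failed updates.
        "failure_check": "disabled" if fail == 0 else f"every {fail}s",
        # Zero means a single cleaning step never times out.
        "clean_step_timeout": (
            "no timeout" if timeout == 0 else f"{timeout // 60} min"),
    }


for name, meaning in summarize_settings(SAMPLE_CONF).items():
    print(f"{name}: {meaning}")
```

Raising the conductor timeout is an alternative to splitting updates across cleaning steps, at the cost of waiting longer before a hung step is detected.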