Can you hear me okay? Hi there. My name is John Villalobos; I'm with Intel Corporation. This is Vasyl Saienko with Mirantis, and also Vlad Drok, also with Mirantis. We're going to be discussing "Ironic Grenade: Blowing Up Our Upgrades." It's about our experiences in adding Grenade support to Ironic. So this is our unofficial mascot, a grenade.

Today we want to talk about how we made Grenade work in Ironic. We'll be going over what Grenade and Ironic are, Grenade as a plugin, and the phases of Grenade. Then Vasyl will go into depth on the networking of Ironic in DevStack and Grenade, Vlad will go into depth about the Grenade testing difficulties we had for Ironic, and then we'll discuss the future work we have planned.

So what is Grenade, and why do we want to use it? Grenade is a test harness to exercise the OpenStack upgrade process between releases. What we do is bring up the previous stable branch, in this case stable/newton, test it, upgrade it, run it, and test it again. This allows us to do a cold upgrade as opposed to a rolling upgrade. With rolling upgrades there is, hopefully, no downtime, whereas with a cold upgrade we actually bring everything down, upgrade the database, upgrade the services, and bring it back up. So this is a step toward our goal of getting to rolling upgrades. And there's a link there to the Grenade docs.

If you didn't know what Ironic is: Ironic is used for bare metal provisioning, and you normally go through the Nova service to launch bare metal nodes.

If you have your own project and you would like to run Grenade as a plugin for your project, a little tip here. You need to create, like we did, an openstack/ironic/devstack/upgrade/settings file; for your project it would be openstack/foo/devstack/upgrade/settings. And in project-config, you're going to need to create a grenade job. One of the key points is that in that project-config change, you need to set the GRENADE_PLUGINRC environment variable to point to your project, so that Grenade knows it will be run as a plugin.

So Grenade has various resource phases. Initially, we stand up DevStack, running off stable/newton in this case, start the services, and run a smoke test. Then we go through phases such as the early-create phase; this might be networking setup. Then we have the create phase, which creates resources; for example, Nova will create a user, security groups, et cetera. When we're talking about Grenade, we are upgrading everything, not just Ironic; it's going to be upgrading Nova, Neutron, things like that. Then we have the verify pre-upgrade phase, where we verify that those resources were created. Then we shut everything down; this is what I was talking about with a cold upgrade. We save the state, shut everything down, and then run a verify-no-API phase: things that don't need the services up but should still be working. In our case, we ping a bare metal system. Then we upgrade to the current master with the proposed patch, because we run this in our gate, and start the services. Then we verify again in the post-upgrade phase, then run the destroy phase, and Vasyl will go into detail about how these things impact our networking and other things. And then finally, we run a final smoke test.
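To make those phases a bit more concrete: Grenade drives the resource steps through a per-phase resources script, following the Grenade plugin convention. Here is a minimal sketch of that dispatch pattern; the comments are illustrative, not a copy of Ironic's actual script:

    # devstack/upgrade/resources.sh -- grenade invokes this with the phase name
    case $1 in
        early_create)   # networking setup, e.g. an extra non-overlapping network
            ;;
        create)         # create resources: boot an instance on an Ironic node
            ;;
        verify)         # pre- and post-upgrade: check the instance is alive
            ;;
        verify_noapi)   # services are down: just ping the bare metal instance
            ;;
        destroy)        # tear down the instance and the networks we created
            ;;
    esac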
And now I'm going to hand it over to Vasyl.

Yeah, so, what should you know first before I start talking about networking? As John mentioned, Ironic works with hardware servers, but on CI we emulate them with KVM VMs. And Ironic requires several things. It requires access to the VM during provisioning from the control plane, so we need L3 connectivity between the conductor and the VM during provisioning. That's why the networking looks a little bit complex.

On this slide, on this picture, you can see what the network connectivity in Neutron with OVS looks like before we add the Ironic VMs. The important thing is that when we create a new network in Neutron, Neutron picks a random internal VLAN tag for this network. You can see this tag on the ports inside the integration bridge. Then Neutron connects its namespaces, the router namespace and the DHCP namespace, and sets the tags inside the integration bridge.

The next step is to create the bare metal bridge and create the Ironic nodes. The step after that is to set up the connection between the nodes and the Neutron network, so we should pick and set the right tag on the OVS tap port inside the integration bridge. Only after this can we launch the base smoke tests, as we then have network connectivity to the VMs.

The next thing that happens is the early-create phase. During this phase, we create the resources. We create a new network; actually, it's more of a Neutron resource than an Ironic one, but we need to create it because we need a non-overlapping IP subnet. In this picture, you can see that when we created the new network, Neutron picked tag 20 and set it on the appropriate ports. Now we need to make sure that our new network, our new Neutron resource, works, so we need to move our VMs to this new network. At this step, we do not create and do not enroll more Ironic nodes, because in a real environment they are already in the cluster, right? So we need to change the tag on the OVS tap port, and only after this can we verify the resource.

The next thing that happens is the create and verify resource phases. We create an instance, it is placed on one of the Ironic nodes, and then we launch verification, which is actually just pinging the instance. The next step is to shut down all the services and make sure that our resources are accessible while the services are in the shutdown state. So we shut down the services and verify that the resources are available.

The next step is to upgrade our components. We upgrade them, run the migrations, and start the services. An important point here is that when we restarted Neutron, Neutron picked a new tag for the previous network. So now the private network is tagged with tag 11 and the ironic grenade network with tag 21, but our VMs are still connected and marked with tag 20. The next step is to fix that and update the tag value on the OVS tap interface. At this moment the VMs, the Ironic nodes, are connected back to the grenade network, so we can safely verify that the resource is alive.

The next step is to destroy the resources: we destroy the instance, we destroy the networks we created, and now we should move the Ironic nodes back to the private network. So we just update the tag value to 11 in this example, and run the smoke test on the upgraded environment.
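To make the retagging step concrete: it boils down to a single OVS operation on the integration bridge. A sketch, with hypothetical port names and the tag values from the slides:

    # After Neutron restarts, it re-tags its networks (private=11, grenade net=21),
    # but the Ironic VM tap ports in br-int still carry the old tag (20).
    # Read the new tag from a Neutron port on the same network...
    NEW_TAG=$(sudo ovs-vsctl get Port tap-dhcp-grenade tag)   # hypothetical port name
    # ...and move the node's tap port onto it.
    sudo ovs-vsctl set Port tap-ironic-node-0 tag=$NEW_TAG    # hypothetical port name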
At this point, I will give the microphone to Vlad.

Thanks. So what were the difficulties we encountered when we started grenade testing? First of all, the Ironic hypervisor is a bit different from other Nova hypervisors. We obviously cannot create a hardware node; we need to use what is created by DevStack in our case, or what is present in a real data center, or whatever. So in the grenade job, we have seven VMs that are created inside the DevStack VM, and we use those. Apart from that, we have cleaning, which also doesn't have any analog in other hypervisors.

This brings some inconsistencies between the state of the Nova instance and the Ironic node, because when the request to delete an instance comes into Nova, we just call Ironic to essentially update the ports and remove the instance UUID from the node, and the instance is considered deleted. But on the Ironic side, it can still be cleaning, and this cleaning is enabled by default; while the node goes through cleaning, instances cannot be scheduled on it.

So, the tests. We have seven VMs, as I said on the previous slide. One of them will be used for verification: on the resource-create phase, an instance will be put on one of the nodes. And the smoke tests usually don't run more than three instances in a single test, so we are pretty safe running the smoke tests with concurrency one, which is what we are doing. So right before the target smoke test, the situation might look like the following: we might still have three nodes in cleaning. It's unlikely, because cleaning usually takes less time than the deployment, so during the time the verification node was being deployed, cleaning should have already finished. But this situation is possible, and even then we have three nodes available, so we can continue with the target smoke run and proceed.

Also, some tests were skipped or worked around, because some features cannot work with Ironic. One example is disk config, which extends the partition on the instance up to the end of the disk. In Ironic we don't use it; we just always extend to the end of the disk in the case of ext file systems. Also, the networking service ports remain DOWN, because we don't have an ML2 driver that actually binds them. We are working on solving this problem, and I think we already have a way forward, so I think it will be done pretty soon.

So after all of that was done, we have eight requests to boot an instance during one smoke run. Basically, we boot eight instances in the base smoke test and eight instances in the target smoke test. For comparison, in full Tempest, the last time I looked, there were 154 requests to boot an instance, so that would take significantly more time than the job currently does. Usually, running a single instance takes around five minutes: boot it, ping or SSH into it, check that everything is fine, then delete and clean it; everything combined is around five or five and a half minutes, so 154 will be a bit longer. Because of that, we need a multi-node job, so we can at least run with more concurrency.

Another thing: after we started running grenade, we discovered one issue, which is that we don't version Ironic Python Agent, the agent that runs on a node being deployed, and it led to the proposal of a spec to actually version it. The problem was that we added a couple of parameters to one of the functions in Ironic Python Agent and started using them immediately, in the same release, so all the agents from previous releases were just broken; they were returning a TypeError as a response to this request. For now we just worked around it: we check if the response is a TypeError, and if so, we retry without those parameters.

Another thing to take into account is that some parts of your DevStack plugin may be used by both the old release of your project and the new release, so they have to be written in a way that lets both the old and the new version work. In our case, there were some OpenStack client commands in the DevStack plugin that were not working on the older release, because they just were not present in the older release.
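To give a flavor of what such a guard can look like (a hypothetical example, not the exact commands from our plugin), the same plugin code can probe for the newer command and fall back to the older client:

    # `openstack baremetal` subcommands only exist on the newer release;
    # fall back to the old standalone `ironic` client where they don't.
    if openstack baremetal node list >/dev/null 2>&1; then
        openstack baremetal node list
    else
        ironic node-list
    fi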
And so all of this is about backward compatibility being important. If you have separate parts, separate projects, or separate moving parts, if I can say that, in your project that need to coexist and communicate with each other, you need to either ensure that they are the same version (well, we could just have uploaded a new IPA image into Glance and we wouldn't have encountered this problem), or you should somehow communicate the backward compatibility and guarantee that, for example, at least one release of difference between your components will work.

So multi-node is our next step in this work. The networking setup will be even more complex than what was shown above, and this multi-node job will help us test rolling upgrades, which means upgrading, for example, the API or the conductor and ensuring that they can communicate properly and everything still works. It also lets us test multiple-compute or multiple-conductor environments; we do not test that today, and we actually have some problems with takeover in multiple-conductor scenarios with some drivers, which are being fixed. And it lets us increase test concurrency, because for full Tempest we need a multi-node job, otherwise we won't make it in three hours.

I think I'm live. So one of the things Vlad was talking about is that we're working on multi-node grenade; this is to support rolling upgrade testing, and we'd love to get assistance on that. That's one priority for us this cycle, to get that going. We're almost there on multi-node without grenade; the patches are in place and working, we just need to get them merged, and then we're going to be adding the grenade testing on top of that. We do have a weekly meeting, every Wednesday at 1700 UTC, I think, and there's also an Etherpad tracking our progress in getting multi-node grenade working.

To summarize, we went over what Grenade is and why we want it, the Grenade phases, Grenade as a plugin, and the networking of Ironic in DevStack and Grenade. As you can see from that networking, we have a much more complicated situation, I think, than most projects out there, due to the fact that we have VMs simulating bare metal nodes. And then the grenade testing difficulties for Ironic. One thing that to me was good is that the IPA issue Vlad talked about helped us discover an error in our backwards compatibility; that was the goal for this grenade job, finding issues. Hopefully, if we had had grenade running earlier, the error never would have occurred, because we would have caught it. But now we do have grenade running in the gate at all times, and we'll continue working on our future work. I think we went a little quick, but thank you, and I just wonder if anybody has any questions for us. There's a link here to an Etherpad page where the presentation is available if you want to get it, and other links are available there.
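For anyone who wants to try this on their own project, here is the plugin wiring from earlier condensed into one hypothetical minimal skeleton; it follows the Grenade plugin convention but is not a copy of Ironic's actual files:

    # devstack/upgrade/settings -- sourced by grenade for both base and target
    register_project_for_upgrade ironic
    register_db_to_save ironic

    # devstack/upgrade/upgrade.sh -- the cold-upgrade step, roughly:
    # stop the services, migrate the database, restart on the new code
    stop_process ir-api
    stop_process ir-cond
    ironic-dbsync --config-file /etc/ironic/ironic.conf upgrade
    run_process ir-api "ironic-api --config-file=/etc/ironic/ironic.conf"
    run_process ir-cond "ironic-conductor --config-file=/etc/ironic/ironic.conf"

    # and in the project-config job, tell grenade to load the plugin:
    export GRENADE_PLUGINRC="enable_grenade_plugin ironic https://git.openstack.org/openstack/ironic"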
[Audience question] Are you going to gather downtime metrics somehow during the upgrade procedure?

Excuse me?

[The question is repeated] Are you going to gather downtime metrics during the upgrade procedure?

I don't know. Well, the goal of the rolling upgrades is: with multi-node, we're going to have two conductors running and two API services running, and we'll bring one down, upgrade it, and bring it back up, and then bring the second one down and bring it back up. So the goal is that the services will continue to run the whole time; the whole service won't be down, there will always be something there to do the work. So that's our goal, that we don't have downtime, but I'm not sure if you have comments about downtime metrics. Right, so with the API it should be: we bring one down, but the other one will continue to run; we upgrade the one that's down and bring it up, and then we bring down the one that's, not broken, but stable, and upgrade it. At that point all the API services are upgraded. That is the goal: to roll through upgrading them, bringing them down and back up, while continuing to have ones that are working. Any other questions?

I can repeat, or find a mic. So the question was: did I say that we use the stable branch for all the projects? And yes, the answer is yes, we do that. And you were wondering whether that means everything else has to work for this test, and if those projects were broken, would it break us? The answer to both of those is yes. So if, let's say, Nova didn't work on an upgrade, then our grenade job would fail. But Nova is already doing grenade, and they're doing grenade multi-node, so they're ahead of us there. Our dependencies are Neutron, Nova, Swift, Glance, Keystone, so we do depend on them, and yes, if they broke, that would be a problem.

So the question is: does the order in which things are upgraded matter? Grenade does have ordering in there right now on how things are upgraded; I forget which ones are upgraded first, but we add ours in there. I think we get upgraded after Nova, is that correct? Yeah, so we're upgraded after Nova; we're able to specify that, there's an ordering to it. And we have not yet gotten to the point where, when we do ours, we'd be upgrading, say, Nova one piece at a time for the multi-node. As far as I know, we're going to be focusing on just our conductor and our API, because that's already, you know: one old and one new, run the tests; okay, now both new, run the tests; then go to the API, old and new, run the tests; and then both new, run the tests. Which, I do have concerns about how long that will take. I don't know if they share my concerns over how long it will take, but I wonder how long it will take. Okay, good.

Excuse me. Do we have plans for standalone? I think it should be covered by our current job too. So for standalone Ironic, the project related to standalone Ironic is called Bifrost, and I think a grenade job should be set up there, right? Yeah. I mean, we don't explicitly test standalone, though standalone seems somewhat like a subset of what we do test. The one thing we do want to add, and we've put in an RFE about it, is a grenade job for IPA, to make sure that stable IPA continues to work with new Ironic. We do want to get that in there, but that's unrelated to standalone. Yes, sorry, correct; he corrected me, it's new IPA with old Ironic. Any other questions? Well, thank you very much for your time.