Thanks for joining us and welcome to OpenInfra Live. OpenInfra Live is the Open Infrastructure Foundation's weekly show sharing production case studies, open source demos, industry conversations and the latest from the global Open Infrastructure community. This is episode 9, and we have some great content coming up in the next few weeks, so I hope you can join us every Thursday at 14:00 UTC. My name is Thierry Carrez and I will be your host for today. We're streaming live, so we'll have plenty of opportunities during the show for Q&A. Please drop your questions into the chat throughout the show and we will answer as many as we can.

Today's episode is part of a series on large-scale OpenStack infrastructure promoted by the OpenStack Large Scale SIG. We invite operators of large-scale deployments to present how they solve a given operations challenge, and to discuss their different approaches live among themselves. Three weeks ago, on episode 6, several operators joined us to present how they tackle OpenStack upgrades. OpenStack is released every six months, and keeping up to date with the latest version is often cited as the number one pain point, especially for larger deployments. We had great presentations on that show, but we could not really have a good follow-up discussion or answer that many questions, so we decided to program a new episode to have that longer discussion on upgrades.

So let's go ahead and get started. Returning today we have Joshua Slater, Senior Cloud Systems Engineer at Blizzard Entertainment; Arnaud Morin, Site Reliability Engineer at OVHcloud; Mohammed Naser, CEO at VEXXHOST; Imtiaz Chaudhury, Cloud Architect at Workday; and Belmiro Moreira, Cloud Architect at CERN.

To kick off the discussion, I'd like to start with a question. During part one, some of you explained the motivation behind upgrading, but I would like to ask everyone a slightly different question: what are the trade-offs of falling behind?

I can briefly start with this one. Some of the advantages that we found in being up to date, which I guess imply the trade-offs, come from having access to the community and all the things that it provides us. The OpenStack community maintains a set of releases: some are called maintained, some are under extended maintenance, and some are completely unmaintained. The closer you are to the maintained versions, the easier it is to get your bugs fixed. If a bug is fixed, it's going to be fixed in the most recent release first and probably backported, and in my experience developers make a good effort at backporting fixes, but they're not going to backport them six releases back. It's just a lot of work; there are going to be merge conflicts. So you're a lot more likely to get your fix backported to the newer release that you're running. And then things like security updates are obviously important. That's been one of the main drivers behind why we really aim to stay on the latest release.

Yeah, we've been really trying to maintain stability as our top priority. So when we land on a release that we know works and is predictable (maybe it does have some problems that have been fixed in a newer release, but we know what those problems are and understand them very well), we tend to stay on it as long as we're able to, until there is a release with that big feature where we're like, okay, that's the killer feature for us to finally go ahead and upgrade.
And at that point we're able to accept a little less predictability about what the new issues might be in the new release. And of course, when you fall behind, the database migrations get a little less predictable, so we do have that trade-off: the farther we get behind, the harder it might be to actually perform the upgrade.

For us, the number one reason is security, and then I would say staying close to a supported release. It's not so much new features; we see a new feature once in a while that we need, but security and staying on top of the latest patches are the main reasons to upgrade.

Yeah, for OVHcloud it's a mix of all the things you said. If you stay on an old release, you will eventually end up in situations with security issues. You can end up with, I don't know, an OpenSSL library which is deprecated; then if you want to upgrade, you cannot, because your old OpenStack version will not work with the new OpenSSL version, or situations like that. So you have to upgrade at some point, before your deprecated base software makes things very painful. The second thing is new features, because new features are developed on the new OpenStack release, of course. If you want to introduce them into your old OpenStack, you can still cherry-pick and backport them, but you will have conflicts, and it's more complex the older your OpenStack release is. So those are the two main reasons for us.

Sorry, go ahead, Belmiro. Well, I'm going to tell you what we do at CERN. At CERN, what we try to do is always run the latest releases, and basically you touched on all the points I was going to mention. We get support from the community: running the latest releases helps us a lot when we find an issue and expose that issue to the community, because it's much easier to explain an issue in the latest release than to go back five releases; even for developers, reproducing that is extremely difficult. Another point is upgrades: if we are running the latest release, then we only do a jump to the next release, and that is very well tested by all the OpenStack CI. If we are skipping releases, then we are putting a lot on our own shoulders: we need to do all those tests ourselves and make sure that everything will work fine, and we are adding a lot of work for ourselves if we do that. And then there are the features and security patches, which are always welcome.

And I think you brought up a really good point, which is the community. The majority of us here are running community-supported OpenStack, so we work directly with the community and use the open source versions of OpenStack. With that comes the fact that a community can only support so many releases back. An example of that: Ubuntu 16.04 went EOL, I think two months ago, and I know of a company that was relying on someone creating packages that were shipping for 16.04. And this individual was like, no, that's EOL, I don't have time to supply these packages. And you suddenly find yourself overnight in a situation where you can't deploy anything anymore; all of your stuff is broken. And those kinds of general community maintainers are like, you know what, it's EOL, there's not much I can do.
And there's only so much resource I can dedicate to this, because I'm not in the business of providing a long-term supported version of the packages as a personal project. So I think that's a big important part.

On the question about skipping releases, that is something that we commonly do. There are, of course, releases that you cannot skip. One example was when Nova cells were introduced: we absolutely had to upgrade to that version, and if we wanted to go to a version after that, upgrade again from there. But we do commonly skip releases, because we're not able to keep up every six months, for sure.

Yeah, that's exactly what we do. We skip a lot of releases, but we run every migration script on the database in order to make sure that everything will be consistent. I think this way of upgrading is the fast-forward mechanism: you just run the migration scripts for each release, even if you are not targeting it.

Yeah, I agree with that. And generally, the more you're involved with the code, the more you start to get a feel for the different projects. For example, a project like Keystone is a relatively simple API with a database backend, so a lot of times, other than testing it, you can just go into the code, start looking at the migrations, and you'll notice, well, there wasn't even a migration in this release, so you can just go ahead and skip it, because there were no database changes in that release. But some other projects have a bit more complexity; I think Nova and Neutron are ones where I'd be less confident playing this game of jumping releases.

So what I said is, at least in our case, what we aim for. We try to be very close to the latest release; however, it's not what we have. If you look at what we have today, CERN runs everything from the Stein release to Wallaby, with projects spread across all these different releases. And it's precisely what you mentioned: it's completely different to upgrade projects like Glance, for example, or Keystone, versus upgrading Nova or Neutron. And then when we upgrade, there are so many other factors, like operating systems as well. Distributions don't build packages for the latest releases on older platforms; for example, you don't have packages for the latest releases on CentOS 7. So it means we also need to plan the upgrade of the operating system first, before moving to the latest release.

That's actually a good segue to the next theme I wanted to ask questions around, which is upgrade frequency. In part one, we could see that everyone had a different approach to their deployments and how they upgrade them. In terms of frequency, we went from VEXXHOST upgrading everything to the latest version, to CERN, like Belmiro just said, upgrading to each new version but with each service upgraded independently, to others doing more of a big-jump approach. The community's position on this is to stay on relatively frequent upgrades every six months, but to make it easier to upgrade several releases at the same time using fast forward upgrades. So I was wondering if you could comment on how happy you are with the way you do it, whether there is an alternative approach you would like to be able to use, and whether you have considered doing fast forward upgrades.

I can start if you want. We are doing big-jump upgrades, mostly because we are not able to do small steps.
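As a rough illustration of the big-jump mechanics described above, replaying each intermediate release's schema migrations in order while the control plane is stopped, here is a minimal sketch. The directory layout and release list are hypothetical (assuming one virtualenv per release with that release's Nova installed); only the nova-manage subcommands themselves are real.

    # Hypothetical fast-forward replay: run each skipped release's schema
    # migrations in order, oldest first, with the control plane stopped.
    import subprocess

    RELEASES = ["rocky", "stein", "train"]  # example jump path, not prescriptive

    for release in RELEASES:
        manage = f"/opt/openstack/{release}/bin/nova-manage"  # per-release venv
        subprocess.run([manage, "api_db", "sync"], check=True)  # API DB schema
        subprocess.run([manage, "db", "sync"], check=True)      # cell DB schema

The exact commands and ordering vary per project and per release, which is exactly why Nova and Neutron are the projects operators are least comfortable jumping.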
We would love to do small steps, small upgrades like VEXXHOST or Mohammed is doing, but it doesn't fit our way of working, our release cycle, or our architecture. So we do big upgrades. We try to do an upgrade every year or two years, but while we are working on an upgrade, usually before it's completely done, the version we are targeting is already end of life, because it's hard to upgrade everything in one step. You have to plan everything, so it's a complex thing and it takes a lot of time. So we do big jumps, and it works correctly. What we have to take care of is the upgrade of the database and the data itself; that's the most complex part for us.

We usually upgrade one service at a time, like CERN does as well. We will have different services on different versions, because obviously some are very easy to upgrade and some are a lot more difficult. Nova and Neutron tend to be on the oldest version of all of those. But we would like to upgrade at least every one to two years. We'll probably upgrade some of the other services a little more frequently if there's a feature or security update we need from that version; we are usually prompted by a killer feature or a security update to upgrade a given service. But Nova and Neutron are our major trailing services that will probably stay on the older versions and take the most time for us to test the upgrades.

For us... sorry. No, sorry, go ahead, Imtiaz. For us, we've done big jumps, and not so frequently, very similar to, let's say, the OVHcloud story. It is difficult to roll out or do an upgrade in an enterprise setting. Also, as I mentioned, sometimes the network provider, like an SDN provider, can add additional constraints on how often and to which release you can upgrade. So those were the reasons, but that's not something we are happy with. We've done big jumps, trying to do one every two years, but moving forward we are trying to upgrade more frequently from the latest release we are working on. We recently got onto Victoria, and we are trying to get the upgrade story right from the very beginning. So we would like to upgrade as frequently as possible, every six months being our target, and we're designing our platform that way. And I think the granularity of upgrading different components is another thing we are looking forward to, as everyone touched upon. Not all services are easy to upgrade, so we are taking that approach as well: we will upgrade one component at a time.

And something that I personally found is that the challenge with upgrades is actually, most of the time, having an effective process for doing these upgrades, because that's usually the biggest reason why people are not doing them. There are usually pain points behind it. For example, in very large environments, rolling out a change across a fleet of thousands of servers is not an easy thing to do. Whatever tool you're going to try to use for it, it's going to be painful; there are going to be small changes and differences. And so that's why we took a few steps to make that as painless as possible. By using something like Kubernetes as a control plane for our entire cloud, we can easily roll out a change: Kubernetes just goes out and rolls it out across the entire fleet, one by one. Because it's also image-based, our host systems are pretty much as pristine as they can be; they run little more than a kubelet.
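For a sense of what that rollout looks like in practice, here is a minimal sketch of triggering a rolling restart of one control-plane Deployment with the Kubernetes Python client, the programmatic equivalent of `kubectl rollout restart`. The Deployment name and namespace are illustrative, not necessarily VEXXHOST's actual layout.

    import datetime
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()

    # Patching a pod-template annotation makes Kubernetes replace the
    # Deployment's pods one by one, respecting the rolling-update strategy.
    patch = {"spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt":
            datetime.datetime.utcnow().isoformat()
    }}}}}
    apps.patch_namespaced_deployment(
        name="nova-api", namespace="openstack", body=patch)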
That way there is not much of a delta from one system to another, which minimizes the "oh, surprise, this system had this weird thing happen and that system had this other weird thing happen" and makes the rollout a lot easier. And then also, because it's very easy to replicate that environment, it's easy to go ahead and see what the upgrade looks like in a staging environment or something like that. I totally see these issues and I totally agree that they are hard things, but we found that it's mostly the tooling that makes it harder, rather than OpenStack itself. I mean, we manage private clouds for a whole bunch of customers, so not only do we have to do an upgrade in our public cloud, we have to do upgrades for a bunch of private clouds that users are running. So we really needed to come up with a way of having the lifecycle managed really easily, because otherwise, if we had to deal with some of the situations Arnaud was dealing with, where it takes a year or two to upgrade a cloud, it would take us six years to get three or four clouds done, and it would just be a lot of chaos. So we invested a lot early on in that infrastructure to get this sort of thing going.

Well, I completely agree with you, Mohammed. The tooling makes a lot of difference when upgrading, and it can determine the frequency at which we do the upgrades. But OpenStack itself matters too. Over the years we've seen a huge improvement in how OpenStack deals with upgrades; it's much easier today than five years ago, there is no doubt about that. However, I still think developers need to be more aware of the challenges of operators. We continue to see, for example, configuration changes between releases that are not really functional changes; it's just a configuration option moving to a different group. That causes so much trouble for operators, who have to understand all of those changes and deploy them in a large-scale environment, which requires so much testing. And the actual outcome is not much: it's just a configuration option that moved to a different place. So I continue to think that OpenStack developers need to stay aware of operators' challenges and be careful before implementing these kinds of changes.

Yeah, I'll agree with Belmiro on that for sure, because the majority of our time spent upgrading is in the testing. The rollout to a thousand servers is really not a big deal; it's all on containers and there are tons of tools to do that. It's really the testing: the database upgrades, the config changes, the things that are unpredictable in a new release that we have to test before we gain confidence. It may be better now than it has been in the past, but it still takes us a while to get the confidence that this won't cause us an issue.
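One concrete instance of the kind of change Belmiro describes, recalled from Nova's history (so treat the details as illustrative, and the filter list here is just an example): the scheduler filter option moved from [DEFAULT] to a new group and was renamed, with no change in behavior.

    # Older release: nova.conf
    [DEFAULT]
    scheduler_default_filters = AvailabilityZoneFilter,ComputeFilter

    # Newer release: same knob, new section, new name
    [filter_scheduler]
    enabled_filters = AvailabilityZoneFilter,ComputeFilter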
So, one question I had for those who do big jumps: have you considered using fast forward upgrades, which is the community's solution for not having to go to each release every six months? You basically run the database upgrades in succession really quickly, and only bring up the control plane of the release you upgrade to, never the intermediate ones. Because from what I understand, your approach is more of a full-upgrade approach, where you migrate completely from one version to the other, rather than using the fast-forward intermediary upgrades in the middle.

It mostly depends on the projects: some are able to jump from one release to a much newer release. We can do big jumps on some projects, but on some we cannot, and we have to do these small steps. So it really depends. And I think it also depends on which release you are running right now and which release you are going to. For example, right now we are in a situation where we want to upgrade from Newton to Stein, and we know that Nova, for this upgrade, needs to go through every release in the middle in order to run the offline migrations. And when you run the online migrations, that's where you need to start at least the control plane, in order for Nova to bring everything back and make sure the data is correctly set. We do that too, but we are not starting everything: we just run the online migrations in a container, to simulate starting the control plane. So it's not exactly what upstream does, I think, but it's something similar. And it leads me to one question, but maybe I can keep it for later. Go ahead.

I'm wondering, because we are doing big upgrades, we shut down every API on our side, because we need to upgrade the database and we are more confident doing that offline. It takes, for example, one hour to migrate the data from one version to another. So my question is: how do you lower this migration time? Even if you move only from one release to another, a small jump rather than a big one, it will take time for Nova. If you have a lot of data, it will take time to go over every line in the database; it can take up to, I don't know, one hour, and during this time Nova is not available, right?
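For context, the online migrations Arnaud refers to are the batched kind that can run while services are up. A minimal sketch of driving them until they are exhausted, assuming nova-manage from the target release is on the path (the batch size is arbitrary):

    import subprocess

    # nova-manage returns 1 when a batch ran and more work remains, and 0 when
    # everything has been migrated; anything else indicates an error.
    while True:
        result = subprocess.run(
            ["nova-manage", "db", "online_data_migrations",
             "--max-count", "1000"])
        if result.returncode == 0:
            break  # all online data migrations complete
        if result.returncode != 1:
            raise SystemExit("online_data_migrations reported an error")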
Yeah, that's been a problem for us as well. We run a private cloud, not a public cloud, so it's easier for us to tell our customers that the API will be down, because they all work for us. But I mean, yeah, I guess we've sort of tried to solve it by putting the database on the most expensive hardware that we have. That's pretty much how we do it.

People call that solving problems with credit cards, but something that we've done, and that I've noticed helps a lot, is tidying up your databases. I found that most of the deployment tools out there don't necessarily set up, out of the box, all the cron jobs that are available to clean up your databases. For example, for Nova you can run something that goes and archives and purges all the extra records. So pre-upgrade, we usually try to make sure the database is the smallest it can possibly be, because, especially in the context of a public cloud where you've got VMs going up and down all the time, you can really end up with some very, very large tables. But even running that cleanup can be really stressful on the cloud, because what we've noticed is that if it hasn't run for a long time, good luck: you're going to be locking a lot of your tables and it's going to be very hard. So it's important to keep grooming your databases.

Basically, we are running that too: we are trimming our databases every day, so that is part of our procedures, one of our cron jobs.
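A minimal sketch of the kind of nightly grooming job being described here, using Nova's archive command (the batch size and scheduling are illustrative, and other projects have their own equivalents):

    import subprocess

    # Move soft-deleted rows out of the main tables into Nova's shadow tables,
    # looping until none remain, then purge the shadow tables as well.
    result = subprocess.run(
        ["nova-manage", "db", "archive_deleted_rows",
         "--max_rows", "10000", "--until-complete", "--purge"])
    # 0 means nothing needed archiving, 1 means rows were archived; anything
    # else indicates a problem.
    if result.returncode not in (0, 1):
        raise SystemExit("archive_deleted_rows failed")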
Now I'm curious. Sorry, Thierry, I have a question. Go ahead. Because, for example, on Nova and many other projects we can do that, and Nova is actually a bad example because it has soft delete; many of the newer projects don't have soft delete, so there it is not a real issue. But one project I always have issues with is Glance. For security concerns, you should not delete the image table entries, because otherwise someone could upload a new image with a previously used UUID. So how are you dealing with that?

On our side, I feel like we don't actually get that many images. I know you just published an article about 50 terabytes of images; we do get a lot of images, but generally I also find you're a bit lucky, because Glance is not one of those projects with large database migrations. So I think we're usually pretty good on that, and I haven't had too many issues, but I do see your point.

Well, my point is that, for example, the CERN cloud has been running for more than eight years, and that is one of the databases that will just continue to grow; at least there is no good solution right now to avoid that.

Well, we came up with some custom scripts to still prune it, because we actually noticed that as the database size grows, the Glance API slows down, and that in turn slows down the VM boot time. So we basically have a regular pruning script that cleans up both older images and older VMs, and that helps keep things sort of clean.

Yeah, we do the cleaning too, but even with the cleaning the database is still huge, and the migration still takes time.

I guess that leads to the next question. We already talked about a lot of your database best practices, but do you have any other tips or tricks to make the database upgrade, which was mentioned during the first part of this show as one of the most painful parts of the upgrade process, less painful? And I guess that also applies to upgrades of other external services, like RabbitMQ. Do you have any secret sauce that you apply to make RabbitMQ upgrades and database upgrades less risky, less painful, less lengthy?

I think it depends on how you use Rabbit. Some deployments use Rabbit purely as a message queue, while other deployments use Rabbit as something that actually contains state: they'll have notifications in there, and losing the state of RabbitMQ is a bad thing. So for some deployments, what we've done is split that into two different RabbitMQs. The one whose state we care about doesn't have that much traffic, but it needs to be very stable; the other one is the kind where, okay, things have blown up, so we just kill this cluster and relaunch it, and messages start flowing again. Because honestly, sometimes that's what you just need to do for Rabbit: shut it all down and bring it back up. The pace at which things can reach a meltdown stage when Rabbit is not feeling good is very quick, and you just want to get back up and running. So that's one of the things we do; we run RabbitMQ inside Kubernetes using OpenStack Helm, which provides that.

One of the things that was discussed recently at the PTG is that the company behind RabbitMQ has started to build a RabbitMQ operator. Operators in Kubernetes are great, because the folks who build these applications can put all the recovery logic, all the stuff we don't know anything about, and encode it there, so that we can have it running inside Kubernetes, managed by an operator. The operator would handle things like upgrades and making sure things are properly done, and have computers solve that problem instead of humans.
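For reference, a split like the one Mohammed describes maps onto oslo.messaging's two transport settings. A hedged nova.conf sketch, with illustrative hostnames and credentials:

    [DEFAULT]
    # RPC traffic: treat this cluster as disposable; it can be torn down and
    # relaunched if it melts down
    transport_url = rabbit://nova:SECRET@rabbit-rpc.example:5672/

    [oslo_messaging_notifications]
    # Notification traffic: the cluster whose state actually matters
    driver = messagingv2
    transport_url = rabbit://nova:SECRET@rabbit-notify.example:5672/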
The way we solved this was to run as many separate databases and RabbitMQs as possible. We run more than 80 different MySQL instances, and more or less the same number of RabbitMQs, because we run a different SQL instance for each Nova cell and for each OpenStack project. That means we don't have clusters, which makes the upgrades very simple. We have only two RabbitMQ clusters: one is for Nova at the top cell level, and the other RabbitMQ cluster is for Neutron. Those we try not to touch if they are working well; we don't touch them, and we upgrade them only every year or every two years, if there are no security issues with them. However, the other, simple RabbitMQs running for each cell are upgraded like any other package every time there is a new package.

Just a question: do you run a single instance of Rabbit per service, or do you run clusters? Per service is what I was trying to say. So, I only run one RabbitMQ instance per cell; clustering is only for Neutron, for Nova at the cell top level, and for notifications, which I forgot earlier.

Any other database or RabbitMQ upgrade tips before we move to the next topic? About Rabbit, I can just say a few words. We had to upgrade Rabbit in the last year, and to do that we had to stop the Rabbit cluster for each region. We are running one Rabbit per region, so it's kind of a big Rabbit; we would like to move to separate Rabbits, just like you do, Belmiro, but that's not done yet on our side. Anyway, to do a major upgrade of Rabbit, we did it by just shutting down the old Rabbit cluster and starting a new one. I think if you do small upgrades you can cluster new Rabbits with old Rabbits, but you have to take some specific things into account, and we were doing too big a jump for that, so we decided to just shut everything down. It's quite easy because, like Mohammed said, there is not much data in Rabbit; it's just moving messages from one service to another, so it's done very quickly.

So I have a question. In our experience, upgrading RabbitMQ clusters is always messy, always risky, and that's why we try to avoid it. But considering what you said, that you just shut down the old one and bring up a new one, does that mean that, for example, your compute nodes connect to RabbitMQ through a load balancer rather than directly to the cluster? No, we do connect directly to the cluster, but we have Puppet, which is able to change the configuration of Nova on the computes, so that's what we did. But doesn't that take a lot of time on thousands of nodes? Not that much.

Let's go to the next section, because we already touched on a lot of those questions earlier when Mohammed mentioned benefiting from running everything containerized on Kubernetes. In terms of deployment systems, during the presentations in the first part of the show we saw a range of approaches, from VEXXHOST using LOCI images, OpenStack Helm and Kubernetes, to others rolling their own systems using containers, Puppet, or other tooling. So for those using community-maintained deployment systems like OpenStack Helm or Puppet OpenStack: are you able to rely on the built-in upgrade capabilities in those tools, or do you have to design your own upgrade tooling? Are they only useful for deployment, or can you also use their upgrade button in your large-scale deployments?

I think I can start, given that I'm something of a serial deployment-project user: I've been PTL for Puppet OpenStack, I've been PTL for OpenStack Ansible, I'm currently using OpenStack Helm, and I've also worked a little with Kolla-Ansible. So I've done my fair share of going around, and I think for the most part it depends on the project.
OpenStack Helm and Puppet OpenStack I find and categorize as more of a library: they're kind of like, here's a Helm chart that lets you deploy everything, but there's no give-me-a-cloud button. That's really useful for, I'd say, more advanced operators, someone like CERN or us, who are like: I don't want to write all of this from scratch, but I want to be able to glue it all together with my own requirements. That's something that's really nice in OpenStack Helm and Puppet OpenStack. I see OpenStack Ansible and, let's say, something like TripleO in a different category, a fully managed A-to-Z experience (and, sorry, Kolla-Ansible goes along with those), which is more of a distribution, I would call it. It actually takes you all the way from the IP addresses of systems and goes ahead and deploys everything for you. But it will deploy things the way it likes to deploy them: it gives you the option to tune and change some things, but you're not going to fundamentally change the way it goes about doing it. These projects tend to have upgrade jobs; I know for a fact OpenStack Ansible had upgrade jobs running for a while, and we pretty much had an upgrade script that it ran, the same one that was suggested for deployments. So it would cover, for the most part, a normal upgrade from one version to another. Obviously there are corner cases: running an upgrade on an empty cloud that was installed five minutes ago is different from running an upgrade on a cloud that's been running for eight years. There are maybe going to be timeouts, databases that take longer, that one weird record created by an old version of OpenStack that now breaks a migration and has to be fixed manually. So in theory it should work, but always be ready for some stuff to look into.

Yeah, we prefer to manage all the processes ourselves. For example, with Puppet OpenStack, as you mentioned, that option to synchronize the database automatically was quite scary from the beginning; of course we patched that immediately to turn it off, it was too scary to run. We try to control the whole process; we don't let the orchestration tool do it.

We used OpenStack Chef, which used to be an official OpenStack project but no longer is, and it didn't have an upgrade story, so there wasn't much to do there. But moving forward we're going with Kolla-Ansible, which has an upgrade story, although, as Mohammed Naser mentioned, it's very prescriptive in certain ways, so we'll see how that goes.

Yeah, on our side we mostly use Puppet OpenStack, but more in a library mode, just like you described, Mohammed, and not for upgrades. Yeah, and we are not using anything as far as deployment tools go; we're just building our own containers from upstream code and deploying them with config management, with Puppet, to handle that.

Okay, I think we had a question around Airship, and whether any of you considered using it. Yes: do you ever deploy OpenStack with Airship, and is it really easy to upgrade compared to a more traditional package-based deployment? I can briefly take that. One of the things you can deploy using Airship is OpenStack, using OpenStack Helm, so we're kind of using some of its components. Airship 2 recently came out, which pretty much does a lot
of the things that we already do, so we're looking at how we can adopt it. What's cool is that the developers behind Airship are all running clouds on Airship 1, and they're coming up with an adoption method. What's really cool about Airship is that it allows you to manage the entire life cycle of a region, from hardware, to the applications on top of it, to secret generation, everything; it's a whole stack. So I think it would be really awesome, and it's something we're definitely looking into, especially as we manage multiple deployments and want to do infrastructure as code, which is what we do right now. I don't know if it's much easier to upgrade; I can't comment on that.

Any other comment on Airship, which, for clarification, is an Open Infrastructure Foundation hosted project that helps with deployment of infrastructure? If not, we'll go to the next section. Three weeks ago we mentioned that operating system upgrades are another major pain point, which you all mentioned in your presentations. While OpenStack can be upgraded without loss of service, operating system upgrades still seem to trigger some amount of downtime. Can you explain how you handle those upgrades, what the impact is on your deployment, and whether you have looked into less disruptive ways of doing operating system upgrades, like live kernel patches and other modern systems?

We tend to just do a full, fresh install of the OS any time we do an OS upgrade. All of our control plane components are in containers, so it doesn't really matter which OS is there, and that has saved us a lot of pain: putting those in containers and not worrying about the OS underneath. So it doesn't really factor into our upgrade plan much; we upgrade the OS as needed, one node at a time, and the APIs are usually able to stay up during that time. As far as the computes go, though, because we use NUMA pinning, we have difficulties with live migration, so we do need to migrate VMs off the computes in order to do the OS upgrades on them. That's really the biggest pain point: not having live migration with NUMA pinning right now.

The control plane is quite easy to replace, no big deal. For the computes, like you said, Joshua: we usually run two different operating systems at once, for example Ubuntu 16.04 and Ubuntu 18.04, and we can run the same OpenStack version on both. So if we want to move from 16.04 to 18.04, it's just a matter of live migrating every instance off a compute, reinstalling it (it will run the same OpenStack), and bringing it back into the pool of nodes for the region. It's basically just a matter of live migration. We are really used to this: we have automation around it, we deployed some Mistral workflows to do it, and it's quite easy now.

Arnaud and Joshua, a follow-up question: can you live migrate across OS releases for computes? Is that what you're saying? No; usually when we upgrade OpenStack, we replace the control plane with a completely new one, but the hypervisors are still running the old operating system. We install the new compute agent and Neutron agent on that old operating system, so it's running the new OpenStack release, and then you can do live migration. QEMU supports live migration forward: the QEMU in 18.04 will accept a live migration from the QEMU in 16.04, for example, but the reverse direction
is generally not supported. You can live migrate from a 16.04 machine to an 18.04 machine, but migrating back from 18.04 to 16.04 will probably not be supported. That's something the distros try to help with: making sure they support live migration across different operating system versions, and the same applies to QEMU and libvirt. We solved this by just rebuilding the same QEMU and libvirt versions on both operating systems.

Yeah. Does the same story apply for CentOS? Belmiro, do you have any comments there? Our experience with this was when we moved from Scientific Linux 6 to CentOS 7, some years ago. At that time we were not able to live migrate instances, and we were not able to do an in-place upgrade of the operating system. So at that time we basically started deploying the newly arriving servers on CentOS 7, and the ones that were on Scientific Linux 6 stayed there until end of life. This time, however, we will try to live migrate instances between CentOS 7 and CentOS 8, or CentOS Stream 8 or 9; we haven't been able to test it yet, but that will happen soon. In our case, because the control plane runs on virtual machines, upgrading the control plane operating system is quite easy: we just replace the virtual machine with one running the new OS. The real issue is the compute nodes, so we will try this live migration between the different versions.

I also think live migration has come a long way. When you're talking about migrating from Scientific Linux 6 to 7 versus where we are today, it has become a lot more stable, in my experience. But just to close out: on our side we do similar things to what OVHcloud does, so live migrate, live migrate, and swap the host out from underneath the instances.

Can you also touch on kernel upgrades on the compute nodes? I'm curious to hear how you are doing that as well. So, that was always an issue: to not disrupt VMs, we had compute nodes running for a long, long time without reboots. What we are trying to do now is, again, live migrate, live migrate: we built a tool and we are trying to leverage it to basically go through all the cells, empty each compute node, automatically reboot it, and make it available again to the OpenStack scheduler. That is exactly what we did: we used a Mistral workflow for this. We just empty a compute node, install the new kernel, reboot, and move on to the next one.

I don't know if you use block storage; do your instances boot only from block storage, or do you also have ephemeral? We also have local disk; we have both, actually, and more local disk than block storage. So yeah, it can take time, because you have to move the disk to the other compute. Yeah, it takes a lot of time, and at least in our experience it's sometimes extremely slow; we don't understand the issues yet. You have to enable some flags in libvirt; it depends on which version you have, but some flags in recent libvirt versions help a lot with migration, especially if the instance has a high load, high CPU usage or high RAM usage. We still have some instances, very few, that cannot be live migrated, and for those specific cases we usually contact the client and make an arrangement to shut the instance down during the operation.
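A minimal sketch of the drain step in the kind of workflow Belmiro and Arnaud describe, using openstacksdk rather than Mistral. The cloud name and hostname are hypothetical, and disabling/re-enabling the nova-compute service is left to the surrounding tooling.

    import openstack

    conn = openstack.connect(cloud="prod")  # hypothetical clouds.yaml entry
    host = "compute-0042.example"           # node to drain before reinstall

    # First disable nova-compute on the host (e.g. `openstack compute service
    # set --disable ...`) so the scheduler stops placing new VMs there, then:
    for server in conn.compute.servers(all_projects=True, host=host):
        # 'auto' lets Nova pick block vs. shared-storage migration per instance
        conn.compute.live_migrate_server(server, block_migration="auto")
    # Once the node is empty: reinstall or reboot it, then re-enable the service.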
Okay, well, it's time for us to address direct questions from the audience; I think we have a few lined up. The first one is on the usage of containers: is there someone here who is not using containers, or has everyone already moved to containers? And I guess that leads to a wider question, for those who don't run containers: do you wish you could run them, is something preventing you from running containers, or do you get the exact same flexibility with other packaging systems?

So, at OVHcloud we are not using containers yet. We are moving toward containers, so some of the services are running in containers, but most of them are not. It's not a big deal, in my opinion, because at OVHcloud the main idea is to manipulate dedicated servers, and we have a lot of automation around manipulating dedicated servers. So it's quite easy for us to manipulate a dedicated server; not as easy as a container, of course, but we have a lot of robots installing images, building custom images, stuff like that. So it's quite easy for us to manage everything using dedicated servers, but that's because OVHcloud has this specific machinery.

I think we're still not on containers on the computes. There was some issue we had with the Neutron Linux bridge agent, so we've left that in a Python virtual environment; both the Nova and Neutron agents running on the computes are still in virtual environments and not containerized, but the control plane itself is in containers.

In our case, we are testing. We have an entire region (when I say an entire region it sounds like a lot, but it's a very small region) that was completely deployed using Helm charts, like Mohammed mentioned, even the compute side, and it allows us to test several things. And in the main region we also have, for example, Glance, where half of the traffic goes through a setup that runs on top of containers. So we are testing different ways to deploy and manage an OpenStack deployment based on containers.

We used packages before, but we've moved on to containers, and we containerized existing deployments; moving forward, the new deployments are fully containerized. As I mentioned, we're going with Kolla-Ansible, and one of the reasons we want to do it is that it keeps things isolated. On our controllers we were running multiple services, let's say Nova, Glance, everything runs together, and sometimes the problem isn't even OpenStack: we actually had an issue where some other Python package, installed on the server for security reasons, interacted with another package and caused problems. So that's one of the major drivers of keeping things as isolated as possible.

In our case we're pure Kubernetes all the way; the host barely gets touched, it just runs a kubelet and that's it. Yeah, and to be clear, when we say running on containers, it's running OpenStack itself on containers, deploying OpenStack itself using containers and/or Kubernetes. It's not about VMs versus containers, which is another topic, for workloads: it's OpenStack on containers, not containers on OpenStack or VMs on OpenStack.

Maybe a next question from the chat; I haven't looked at what we have. Feel free to keep asking questions, we still have a few minutes until the end of the show, or anyone else can ask questions among yourselves. Oh, okay: on the topic of removing images and the security side of it, a topic we discussed earlier, I'm wondering where the concern comes from, since neither the user nor the operator can set the UUID of an image without deep intervention? I think it's related to a link between Glance and Nova: if an instance is booted from an image which is then deleted on the Glance side, and Glance reuses the same random ID, it could lead to some issues. But maybe you know more. Actually, the Glance API allows the user, if the policy permits it, to specify the UUID of the image. That was done a long time ago, basically to allow replication between different sites, different regions having different Glances, where they needed to keep the same UUID. So depending on the options you have enabled in your infrastructure, that could be an issue; it's just a matter of policy.
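To make Belmiro's point concrete, here is a hedged sketch of what a caller-supplied image UUID looks like with python-glanceclient, assuming a policy that permits it. The endpoint, credentials and UUID are placeholders.

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from glanceclient import Client

    auth = v3.Password(auth_url="https://keystone.example:5000/v3",
                       username="demo", password="secret",
                       project_name="demo",
                       user_domain_id="default", project_domain_id="default")
    glance = Client("2", session=session.Session(auth=auth))

    # If the original image's database row was purged, nothing records that
    # this UUID was ever used, so a new image can be created under the same ID.
    glance.images.create(id="11111111-2222-3333-4444-555555555555",
                         name="reuploaded-image",
                         disk_format="qcow2", container_format="bare")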
Yes. One last question: have you seen production use cases where OpenStack is managed by Kolla-Ansible? I guess that's one for Mohammed. Well, I know there's one right next to you, Thierry, because Imtiaz, as you mentioned, you're using Kolla-Ansible. Yes, we are using Kolla-Ansible; well, it's not yet in production, we're just working on the latest Victoria release, but we'll be in production soon, later this year. But other than that, I know of a few deployments that use it, and there are probably a lot of deployments that do. I know there are a lot of companies that deploy clouds, especially, I think, in the HPC space, that work a lot with Kolla-Ansible, and a lot of their deployments use it for this sort of thing.

Okay, I think it's time for us to wrap up the show for today. I want to extend my thanks to all of our great speakers; I really appreciate you all joining us and sharing so many insights on your large-scale deployments. Next week on OpenInfra Live we'll have another great episode lined up: Victoria Martinez de la Cruz will be joined by several other members of the community to talk about internships, mentoring, and how to get started contributing to open infrastructure projects. It will be an awesome discussion on the power of mentoring in open source projects, so mark your calendars, and feel free to submit ideas for future episodes at ideas.openinfralive. I hope you will all be able to join us next Thursday at 14:00 UTC for this mentoring episode. Thanks again to Belmiro, Joshua, Arnaud, Mohammed and Imtiaz, and see you all next week on OpenInfra Live.