Hi everyone. Welcome to OpenInfra Live. My name is Kristin Barrientos from the OpenInfra Foundation and I'm super excited to be one of your hosts today. OpenInfra Live is a virtual series hosted by the OpenInfra Foundation showcasing production case studies, open source demos, industry conversations, and the latest updates from the global open infrastructure community. We are live on Thursdays at 1500 UTC, streaming on YouTube, LinkedIn, and Facebook. First, I would like to thank the OpenInfra Foundation platinum, gold, and silver members for supporting the OpenInfra mission and making the show possible. As I mentioned, we're streaming live and we will be answering questions throughout the show, so feel free to drop your questions into the comment section and we will answer as many as we can. Now I will hand it off to my co-host, Thierry Carrez from the OpenInfra Foundation, to provide more details about today's episode.

Thanks, Kristin. Some of the most popular episodes on OpenInfra Live are the large-scale OpenStack shows, where operators of large-scale OpenStack deployments come and discuss operational challenges and solutions. Today, the large-scale OpenStack show is back for an ops deep dive into Ubisoft's OpenStack deployment. Co-hosting with us for today's discussion, we have Arnaud Morin from OVHcloud and Felix Huettner from Schwarz Group. They are joined by our guests from Ubisoft, Benjamin Furman and Stanislav Dimitriev.

Hi, everyone. To kick this off, Benjamin and Stanislav, could you tell us a bit about yourselves and what brought you to this unique job that is providing infrastructure?

Hi, everyone. I'm Benjamin and I've been at Ubisoft since 2019, so more than three years. I joined the OpenStack world back in 2014-2015 in another job, when we wanted to build a new infra. I'm a developer, normally, and back then they said to me: okay, we want to have infrastructure as code; you are a developer, so you can do infra. But I knew nothing about infra and I discovered everything. So I have six or seven years of experience around OpenStack now.

What about you, Stanislav?

Well, I've been around OpenStack for, I'd say, eight or nine years; I don't remember exactly, it started long ago. I changed companies several times and moved to different countries a couple of times along with this OpenStack stuff. I joined Ubisoft quite recently, just two and a half years ago. And yeah, that's it.

Perfect. So let's talk a bit about your OpenStack usage at Ubisoft. How are you offering OpenStack to your users? What types of usage do you have in the company? Could you give us a bit more detail on your setup?

Yeah, we definitely can. Basically, we offer OpenStack as a private cloud. Our customers can use the APIs or the usual web dashboards, and a lot of them use automation like Terraform, Pulumi, and tools like that. Plus, as a bit of a side project (it's not managed by our team, it's managed by another team), there is a platform-as-a-service offering with managed Kubernetes clusters. Those are heavily reliant on our infrastructure.

Do you have a lot of teams working with OpenStack? Is it the whole company, or is it limited to, I don't know, some part of the IT?

Yeah, it can be the whole company actually.
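To make the self-service usage just described concrete, here is a hedged illustration of the kind of call customers can make against the OpenStack APIs, whether by hand or through a Terraform or Pulumi provider under the hood. The flavor, image, and network names are invented placeholders, not Ubisoft's actual values:

```
# Hypothetical self-service request against the private cloud's OpenStack APIs;
# flavor, image, and network names are made up for illustration.
openstack server create \
  --flavor m1.medium \
  --image ubuntu-22.04 \
  --network internal-net \
  build-agent-01
```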
We have some IT departments using OpenStack for their own projects, like logging, monitoring, stuff like that. And there are the different Ubisoft productions that use the cloud directly, you know, with the OpenStack APIs, or indirectly: we have Kubernetes stuff, with Rancher for instance, on top of OpenStack or public cloud. They have different choices depending on their needs. So there is a large number of possibilities, and the overall set of use cases at Ubisoft is very large. Even we are not aware of all of them; we just manage the infrastructure, and they can do whatever kind of stuff they want on top of it.

So does that basically make you the default infrastructure provider for Ubisoft?

We're like an internal infrastructure service provider. We're trying to compete with the public offerings, of course. It's not so easy to do, but we have our pros and cons.

I know that pain. So that's actually very interesting. What is the differentiation? Like you said, your various internal customers can opt into the local private infrastructure, but they could also go outside to any hyperscaler or other offering. What are the pros and the cons of using the internal infrastructure that you are responsible for?

Well, I believe the main pro would be the price. We're just cheaper than the public offerings, cheaper compared to AWS or Google Cloud. And I believe there might also be some security concerns, because some projects are still not okay with using public clouds for highly sensitive data. Some still believe it's better to store that data inside the company, on servers that belong to the company they work for. And for the cons, I guess, mostly the features we propose. We can't compete with all the features proposed by AWS or GKE; that's kind of impossible, actually. But we propose the basic stuff: compute, network, load balancers, and volumes. We cannot compete with AWS in terms of functionality; we don't have such resources, that's obvious, right? But we don't have that target, we don't have that goal. As Benjamin highlighted, we're trying to cover the most common cases, like general infrastructure, and I believe that's one of the most useful cases.

Is there any use case that is sorely missing in your current OpenStack deployment, or even in OpenStack itself, that you would like to have and that your customers are asking for, or that they can find at other public cloud providers but not in your infra?

Functions as a service.

Functions, okay. They are looking for that, okay.

Yeah, at least that's the first thing that comes to mind, because quite often we get this kind of question, and unfortunately we cannot offer anything reliable.

Okay. In terms of usage, is it used for everything, you know, CI/CD, development, building, or is it used mostly for running game backends? What is the type of usage, if you even know what the infrastructure is being used for?

Actually, it can be both. I know that CI/CD for many teams runs inside the cloud, for instance the GitLab stuff and so on. But to be honest, we have, I don't know, hundreds and hundreds of different projects from productions all around the world, so we don't know exactly all the details.
I know that sometimes it's not the whole game or the game server that is hosted directly on our infrastructure, but some part of it, depending on how they conceive the production and how they conceive the infrastructure. And note that the productions don't always have the same processes, the same architecture, the same way of doing things. So it really depends on how the different internal customers of Ubisoft use their stuff. But if you want some examples: like Benjamin said, yes, there are definitely CI/CD pipelines, there are definitely build processes, and some game backends are running as well. And there are a lot of Kubernetes deployments, and of course we have no idea what's running on those Kubernetes clusters; for us, they are just regular instances. As Benjamin said, usually we don't even know. We're the infrastructure provider. As another interesting case: we have a lot of GPUs on our compute infrastructure, and they are used through virtualization, for building games and things like that. We use some on our side as well.

Does that mean your infrastructure is generic and can be used for almost everything in your company? Or do you have some OpenStack regions somewhere that are mostly designed for CI/CD and others designed for, I don't know, game backends or other stuff?

No, it's absolutely generic.

Okay.

Some regions might have more special kinds of flavors, for example more GPUs, some may have less, but by and large it's generic.

Okay, nice. How many regions do you have? It sounds like more than one.

15.

Oh yeah, 15. Okay.

We're going to decommission three of them before the end of the year, so it's going to be 12.

Is it because you are moving your customers to another region, or do you have less consumption?

Well, the thing with those old regions is that they've been using the older architecture, mostly the older network architecture. So it's mostly impossible to bring them to the state that we want and that we use for new regions. So it's been decided to decommission them.

So your upgrade strategy is to destroy and recreate, basically, at least for these regions?

For these regions, yes. But we are trying, and I believe successfully, to adopt another strategy, with a proper life cycle, with upgrades, keeping our regions alive for quite a long time instead of deploying new ones.

Okay. Can you tell us, in aggregate, the volume or the size of your OpenStack deployment, beyond the number of regions: the number of CPUs or the number of servers that you have? Is that something you can communicate?

Yeah, I think so. About the number of CPUs, I don't have the stats per region, but globally we have around 200K cores available for our partners. And I don't want to get the number wrong, but I think it was around 12 or 13K VMs running at the same time. Something we are kind of specific about on our infra: we have a lot of very big VMs, and few very small ones. The majority of the VMs have around 8 or 16 CPU cores, and build VMs, for example, could easily take 32 virtual cores. So we really have a lot of big instances.

So yeah, mostly you deploy big instances.

It depends. I would say the most common, most often used flavor is a medium, or a bit higher than medium. Something like 8 or 16 cores, those are in the highest demand.
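For reference, a flavor of that popular 16-core shape could be defined roughly like this. This is illustrative only; the name, RAM, and disk sizes are assumptions, not Ubisoft's real flavor definitions:

```
# Hypothetical definition of a 16-core flavor of the kind said to be in highest demand.
openstack flavor create --vcpus 16 --ram 65536 --disk 200 g1.xlarge
```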
Do you know from your users how they use or manage these VMs, whether they update them, or whether they rather say: let's throw them away, and if I need to update them, I build a new one?

It absolutely depends on the team, and we have really many different teams with different approaches. Some of them use a classic life cycle, so in the pets-and-cattle paradigm they try to treat the instances as pets, but most of the teams are trying to adopt the cattle approach and just destroy instances and create new ones instead of updating everything.

Okay. Any other questions on the usage?

No, I was going to introduce a "why OpenStack" question. Have you started with something else before? Do you know if Ubisoft used other cloud operating systems before, or even in parallel? Or has it been OpenStack all along?

Yeah, we also have in parallel some VMware stuff for other use cases, but it's way less used compared to OpenStack. And some local studios have VMware stuff too for their own usage, but that's more a local offering; it's not the same usage compared to the data centers.

And for the OpenStack side: I wasn't aware of the whole history of OpenStack at Ubisoft, I only heard it very recently. It started long ago. Yeah, we started in 2010, actually. So we weren't there at all, and the people who started this story mostly don't work with us anymore.

Okay.

It's quite vague how it started, but Ben just told me the release: Diablo was the first to be deployed. Back then, when they tested it, it was more of an R&D process, to see what this cloud thing was and what its status was, and they tried to put it into a kind of alpha production. It was more for workstations than for servers, actually, back in the day. And it stayed like that for several years, with Diablo and Grizzly. But when I arrived three years ago, the old architecture, the one that is now decommissioned, was on Liberty, something like that. It was started, I think, in 2015 or so. That's the moment where they started to be more in production and said: okay, we want to have more servers and to host more and more things on the cloud, because that's the way people want to deploy stuff now; they don't want to deal with bare metal and so on. That's when it started to be a more production-ready cloud compared to the beginnings in 2010, because even OpenStack itself wasn't that stable back then. It's a very, very long story.

So which version did you start deploying massively? Because you mentioned 200,000 CPU cores; it's a pretty large deployment. Which version did you really start ramping up with? Was it very gradual, or was there a moment where you said: well, now this is the thing we will deploy everywhere, let's go with this version?

If I'm not wrong, it started around Kilo, and then maybe Liberty, something like that, when they started to have around two, three, four hundred hypervisors in two main regions. That first main one is the one I'm aware of; there is some older stuff I'm not aware of, and I'm not aware of its scale. I don't have all the details.
And next they moved to another generation, with what used to be the biggest region, with four hundred hypervisors. That was in 2017, something like that. If I'm not mistaken, it was Newton and Queens, depending on the region. Those are actually the three regions I mentioned we'd like to decommission.

Queens based, right?

One Queens and two Newton based.

Okay. And what is the latest OpenStack version you're deploying right now?

In production, it's Train. Most of our regions, the latest new regions, were initially deployed with Stein, but recently, a couple of months ago, we finished a full upgrade from Stein to Train. And we're planning to go further: we actually have a quite ambitious goal to get to a much newer release by the end of this calendar year, maybe the beginning of the next. We definitely want to get as close to master as we can.

So it means you're currently managing Newton, but also Stein and Train.

No, there is no Stein anymore. All the Steins are already Train.

Okay. So you still have three different OpenStack versions in production right now.

Yes. And soon four: we will have a new region coming for a more multi-AZ approach, because right now our regions are single-AZ and we want to have multi-AZ regions. This one will be on Wallaby, because we want specific features for multi-AZ; I don't have them in mind right now. But the idea is for all the regions not already on Wallaby to catch up with the Wallaby version, at least. So the idea is to get rid of the old deployments, at the first stage align everything at the Wallaby level, and then go beyond.

Sounds like an interesting plan. Do you have shared components between these regions that you also need to upgrade?

Keystone. Keystone is shared across all the regions.

Okay. And Keystone is not on Train?

Keystone is not on Train. Actually, Keystone is the only service that we run on Rocky.

I can tell you that this upgrade path is quite easy; we did that a few months ago and it worked extremely well.

Well, we will get there eventually, I'm pretty sure.

So you've been at this for a very long time. Was there a pivotal moment where you became confident that, starting with a given version, you could upgrade more easily? Was there a moment where the stability, or how easy it is to upgrade, was radically different, so that you feel a lot more confident going forward? Or is every new release still a new game of figuring things out?

It's a complicated question, because with some services there are almost never any problems with upgrades: load balancing, Octavia, for example, is usually quite easy to upgrade. But with services like Nova or Neutron there might be problems. So on the one hand we have the stability question: we could stay on an old release forever, maintain it, and be sure it's more or less stable. On the other hand, there are all those bug fixes, all the new features that we want. So there is always an internal fight between maintaining stability and the desire to get new features. We're always trying to maintain the balance, but at the same time we lean towards getting newer things. So yes, there were different moments and a lot of hesitation about how it would go.
But still, at some point we were willing to take some risk and proceed with it. And I believe if it's necessary again, we'll do it again, because we have our strategy for how to do it, and it worked well last time. We will be able to do it any other time we need.

Yeah, just to give a little bit of the upgrade story: in the old days, things were not upgraded, we just deployed new infrastructure, until we did the upgrade from Stein to Train. We did that quite recently, actually, and globally it went quite well. We didn't have a big issue with it. Only one problem.

Only one, yeah. But it was not really directly related to the upgrade, actually.

It was an unhappy coincidence, yes.

Yes. We still love RabbitMQ.

Can you tell us more about this?

Yes, of course. It was the upgrade of one of the biggest regions. And, you know, for the sake of safety, what we did during the upgrade was shut down the external API. Not just shut it down: we redirected API calls to a maintenance page, like, hey guys, we're under maintenance. That reduced the load, and if something went really wrong, it would give us the ability to restore the database from backup, because there would be no changes, no major changes that clients could make through operations.

We were at the point of upgrading Open vSwitch and the Open vSwitch agent on the compute nodes, and we started to get complaints that virtual instances were losing network connectivity. Before that, we had done a restart of the RabbitMQ cluster: a really minor configuration change, really a cosmetic change, nothing important. And of course we verified the status of the RabbitMQ deployment; everything was fine. But we did miss one thing: one of the RabbitMQ nodes had started throwing error messages in the log about crashed queues. Basically, a lot of the Open vSwitch agent related queues had crashed, and after the restart of Open vSwitch the agents weren't able to retrieve information from Neutron and restore the connections. You know, when you completely restart Open vSwitch, it has to recreate all the connections, right? That is done by the Open vSwitch agents synchronizing with the Neutron server, and they rely on the message queue for that. So we got into a situation where a lot of computes just weren't able to recreate those connections, and a lot of virtual instances completely lost network connectivity.

When we figured out what the problem was (and it happened quite fast, it didn't take a lot of time), we just reset the RabbitMQ cluster, and after that it took 10 to 15 minutes to restore everything. But still, it was a really unpleasant moment, because we didn't expect any problem with RabbitMQ without any real load, since we had disabled the API.

It's always like that with RabbitMQ. We never learn the easy way; we always learn the hard way.
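For readers who want the shape of the check that was missed: with classic mirrored queues, the broker can be interrogated between node restarts. A minimal sketch, assuming rabbitmqctl on a cluster node; the grep filter and the 30-second pause (which the speakers describe adopting next) are illustrative:

```
# Hedged sanity checks between RabbitMQ node restarts.
rabbitmqctl cluster_status                              # are all nodes reported as running?
rabbitmqctl list_queues name state | grep -vw running   # surface crashed or stopped queues
rabbitmqctl list_queues name synchronised_slave_pids    # are the mirrors caught up?
sleep 30                                                # let the cluster settle before the next node
```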
What we've done to prevent that on the other deployments, because we still had other regions to upgrade, is simply to wait before each restart, like 30 seconds: just the time for the rest of the cluster to settle and stabilize after a restart. Because the main issue was that the restart of each RabbitMQ node was too quick. We use a tool called Kolla Ansible to manage our deployment, and there is no pause between the RabbitMQ restarts. That's one of the reasons the queues could crash: the restart was too fast, and it has no way to know which queues are synchronized, especially with the HA queues. That's a problem with the mirrored queues.

Yeah, because Kolla Ansible actually has a kind of check on the status of the RabbitMQ node: when RabbitMQ reports it's up, it goes to the next node and restarts it there. But it reports it's up while it usually hasn't synchronized all the mirrored queues yet. And it can happen that one queue declares itself master on two nodes at the same time, and that leads to this mirrored-queue crash. We had a similar issue and we solved it by draining the mirrored queues from the node before shutting it down. There seems to be a rabbitmqctl feature that was built for that, but it was removed; the code is still there, but the command-line function is gone. So we have some custom code to trigger that, and it really helps. Otherwise, quorum queues seem to be a very good thing to move to.

Yeah, clearly. That's totally on our radar, and we want to start looking into it in the next couple of months.

So that's actually a good transition: we have one question from the audience, since we've started to look into the operations side of things. That's a reminder that if you are watching the show live, you can ask questions and we'll try to answer as many as we can. So, a question from the YouTube channel: what deployment strategy and what tools are you using to manage such a large infrastructure? I think you mentioned Kolla Ansible.

Yes. Just to describe a little bit what Kolla Ansible is: Kolla is a project inside OpenStack, a large variety of tools to containerize everything around OpenStack. The idea is to deploy the OpenStack APIs and all the other stuff inside containers. Mainly we're using Docker, but now it can also be Podman and so on. And Kolla Ansible is a subgroup of this Kolla project dedicated to deploying these containers on bare metal directly, using Ansible. So our strategy to maintain the different nodes, install new nodes during operation, and so on, is to use these tools, with Ansible inventories for each region, maintained over time. We also have, of course, our custom patches, and custom roles for our bare metal configuration. We don't use everything from Kolla Ansible, only the part of it dedicated to OpenStack, not the whole system configuration; we have some historic roles and playbooks that we continue to use for our deployments.

I suspect you didn't start with Kolla Ansible, because at the scale of OpenStack history it's relatively recent. Did you play with other types of deployment solutions before settling on Kolla Ansible?

Previously it wasn't Kolla Ansible; it was done directly with Ansible.
Self-written Ansible playbooks. At least here at Ubisoft, I don't think anybody ever used, at least in production, any other solution, like, I don't know, OpenStack-Ansible, something like that. So it was self-written playbooks, and then, when we started to massively deploy new regions with the Stein release, we switched to Kolla Ansible, and that really helped us deploy at big scale. As Benjamin mentioned, we mostly use it as-is, but we have a bit of automation on top of it to manage multi-region deployments, because it's not really...

And the versions too.

Yeah, and the different versions.

So you didn't migrate your old, let's say pre-Kolla-Ansible, environments to Kolla Ansible? Is that still running on your custom-written Ansible playbooks?

Those old regions weren't migrated to Kolla Ansible, so they still have the nice old Ansible playbooks. Well, not for all services: for Keystone and some of the more recent services we use Kolla Ansible even in the old regions. But for the core ones, like Nova, Neutron, Cinder, and Glance, yeah, we still use the old playbooks. But we'll get rid of them soon, I hope.

Any other operational tooling that you're using to help with the management and maintenance of those clusters, or is it just straight Kolla Ansible?

We have our special backdoor deployment: Salt. In case something goes wrong with SSH and Ansible, we always have a way to do some emergency maintenance; every server has a Salt minion installed. We prefer not to touch it, we prefer to just have it for the case when we cannot use Ansible, Kolla Ansible, or SSH for any reason.

So you mean Salt is the secret ingredient?

Absolutely.

Did you ever need to use it?

Not often, though some people still prefer to use it even when it's not necessary. There were a couple of cases where we had problems with SSH access, and I believe there was a firewall misconfiguration once, so it was quite useful. But it's really a plan B, or even a plan C, for when something goes really wrong.

And how big is the team that manages this whole infrastructure?

We are divided into smaller teams depending on the scope; we are not managing all the different services in one single team. But we are around... ten? How many are we? I'd say around 20 people to manage all that.

Are you including all the management?

Yeah, including everyone. Counting only the admins, it's something below 15. I would say 12 or 13.

How many people are on call from time to time?

All of us, actually. All the admins.

So yeah, it's 12 people, right? Okay, it's not that big. That's a pretty small team for this.

How many hypervisors in total do you have? You said one of the biggest regions is around 400, but I imagine not all the regions are.

The biggest region now is almost 500, but it's perhaps around 2,500 computes overall, I believe.

Okay, 2,500, something like that. Is it deployed worldwide? Do you have data centers around the world?

Absolutely. We have some in Asia, some in Europe, some in North America.

And all of them are connecting to the same Keystone, right? Don't you have any latency issues, given that Keystone is located in one place and not in each region?

The database is actually distributed all over the world. The database backing Keystone is a geo-distributed database.

And is that a Galera cluster? So you're running a Galera cluster around the world?

For now, yes.
That's what we want to change quite soon.

Well, I wouldn't have expected that to work.

It actually works. It was worse initially: it was deployed with too many cluster nodes, and then we reduced the number of cluster nodes, I believe to seven. So there are seven nodes and it works quite well. There are some problems with deadlocks that we will fix somehow, I don't know, maybe by moving to another solution, another architecture. Right now it's pretty much tolerable; sometimes it gives us some pain, but it's manageable. It works.

Yeah, okay. We actually have a new question from the audience, more on the storage side: how about Ceph for storage with the current Kolla Ansible? Are you using any kind of distributed storage?

We don't use Ceph. Yeah, we don't use Ceph, unfortunately. For volumes we use PowerFlex, and we use local storage for the VMs. And we also have Swift object storage that is distributed around the world, but there is a separate team in charge of Swift. It used to be deployed and managed by us, but about a year ago we handed it over to a separate team, which is in charge of the Swift deployment now. And for the PowerFlex stuff, I just want to bring some clarity: we use this proprietary solution because initially there was a strong intention to align the storage solution between different things, between us, VMware, and the other enterprise services, so that all of them could use the same storage backend. So we just couldn't go with Ceph: VMware and Microsoft, unfortunately, don't really like those kinds of storage systems.

So in the end, are you the only ones using it?

No, those teams use it as well.

Okay. And is Swift used as a backend for Glance, maybe?

Yes.

Okay. And you said you have local storage for instances on the computes. Is that problematic for live migration and the like, or do you avoid migrating instances?

It's been a huge pain in the ass, to be honest. We improved it; it's better now. Right now we only have problems with really big virtual instances that have terabytes of disk with high I/O; it's still a problem to migrate those. But if you take a general virtual instance with a local volume of 200 or 300 gigabytes, it's absolutely fine to do a block migration. It works fine now.

Yeah, we have pretty much the same issues in our deployment. We are using local storage as you do, and if the disk is big, the problems are bigger. But at least with the latest QEMU versions it goes way better than it used to: with QEMU 2.10 or 2.12 it was awful, but with the 6.x versions it's way better and way more stable.

What is the underlying operating system that you are using? I imagine you rely on it to install the latest QEMU version. Is it from the distribution, or is it compiled in the company?

Well, since we use Kolla Ansible, our libvirt is containerized; libvirt is running inside a container. For the libvirt container we use CentOS Stream 8, but the host operating system, the hypervisor operating system, is still mostly CentOS 7 everywhere; we're planning to migrate to Ubuntu. Kolla Ansible gives us a bit of flexibility to use different operating systems inside the containers to get newer versions.

Is it the same for QEMU? Sorry, QEMU is running inside the Docker container as well?

Yes, yeah, along with libvirt.
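For those curious how VMs can survive a restart of a containerized libvirt, here is a rough sketch of the idea; the flags and image tag are illustrative assumptions, not the exact Kolla Ansible invocation:

```
# Hedged sketch: running libvirt/QEMU in a privileged container that shares the
# host's PID namespace, so QEMU processes are not tied to the container's life.
# The image name is hypothetical.
docker run -d --name nova_libvirt \
  --privileged \
  --pid host \
  --net host \
  -v /run:/run -v /dev:/dev -v /sys/fs/cgroup:/sys/fs/cgroup \
  -v /var/lib/nova:/var/lib/nova \
  kolla/centos-source-nova-libvirt:train
```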
Okay, yeah, there are some parameters you can set when you start the container to get privileged access and so on. And the QEMU processes, for instance, are children of systemd directly. That way, if you restart your container, you are not affected: you don't restart your VMs. But yeah, the idea is to have everything containerized, and the main advantage is that you can have different versions for your services. For instance, we have a more recent version of Octavia on our system: we are currently in the process of moving it to Yoga. Part of the regions are on the Yoga Octavia already, and we're going to finish upgrading Octavia to Yoga everywhere in a couple of weeks. That makes maintenance easier: the different services can have different life cycles for their versions. You can decide, okay, Octavia is way easier to upgrade, so you upgrade it without affecting the other services. You just have to check that the communication between the services, through the APIs, still works well. Normally that's the case, but there can be the rare case where you hit some bug in some version. And with containers you don't have the problem of library version compatibility that you'd have with all the services installed on the bare metal host: we don't care about library compatibility between services, because all the libraries are packed inside the containers.

I wanted to ask a question about downstream changes. Do you maintain patches on top of OpenStack versions that you carry over at every upgrade? Is that something you can contribute back upstream, or do you run a vanilla OpenStack deployment at this stage?

We do have some patches. When it's possible, we always try to contribute to the upstream projects. Usually those are bug fixes that we can manage ourselves, so of course we try to push them to the community, and quite often we get them merged. It's easier this way, because then we don't have to maintain them on our side: once it's merged upstream, it's maintained in the community. But we do have some patches, really specific patches that just won't be merged into the upstream projects, because they cover really specific cases of ours. So we still maintain a set of internal patches, but it's not that big.

So it's sufficiently small that it's not too painful to maintain on your side?

It depends on the kind of patch, actually. When it's related to Nova, Neutron, or the main stuff, if it's a small patch, a small fix, it can be easy to get upstream, because the bug is hit by everyone. Harder, in my opinion, is for example what's related to Kolla Ansible, because we have an opinionated way of using Kolla Ansible for some stuff, and you can't upstream those patches. But if it's a bug fix, it's really easy, because the Kolla Ansible community is usually really responsive and gives a fast and nice response. For example, the last patch we got merged took two days from reporting the bug to getting the fix merged upstream.

Okay. So for managing upgrades: are you dropping the complete version, or are you upgrading using Kolla Ansible? Do you do in-place upgrades, or is it more like Arnaud said earlier, dropping completely and rebuilding?

We touched on the subject already today. Last time we did a complete in-place upgrade; we decided to go with the proper life cycle.
So we upgraded everything from Stein to Train. That was a live upgrade, except that we shut down the external API for a couple of hours during the night, and that's it. We could actually have avoided even that; we did it for the sake of safety. We could have gone without shutting down the API, so we could do it completely live, and maybe next time we will.

It's mostly for upgrading the database, I think, where you need to shut down the API, to avoid writes in the middle of the upgrade.

Yeah, it was mostly for the case where we'd want to restore the database from backup, if something went really wrong. We don't want resources newly created during the upgrade left dangling on the hypervisors, because if we restore the database, those new resources won't be in the old version of the database, and it would be a whole mess to clean up afterwards. That's actually the main reason. But the online upgrades for the databases, I think, are extremely nice. At least what we tried the last few times, for Keystone for instance, works really nicely, without users noticing anything.

What I also know is that for now it's okay, because the initial deployment started from one version. But what we have to be careful about in the coming upgrades is that the more times you upgrade, four or five times, say, the more glitches you can get. If there is an issue in a migration, it may not be visible on the first migration, but the more you upgrade the database, the more you can see glitches in the rows, in the tables. I think it's way better now compared to before, but you should be careful about it.

Definitely better. We faced such issues in our infrastructure: we did a big jump from Newton to Stein, and we had exactly what you described, things in the database, because we had previously upgraded from Juno to Newton. The infrastructure we upgraded from Newton to Stein still had weird things in the database that prevented Stein from running correctly. Because of this, we did a massive checkup on the database: we checked the constraints, whether the data was correctly set, and a lot of stuff like that. We built a tool around making sure the data is correct for Stein. And that's kind of painful.

In the past I did an upgrade from Newton to Rocky, and I remember I had quite similar problems, actually. It wasn't at Ubisoft, it was at another place, but...

So we actually have a new question from the audience, around security updates and Kolla Ansible: how do you handle security updates? Do you maintain those yourselves?

Yeah, actually, it depends on the branch. For example, right now we have a new CVE to take care of; it was just released this week. It's a Nova CVE, so we need to add the fix. And since we're using Train, the fix won't be backported to that branch of Nova, so we'll need to do it manually: we'll add a custom patch to our Kolla build pipeline, so that it gets applied to the newly built containers, and then we can deploy those new containers. But, for example, the multi-AZ project that Benjamin mentioned uses Wallaby, and Wallaby is still fully supported, so you can just take the latest commit from the branch, build the image with it, and deploy a new version of the container using Kolla Ansible.
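A hedged sketch of what such a patch step in a Kolla build pipeline can look like: Kolla images can be customized through a Jinja2 template override. The block name, paths, and patch file below are illustrative assumptions, not Ubisoft's actual pipeline:

```
# template-override.j2 -- hypothetical override applying a local CVE fix to the Nova images.
{% extends parent_template %}

{% block nova_base_footer %}
ADD cve-fix.patch /tmp/cve-fix.patch
# The installed Nova source path varies per base image and release; adjust as needed.
RUN cd /var/lib/kolla/venv/lib/python3.6/site-packages && patch -p1 < /tmp/cve-fix.patch
{% endblock %}
```

Building the patched images would then be something like `kolla-build --template-override template-override.j2 nova`, followed by a Kolla Ansible deploy of the affected containers.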
And to make building the images easier, we use the Kolla tool, which provides all the stuff we need to build the containers easily. We don't take them directly from docker.io, or whichever registry OpenStack uses; we build them ourselves, with a dedicated pipeline for that. And we add our custom patches, our custom configuration, and so on.

Can you share how long that will take, for example for this week's CVE to get rolled out across all your regions?

Well, I'm actually planning to start working on it right after this.

After this live stream, yes.

I'm pretty sure it will take a day to build and test everything in the lab, and then a day to deploy it everywhere.

So it's definitely quite a fast rollout process, one day or two.

It seems to be quite critical, so we're going to give it high priority. Because it is quite fun: just yesterday I was able to reproduce it, and I was able to get quite sensitive data from the hypervisor. So I really want to fix that really fast.

Which brings me to another question for you guys: how do you monitor your infrastructure? Do you have any observability tools that you are using, or what is your way of monitoring the infrastructure?

Yeah, we are using Prometheus to monitor the infrastructure. Basically, we deploy what are called exporters in the Prometheus world. We have the basic one, the node exporter, but we also use the OpenStack exporter to get OpenStack metrics about VM usage, volume usage, and so on. And RabbitMQ and HAProxy now bundle a Prometheus exporter directly inside the program, so you just have to activate it in a configuration file and you can get it up and ready. For the Prometheus infrastructure itself, we have a dedicated team at Ubisoft that provides Prometheus clusters easily. We have one Prometheus cluster per region, and they use a solution called VictoriaMetrics, an alternative to tools like Thanos, which can handle a very big infrastructure with all the metrics from all over the world. They host their platform on our platform, so we could have a chicken-and-egg problem there; but we place a monitoring cluster in each of the different regions, so if we lose one region, we don't lose monitoring for the others. Because we have a lot of regions, we can distribute it this way around the world. Anyway, the idea is to have a highly redundant backend for the metrics, so if you lose one region or one node, you still have access to the metrics.

Okay, nice. Then I'll continue a little bit with the scaling topics. I think you already shared a story about RabbitMQ. I would guess you have other scaling stories, maybe ones that are not about RabbitMQ; everybody laughs at those.

All right, RabbitMQ is quite a tricky one. Actually, just a couple more words about RabbitMQ: in every region, we took three empty compute nodes and moved RabbitMQ there completely, to resolve the problem. So it now runs on dedicated servers and doesn't give us any more pain.

Are you then actually running one RabbitMQ cluster for the entire region, or one RabbitMQ per service?

We're running one RabbitMQ cluster for the entire region. Yeah.
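On the bundled exporters mentioned above, enabling them is roughly the following; the port numbers and the config fragment reflect the upstream defaults, not necessarily Ubisoft's exact setup:

```
# RabbitMQ 3.8+ ships a built-in Prometheus plugin (metrics on :15692/metrics by default):
rabbitmq-plugins enable rabbitmq_prometheus

# HAProxy 2.x can expose metrics from a frontend; haproxy.cfg fragment:
#   frontend prometheus
#     bind *:8404
#     http-request use-service prometheus-exporter if { path /metrics }
```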
We've actually been thinking about going in the direction of dividing it per service, but right now we don't see much need for it; it pretty much handles our load. But we keep it in mind.

Just as a hint, if you ever do that, take a look at the Large Scale SIG docs, because if you don't set some specific options, you will have very much fun.

I'm pretty sure we will have a lot of fun in any case. We've already scraped different configurations from the SIG's wiki pages, with the different tweaks and performance tunings; that's already in place in our infra. It's just that at some point, when you get to 300 nodes, if you follow the classical Kolla Ansible deployment with everything on the same nodes, the APIs, MariaDB, and RabbitMQ, then when you start to have load peaks, it gets tricky for RabbitMQ. And the one that starts to cry when things are not going well is always RabbitMQ; it doesn't like any load on its nodes. That's why we moved it to separate nodes first. And, just to add, for the new multi-AZ region, to ensure redundancy and to split the load, we now have one node for the APIs, one node for RabbitMQ, and one node for MariaDB in each zone. So we don't have this kind of issue anymore.

Okay. So RabbitMQ is our preferred topic for scaling issues, but there is another one that usually comes up pretty fast after it. I would have thought you'd have I/O constraints in your environment due to the type of workloads you're running. So what can you tell us about the issues you encountered scaling Neutron up to the levels you're running right now?

That's a good question. Well, we don't have as many problems as you might expect, and I think one of the reasons, as we already mentioned, is that we have a lot of quite big virtual machines, which reduces the overall number of them, so it's not so loaded. But we use an old-school approach, with dedicated network nodes and legacy L3 agents, so we have this obvious bottleneck: the network nodes. And we do have some problems; not even scaling problems, but performance problems with Open vSwitch. It doesn't give us the level of performance we would like to get, especially on the 25G nodes, because all of our new nodes are equipped with 25G network cards, and right now Open vSwitch doesn't show us what we would like to see.

I suspect you are running the Open vSwitch driver with VXLAN tunneling or something like that?

Yep. Exactly, yeah.

Have you tried OVN?

Only as a personal project in the lab. And for the multi-AZ one, the architects tried OVN, but we had some concerns, because we are using BGP inside Neutron to advertise the private self-service networks.

And floating IPs.

And floating IPs. And when we tried it, one or two years ago, OVN wasn't ready for the BGP stuff. So that's why, for the multi-AZ region...

There is now a project for that, which works quite similarly to the Neutron BGP agent, but for OVN.

I didn't get your question.

There is now a project that implements the BGP thing in a, let's say, similar way for OVN. So if you want to try again, you might want to check it out.

We'll definitely check it out. But for this new region, it's already been decided to move on with DVR, distributed virtual routers.

So, we're unfortunately at the end of our allocated time. Thank you all for this discussion.
It was really very interesting to have a deeper look into how you actually operate those clouds. And we learned a lot of interesting things around the number of people it takes to operate this kind of environment, which is kind of crazy when you think about how large it is versus how few people you manage to run it with. So I'll pass it back to Kristin so that we can close this episode. But I would like to thank you again for joining today and having this wonderful conversation about your large-scale OpenStack deployment.

Thank you. Thank you for the invite.

Thanks, Thierry. I want to thank all of our awesome speakers today. I appreciate you all for joining us, and the audience for asking some really great questions. If you like talks like this one, make sure you register to attend the OpenInfra Summit in Vancouver, June 13th through the 15th. Also, tune in on February 16th for a very informative episode from the OpenStack community titled "NVIDIA GPU Management by OpenStack Nova and Cyborg". Remember that if you have any ideas for future episodes, we want to hear from you; make sure you submit your ideas at openinfra.live, and maybe we'll see you on a future show. So mark your calendars; I hope you're all able to join us on February 16th at 1500 UTC. Thanks again to today's guests, and we hope to see you all on the next OpenInfra Live. Bye.