Thanks for joining us and welcome to OpenInfra Live, the Open Infrastructure Foundation's weekly, hour-long interactive show, sharing production case studies, open source demos, industry conversations, and the latest updates from the global Open Infrastructure community. We are live here every Thursday at 1400 UTC, streaming on YouTube, LinkedIn, and Facebook. My name is Kendall Nelson from the Open Infrastructure Foundation, and I'll be your host for today. First, I would like to thank the Open Infrastructure Foundation Platinum, Gold, and Silver members for supporting the foundation and making this show possible. Like I mentioned, we're streaming live and we'll also be answering questions throughout the show, so feel free to drop questions into the comments section on whichever service you happen to be watching us on today. And throughout the show, we will try to answer as many as we can. Today's episode is hosted by Belmiro from CERN, Mohammed from Vexxhost, and Jean from LINE, and they will also be joined by some lovely folks from OVHcloud, Arnaud and Xavier. So, without further ado, let's get started. I'll hand it off to you, Belmiro. Thank you, Kendall. So, hello, everyone. Welcome to the OpenInfra Live large-scale episode. Today we are trying something new, the Ops deep dive. Today and during the next large-scale episodes, we will talk with operators from large-scale deployments and learn everything about the challenges that they face every day, their war stories, and much, much more. So, I'm Belmiro Moreira. I'm a cloud architect at CERN, the European Organization for Nuclear Research, and I will be one of the hosts for today's show. With me, I have Mohammed Naser from Vexxhost, a well-known member of the OpenStack community. Hello, Mohammed. And also Jean, a cloud operator at LINE, which recently joined the OpenStack Million Core Club. Hello, Jean. Hi. I'm really excited for today's episode. It's our first episode in this new series. 
And our first guests for this new series of large-scale episodes are Arnaud Morin and Xavier Nicole from OVHcloud. Welcome both. It's great to have you here. Hello. Before we start, I would like to remind everyone who is watching that you can also participate with questions using the chat. We'll do our best to bring your questions to the show. So, without further ado, let's start drilling our guests with questions. I will take the liberty to start. So, OpenStack now is more than 10 years old. Definitely, we have all been involved with OpenStack for a long, long time. Do you still remember when you started with OpenStack? Yes. I can start if you want, Xavier. Go ahead. We started back in 2012 with a product based on Swift. So, we started with OpenStack Swift and we deployed a product named hubiC. And based on that, we decided to move forward and deploy our first lab for the compute, for Nova, Neutron, et cetera. That was called RunAbove at the time. And two years later, so in 2014, we started our first public cloud region. And we have deployed a lot of regions since. So, yeah, it's been quite a long time now. Eight years for public cloud. Do you still remember the release when you started? Yeah, on RunAbove I don't remember exactly, but something like Grizzly. And our first public cloud was based on Juno. Okay. Yeah. Speaking of release names, I think something that I recently learned was, like, the second ever OpenStack release, which is spelled B-E-X-A-R: for the longest time I would pronounce it as Bexar, but it turns out that it's pronounced as Bear. I mean, that's a little tidbit of history, but it's really interesting that you guys have been rolling from such a long time ago. But, so like, what's the- Not Bexar. Yeah, not Bexar. 
But I'm curious, what's the oldest release of OpenStack cloud that is still, like, the original deployment that has been upgraded many, many, many times over and over? So the first region we started burned, because it was in Strasbourg, but we still have a few old regions. For example, we have one in France which started on Juno. We upgraded it to Newton. So we did a fast forward upgrade: we skipped several versions and went from Juno to Newton. And then we are going to upgrade it to Stein later. No, okay. In a few weeks. So, yeah, it's still running. It's still running, but we upgraded them, yeah. And we will keep doing that, upgrading after that. I mean, we will try to accelerate the upgrade process over time after the migration to Stein. That's the goal. Yeah. Yeah, because Stein is definitely already end of life and it's quite an old release we are pushing into production, but it's based on Stein because we have our custom plugins, custom development, custom things. So it's not pure Stein. It's Stein plus extra stuff, which we pick a lot from upstream as well. So we try to keep up with more recent releases where we can. It's like a Frankenstein release. Actually, it's not pure Stein. Yeah. And at the end, it depends on the module, because we have many other pieces that are on a higher version than Stein. Basically, we say we are running on Stein because Neutron and Nova are running on Stein, but for the other modules, it can be something different. Yeah, actually that is curious. So this means that you have different OpenStack projects on different releases, right? Absolutely. We have a similar model actually at CERN. So we have projects like Keystone or Glance that are running really recent releases, very close to upstream. And then Nova and Neutron are on Stein as well. Yeah, those are the hardest ones to upgrade. Yeah, I personally think Nova is actually a lot easier to upgrade. 
I think Neutron is the one that, since it involves the data plane and whatnot, is usually a bit more, like, hairy to upgrade in general. I see Jean kind of nodding. I don't know if Jean feels the same about that as well. Well, that depends on the driver that you are using, right? Yeah, probably. Yeah. In our case, we are using Linux Bridge, which is quite transparent. The most difficult part is that we try to keep Neutron and Nova running the same version, because there is a lot of communication between them. Then we have the Neutron agents running on the compute nodes. So we try to have them on the same version. Yeah, that's exactly it. Neutron and Nova are very tied together, so we have to keep them running pretty much the same version. On our side, we use a custom Neutron plugin based on the Open vSwitch plugin. So we are really confident when we need to upgrade Neutron, because we know how it works. We know how to manage it, and it's not a big deal actually. The other thing, from my point of view, is the database itself. When we need to upgrade the database and everything related, it's more complex. Okay, really? That is interesting. So you mean upgrading the MySQL version? No, not even that. The schema of the database. Yeah, the schema of the database. Migrations. Between online migrations, offline migrations, things that will be broken because we introduced custom changes in the database, you have to test it a lot. You have to take into account that every environment could be different. Yeah, the things OpenStack and Nova expect are hard to keep consistent between different releases. So yeah, there's a lot of stuff behind the database itself. Actually, you mentioned a very interesting thing. You mentioned the online migrations. Do you use them? Because I really try to avoid them. I really try to have offline migrations only, because they are predictable and consistent. We do only offline migrations, but we execute the online migrations during the offline upgrade. 
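For reference, the pattern described here, running the "online" data migrations to completion inside the offline maintenance window, maps onto Nova's standard management commands. A hedged sketch of the sequence (exact flags and available commands vary by release):

```
# Illustrative offline upgrade sequence for Nova, run during the outage window.
nova-manage api_db sync                  # migrate the API database schema
nova-manage db sync                      # migrate the cell database schema
nova-manage db online_data_migrations    # run the "online" data migrations to completion now
```

Running `online_data_migrations` to completion while the services are down gives the consistency the speakers describe: no rows are left to be migrated lazily as users touch them later.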
We start with small, yeah. Yeah, so that's maybe not what we'd call online migration. It's basically offline, it's fully offline. And that makes sense for big deployments, because you want to be consistent, right? You want to make sure that you did all the migrations at this point, and you don't want to wait for the users to do some operations and things being migrated over time and then you never know the state, right? Yeah, exactly. We want consistency. Yeah. Well, speaking of orchestration, I'm curious how you orchestrate the deployments, how you execute upgrades and all of that, you know? Is that all playbooks? Do you just kind of pop in and change some stuff and let the CI kind of run through it? Or runbooks? So it depends on which part we talk about, but if we talk about the database itself, we have an Ansible playbook which is doing most of the stuff, and at the end the playbook is also using Docker containers to start every OpenStack version in a different container, doing the migration of the database, and at the end the playbook is bringing back the MySQL server. So it's fully automated. Yeah, it works pretty much straightforward, but it takes some time, of course. Well, it's interesting, using Docker to start up a version, or I guess you're just using it to execute the database migrations for the intermediate releases. So sometimes we've done upgrades where we jump releases, and I think what we've kind of done is we take it service by service and we look at the database migrations. A lot of times it's, like, only every four or five or six releases that the project decides to kind of merge all the existing migrations into one big migration and start from scratch. 
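A playbook step like the one described, running each intermediate release's schema migration from a versioned container before bringing MySQL back, might be sketched like this (the image registry, image names, and release list are hypothetical, not OVH's actual playbook):

```yaml
# Illustrative Ansible tasks: walk the DB schema through each intermediate
# release by running that release's `db sync` from a per-release container.
- name: Run schema migrations release by release
  command: >
    docker run --rm --network host
    registry.example.com/nova:{{ item }}
    nova-manage db sync
  loop:
    - queens
    - rocky
    - stein
```

The appeal of this design is that each container carries exactly the code and migration scripts of its release, so a fast-forward upgrade never mixes migration logic from different versions.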
So we've actually done some of these upgrades where we just run the database migration on the latest release and it just kind of goes from the original migration to the newest one and jumps all the releases in one go, but that's for the simpler services like Keystone or whatever. And I probably wouldn't do that with, like, Nova or Neutron, to be honest. You know what I like? I like the services, the projects that don't do any DB work and keep the same database schema over time. Those are the best. Yeah, definitely. Keystone is great. Glance is a very good example. Even with the little changes that they do, we are able to do upgrades without any downtime for our users, which is great. Glance and Cinder are quite easy as well. Neutron databases are not so easy, but Nova is the worst for sure. Yeah. But one of the challenges I find with Nova is the fact that it actually also uses multiple databases. And so sometimes you're trying to do these operations and you're kind of, like, trying to match up your API database with your cell database, and it's just a bit of a pickle sometimes. Wait until you use 40 cells. It's great. It will be even harder. And there is a lot of data in the Nova database, which makes the upgrade even more complex. It's not the case for Neutron, for example, because in Neutron the data is not kept. It's erased. Right, it's the soft delete, right? Yeah. Are you guys always pruning the databases? What do you mean? For example, soft deletes, are you removing them frequently? Yeah, we do. We developed actually a tool named OSArchiver, which is open source, available on GitHub. And it's basically the tool we use to delete, not actually delete, but move the data out of the running database, out of the production database, in order to keep it as small as possible but still working. And we do that every day. It's running in a cron job every day. 
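The "move, don't delete" idea behind a tool like OSArchiver can be shown with a tiny self-contained sketch: rows whose soft-delete marker is set and whose deletion timestamp is older than a retention window are copied into a shadow table and removed from the production table. The schema and retention period here are illustrative, not Nova's or OSArchiver's exact ones:

```python
import sqlite3
from datetime import datetime, timedelta

# Keep soft-deleted rows for two weeks (for debugging), then archive them.
RETENTION = timedelta(days=14)

def archive_deleted(conn: sqlite3.Connection, now: datetime) -> int:
    """Move soft-deleted rows older than the retention window to a shadow table."""
    cutoff = (now - RETENTION).isoformat(" ")
    cur = conn.execute(
        "INSERT INTO shadow_instances SELECT * FROM instances "
        "WHERE deleted != 0 AND deleted_at < ?", (cutoff,))
    conn.execute(
        "DELETE FROM instances WHERE deleted != 0 AND deleted_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # number of rows archived

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instances (uuid TEXT, deleted INT, deleted_at TEXT)")
    conn.execute("CREATE TABLE shadow_instances (uuid TEXT, deleted INT, deleted_at TEXT)")
    conn.executemany("INSERT INTO instances VALUES (?, ?, ?)", [
        ("vm-1", 0, None),                   # live instance: never archived
        ("vm-2", 1, "2022-01-25 00:00:00"),  # deleted 6 days ago: kept for debugging
        ("vm-3", 1, "2022-01-01 00:00:00"),  # deleted 30 days ago: archived
    ])
    print(archive_deleted(conn, datetime(2022, 1, 31)))
```

Run nightly from cron, this keeps the working set small while the shadow table preserves the audit trail the speakers mention. Nova ships a comparable built-in in `nova-manage db archive_deleted_rows`.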
So with that, we keep the Nova database, but the other databases as well, at quite a reasonable size. Maybe you want to add something about this? No, the idea here is to maintain the database with only the data we are actually using, that's for sure, and not waste memory or storage on data that is obviously not interesting, and to speed up all the APIs. The goal is the speed at which we answer the customers. We still need to keep track of all of it, I mean, we need to have an archive of this. That's why we are archiving everything. But apart from that, it's just a process of cleaning the database that should be done anyway by the projects, but it's not done today. So we have developed this tool for that. Yeah. So actually, I have mixed feelings about the soft delete, because I actually find it very useful sometimes. So we also trim our database frequently, the Nova one, but we keep, like, a two-week period where we keep all the entries, because sometimes logs are not enough. So then you can go back into the database to try to debug some issues. And that is quite useful. And in projects like Neutron or Placement, we don't have that possibility. So we can only rely on logs, which sometimes is a little bit tricky. Yeah, totally true. Yeah. I think for us, and I would probably put OVH in the same category, it's like, when you're in the public cloud space, there's a lot of compliance and liability stuff. And you absolutely need to know what was launched at any given moment, so you can go back. And I guess if you're operating more of a private cloud, there's probably not that much of an interest in, like, someone knocking at your door many months down the line saying, I need to know all the VMs that this person launched and how long they used them for and all this other stuff. So I think it's one of those things, it's another use case for archiving things instead of just deleting them. 
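When soft-deleted rows are kept for a debugging window as described, looking them up is a simple query; Nova's tables mark deleted rows with a non-zero `deleted` column and a `deleted_at` timestamp (columns shown are real, the 14-day window is just the example retention from the conversation):

```sql
-- Find instances deleted in the last two weeks, e.g. for post-mortem debugging.
SELECT uuid, host, deleted_at
FROM instances
WHERE deleted != 0
  AND deleted_at > NOW() - INTERVAL 14 DAY;
```

This is exactly the capability that is lost in projects without soft delete, such as Placement, where only the service logs remain.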
It doesn't mean that you want to keep it in your running database all the time, because that would be a huge pain also. But well, speaking of databases, I guess we'd be safe to assume you guys are also running Galera for the database. Yeah, yes, we do. And do you front-end it with, like, HAProxy? Have you tried things like ProxySQL in front of it? We've tried many things. Yeah. Let's say that right now we are using Galera, but we're not using Galera for what it's supposed to be doing. We're just using Galera because it helps us maintain a basic master-slave replication in a way that we can operate the best. I mean, the simplest way, let's say, because we all know that maintaining replication in the MySQL world is okay when it works, but when we have to recover from an incident, it can be a pain in the ass. So yeah, that's why we're using Galera, because it's easier for us to operate, but it's too much for what we use it for, and it's not working for what it's supposed to do. We cannot have a multi-master cluster that actually runs the right way. So we are trying many other things. We are basically right now testing MaxScale to see how we can manage that better. We are trying to put this in Kubernetes too. We're trying many, many things to operate the different regions the best way and to avoid any downtime, and so far we are running okay. We have no downtime in the database in all the regions we have, but yeah, we want to improve operations around the database. That's very true. So I'm really curious about this, because you said that you don't have downtime in your different regions, right? But is that because your failover actually works pretty well, or because you actually don't have issues in the database? No, the failover is working. We are often recovering nodes in the cluster, but we're not introducing any downtime. That's the situation. 
We use HAProxy in front of the Galera cluster with a specific backend configuration, so that every MySQL request that needs to write something will be done on one specific node, and reads will be done across all the other nodes. And with this, it works pretty much correctly. Maybe Xavier can correct me, but I think this is close to the perfect configuration, a better configuration than what we had before. That's interesting. I didn't know that HAProxy could look at the actual queries. No, it's not that. We have different endpoints. That's the way we made it. Yeah. And we use, you know, in Nova, you have this slave connection parameter that you can use. So we use that. It's not always working correctly, but most of the time it redirects the read requests to the correct place. I think something in our case that we started deploying was ProxySQL, which is, I believe, an open source project. And what's really nice about it is it fully operates on the application layer. So it actually understands the MySQL protocol. Kind of think of it the same way you have an HTTP load balancer: it gets the request and dispatches it. It does the same thing. And what's really nice about it is it has native Galera integration. So it can actually talk to all the Galera nodes and know what their status is exactly. Like, if one of them is desynced because it's a donor node, it'll automatically reroute queries and traffic. With HAProxy, one of the things that I kind of struggle with is that it'll just drop the TCP connection and hope that the other party reconnects. Whereas with this, because it's running on the application layer, it doesn't drop it. It realizes that the endpoint is not responding, so it takes the query and redispatches it to another node, for example. So it's really neat how it just transparently does all of that. 
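The "different endpoints" split described above, one write endpoint pinned to a single Galera node and one read endpoint balanced across the rest, usually looks something like this in `haproxy.cfg` (host names and addresses are hypothetical):

```
# haproxy.cfg sketch: all writes go to one Galera node (others are backups),
# reads are balanced across the remaining nodes, avoiding multi-master conflicts.
listen galera_write
    bind 127.0.0.1:3306
    mode tcp
    option mysql-check user haproxy
    server db1 10.0.0.1:3306 check
    server db2 10.0.0.2:3306 check backup
    server db3 10.0.0.3:3306 check backup

listen galera_read
    bind 127.0.0.1:3307
    mode tcp
    balance roundrobin
    option mysql-check user haproxy
    server db2 10.0.0.2:3306 check
    server db3 10.0.0.3:3306 check
```

Nova's `[database] slave_connection` option can then point at the read endpoint (port 3307 here) while the regular `connection` points at the write endpoint, which is the mechanism the speaker refers to.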
And another kind of nice feature about it is, well, yeah, the failover stuff. And yeah, so we've been trying it and it's working very nicely. Oh, and also it has rules where you can say any query that matches, like, a select, send it to the non-master nodes, and anything that matches an update, send it to the master one. So it's a pretty interesting thing, and we've kind of been playing with it and it's been working pretty nicely. Let's say we have that in our labs too. We're not super happy with the results right now, because we're trying to integrate this with Kubernetes at the same time. Not super. And we have this on one side; on the other side we are trying MaxScale; and the third option would be staying with HAProxy but with other high-availability mechanisms provided by Kubernetes. Those are the three ways we're looking at it right now. So I have a little bit of a two-part question. You mentioned earlier a little bit about using a variety of open source projects, not just OpenStack: you talked about Kubernetes and your own in-house project that you rolled out as an open source project. I'm actually interested in what other open source technologies you use and what sort of upstream involvement you have in any of them. I'm really interested to hear from LINE and OVH in particular. Jean, if you wanna go first. Yeah, so for the OpenStack part, actually one of the projects that we are contributing to is oslo.metrics, which we use to monitor our database operations and the messaging operations in our OpenStack cluster. As a lot of people have issues with RabbitMQ, we are trying to be notified about messaging issues as soon as possible when they happen. And we also contribute to the Large Scale SIG by adding some common documents on how we do our large-scale deployment, so other OpenStack operators are able to know what to expect when they're trying to scale out their clusters. Xavier, I don't know. 
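The select-versus-update routing rules mentioned earlier for ProxySQL are configured through its admin interface; a rough sketch (hostgroup numbers are hypothetical, with 0 as the writer and 1 as the readers):

```sql
-- ProxySQL admin interface sketch: route SELECTs to the reader hostgroup,
-- but keep SELECT ... FOR UPDATE (a write lock) on the writer; everything
-- not matched by a rule falls through to the writer's default hostgroup.
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FOR UPDATE', 0, 1),
       (2, 1, '^SELECT', 1, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```

Because ProxySQL parses the MySQL protocol itself, these rules apply per query rather than per connection, which is what makes the transparent redispatching described above possible.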
On our side, you know, well, of course there are different open source projects that we're using. For instance, our block storage is running on Ceph. We are trying to introduce more and more Kubernetes orchestration in all our infrastructure every single day. And it's not only on the public cloud side, it's global to OVH today. And we're using Harbor too, as another open source technology. And, I mean, for the last two, three years, we've been observing our platform with a Prometheus and Thanos infrastructure that is worldwide and very scalable, because of course we're facing the scale issue of an infrastructure that collects billions of metrics every single minute. So it was a massive challenge, but we're very precise on what we measure right now. And it's extremely powerful for increasing and observing the quality of the service that we deliver to customers. And it helps us target the right way to improve our service. It was really, really interesting to do that work. So we have this running now, and there are many other technologies; you know, we have acquired a block storage company, Exten, and we're gonna open source this technology as soon as it's ready, this kind of stuff. And in the last months, our founder, Octave Klaba, and our CTO, Thierry Souche, have announced that we are developing a full open source cloud stack that we are starting to work on. So the community is super excited about it. We are super excited about it. And specifically, I know myself and other members of the public cloud team within OVH are going to come in a few weeks or months with some announcements around that, but there's a real, deep strategy about what Octave, our founder, wants to do. It's really in the DNA of the company to share what we're doing. 
And we always believe that we are smarter when we are a group of people from different industries working on the same thing. And we would not be here if we all didn't think that, I guess. So yeah, maybe over the last years the involvement of OVH in the community was not good enough, but Octave always told us that as soon as we go on the public market and we can raise enough money to do that properly, we would do it. And that's what happened: in October we introduced OVH on the public market in Europe, on Euronext. And a few weeks later, or even a few days later, the announcements were made to say, okay, we are going to develop this open cloud stack, an open source stack for cloud computing engineering. It's not only about OpenStack; OpenStack cannot answer 100% of the needs of a hyperscaler like us. But of course there will be blocks of software that are gonna come from the stack. Obviously we will not reinvent everything; we just want to improve the global picture, and we're gonna invest a lot into that. And we want to onboard many other industries in this project, that's our goal. Awesome, it's very exciting. It is, actually. Yeah, it sounds like you're developing another cloud platform. So, I would like to ask, as one pain point in our upgrades is that we have a lot of patches to OpenStack itself, to provide some features that we need internally or to work with some of our internal systems. Do you have this kind of modification in your cloud? Yes, yes we have, of course. And that's painful. What we try to do is push the patch upstream as well, when it's possible. Sometimes it's not, because we are running an old release and it's done differently upstream, or it's not even relevant upstream anymore. Sometimes it's relevant, and sometimes we push something upstream which is not what we run but which is going to be what we will run in a few months or years. 
And when we can't, when it's pure internal stuff, we try to decorrelate the code we have from the upstream code by, you know, not writing directly in the Neutron code but having a specific external plugin which can be plugged onto a different Neutron version, for example. We try to do that. We also have this strategy where we rebase our commits on top of upstream each time we pull from upstream. So we don't hide our commits in the git history. And with mechanisms like that, we can be precise about what we need and what can be removed from our specific code because it's already fixed upstream or because it's not needed anymore. So this is the strategy. It's not easy, of course, but if you maybe have a good answer for that, a good strategy, another strategy, we would like to embrace it. Yeah, we would like to do it more, and what I can say from our perspective is that sometimes we think that our needs, or the patch we would like to introduce, or the feature we would like to introduce, do not correspond to where the community would like to go, because we have specific needs as a hyperscaler and a public cloud, which is another part of it. So yeah, sometimes it's a bit complicated to maintain this, because at the end OpenStack is not only made for doing large-scale public clouds. So we are facing many challenges on that. And yeah, that's maybe the main pain point about going upstream. I can totally relate to that. Yeah, we suffer exactly the same. I think the other strategy is what Mohammed does: he's always running the latest release, right? We're working on that. Yeah, and we are still on Stein. Being involved upstream more and more, it's nice. It's nice, but it's not always easy. It depends on how your organization is working. And at the end, upgrading an infrastructure that has hundreds of thousands of virtual machines and hosts is not something you can do every six months. 
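The rebase strategy described, keeping local commits replayed on top of each upstream pull so nothing hides in merge history, boils down to a short git workflow (remote and branch names here are hypothetical):

```
# Keep downstream patches as a clean stack on top of upstream.
git fetch upstream
git rebase upstream/stable/stein local-patches
# Conflicts during the rebase immediately surface which local patches
# upstream has since fixed or made obsolete, so they can be dropped.
```

The benefit is exactly what the speaker notes: at any moment, `git log upstream/stable/stein..local-patches` shows precisely the delta being carried.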
So basically, even if we try, I mean, and that's why we decorrelate the versions of the modules. For instance, if I can give a quick example, right now we are deploying a new Keystone infrastructure and it's running on Xena. So, you know, we are okay there, but our Octavia infrastructure is running on Victoria. It's gonna run on Xena very soon, but so far it's not. And at the end, we come back to the core: upgrading Nova and Neutron is complicated and cannot be done every six months. Impossible. Yeah. So we are already in the middle of the episode and we didn't talk about OVH, right? We assumed that everyone knew what OVH is. When we think about public clouds, at least me, I usually think about the three big public clouds, but the truth is, there are many others, and OVH is also one of the biggest public clouds, at least in Europe, but you guys have a presence around the world, right? Yeah. Can you tell us more about OVH? Sure, of course. First things first, OVH is a 20-year-old company. So we have existed for a very long time. It was founded in France by Octave Klaba in 1999. I will not go through 20 years of evolution of OVH, but I can simplify it into, let's say, cycles of five years. First, the startup, the very beginning in 1999. Octave in the garage, being very, very innovative, introducing from the very beginning a water cooling system in his servers to be able to have the best operating costs around that. That was the very first thing. Then in 2004, after exploding in France, literally, we opened offices all over Europe: in Poland, in Spain, Germany, Italy, Portugal, I don't remember all of them, but yeah, the next five years were all over Europe. That's what we did. We opened other data centers in France. In the next cycle, the third cycle of five years, we diversified the offer, basically, because at the very beginning we basically offered web hosting and dedicated servers. 
That's what was happening. We've always been a cloud company, before the word existed, because basically you were able to order your servers through an API, even if it was not branded or marketed that way. So after that, 2010 to 2015, introduction of object storage, the public cloud, the private cloud; we had a strong partnership with VMware at that time, still running now on the private cloud part. So we did all this, and we started our business here in Montreal. We entered the North American market back in 2011. We opened what is still one of the biggest data centers in North America, a few kilometers from Montreal, in Beauharnois. And yeah, we kept developing that way. Then, in the next five years, we entered the US market. We acquired some companies. We raised a lot of money with other investors coming in, and we arrived where we are today. So after that, we are in the years where we actually have our dark fiber loop all over the planet. We can connect the fibers together. The backbone is worldwide. We have today, if I'm not wrong, about 33 data centers all over the planet. On the public cloud part, we are running about 900,000 cores, if I'm correct. We're gonna reach the one million cores. We're gonna reach the one million club this year. I mean, I can tell you that. Thank you. We're not yet there, but yeah. I mean, today the growth of OVH is, in an important part, driven by the public cloud. So yeah, we are growing fast. And yeah, we always want to be different from the others. That's the main personal reason why I joined OVH; I joined three years ago, so I'm kind of a junior here, but it was to be able to have an open platform for our customers. That's super important for us. It's why we are waking up every morning and going to work. We don't want vendor lock-in. We want to be reversible. We want to be resilient. 
We want our customers to stay because they are happy, not because they cannot go somewhere else. Basically, that's what we want to do. So, you know, it's what we have done for the last 20 years. And I mean, from my personal point of view, I started my career at the same time as Octave Klaba, in 1999. The internet at that time, you know, we can all still make the sound of a modem connecting to the internet, and that's what it was. So today, I'm super proud to be the operations director for what is a truly open alternative to all the vendor lock-in public cloud operators. That's what we do. And that is very inspiring; when you look back to 1999 and where you guys are today, it's very inspiring. I'm picking up on what you said about being open and open source. I remember, like, five years ago or so, in one of the OpenStack Summits, you had a presentation where you showed that you assemble your own servers, and many other presentations, like on GPUs. At that time, we at CERN were looking at GPUs, and you guys were already presenting how you configure GPUs. And actually that was a big help for us to understand pass-through and the way you guys were doing it. Let's start with the assembly of the servers. So OVH, yeah, it's a part I have not mentioned because it's not my responsibility, and I should have. OVH is operating the servers, of course, but we are a vertical company here. Meaning our vendors are Intel and Qualcomm and whatever, you know; we're just buying the parts, and we have two factories in the world that work only for us. They are part of the group. One is in France, the other one is here in Montreal. We can assemble whatever we want, and it's specific to OVH, meaning that we have the best cost of operation there. It's of course 100% water-cooled. That can be tricky sometimes, mostly with the GPUs, but it has helped us a lot in different crises. 
Like, for instance, if we can speak about the Strasbourg crisis, we were able to reassemble and deliver thousands of servers within a few days. The two factories were working only on deploying servers all over our data centers to take all the traffic and users from those who had lost their service. I mean, at some point, if you go to a traditional vendor and say, okay, I want to buy 10,000 servers for tomorrow, they will laugh at you and say, hey, nope, that's not gonna happen. But because we are vertical, because we have our own factories, we'll do that. And even during the current crisis, which is worldwide, not only for us, about getting the parts for building our servers, we are more flexible in what we do and we can deliver better than the others. It's difficult for everyone, let's face the truth, but we can do great here. And even though the quantity of servers we need every single week keeps increasing, we are improving our delivery of that. It's still complicated. We have issues with our vendors for some parts, of course, but we can do well, and we can refurbish our servers. We can change them. We can upgrade them live. We are fully autonomous on that. Well, I see that as a great advantage, right? You guys have been more resilient. Yeah. It saved us many times, I mean, not even saved us, but it gave us a massive advantage compared to competitors. I mean, when, for instance, some of our American competitors in Europe were basically saying at the beginning of 2021, okay, guys, I cannot sell you anything, because they had no fucking hardware available, we were still able to deliver. Yeah, no, it's crazy. I mean, these days, if you can get hardware in three to four weeks, you're absolutely having a great day right now. 
You're living your best life. But I've heard lead times from the likes of Mellanox, also known as NVIDIA now, for some of their NICs: 30 to 40 week lead times to get cards. And some of the vendors will literally stop selling you individual parts, because they're like, well, we need these to do our full builds. So if you just want to replace a NIC in a system, we're not selling you that. You want a NIC, you buy a whole system. It's ridiculous, but I guess they're just trying to manage their supply chains and make sure they're taking care of their own business as well. So it's a complicated situation that I guess we're all struggling with, in the cloud and outside the cloud as well. And you know, there's some other stuff that is, I mean, still confidential, but I can speak about it because it's public that we're working on it. Today the market tries to get more power, mostly in the GPU world, on different configurations of racks. And we have been working for years on immersion-cooled servers, how to cool down and operate servers like that. And we are at the head of development. I mean, we are literally going faster and better than all of our competitors. And we're gonna announce very, very impressive stuff very soon. Yeah, I think one of the advantages I see you having is that, since you control your whole data centers, and they're kind of funky to start with, you know, like you have water running already into your racks, you're like, well, we already have water there. It's not like you can go to your data center vendor and say, hi, I'd like some water pipes to be run into my cabinets; there's absolutely no way anybody's accommodating that. That's for sure. Absolutely. It's truly interesting. And even for the racks, we don't have regular racks. We have our own specific rack, to improve our deployment model.
For instance, everybody in the world uses vertical racks; we don't, we have horizontal racks. That means we have rack one, rack two, rack three, and soon we're gonna have rack four, rack five, rack six on top of it, to increase our density and reduce our footprint. You know, it all goes with the water cooling system, the immersion of servers, just to make it better and more efficient and reduce our PUE. We have one of the best PUEs already on the market, and we keep working to reduce it. So going back to a comment that I think you made earlier, Xavier: you were like, oh, it's impossible to upgrade every six months. There's been an ongoing discussion, like forever basically, but renewed more recently, about release cadence. Everyone's favorite topic. And I know the release team's standpoint is: we can switch to longer or shorter. Right now we're at six months. We could move to a one-year release cadence, and they also presented to the TC that a good time to do that would maybe be when we restart the alphabet. So we're currently developing, jeez, X, Y... the Yoga release. That'll be coming out in a few months, and then we'll start on whatever the Z release is going to be named. So do you have thoughts? Would you be interested in moving to a one-year cadence? Do you want it to stay at a six-month cadence? We have no interest in it staying at six months. We strongly believe that having a slower cadence can be useful for us, but at the same time, that's just for us. We are jumping literally four to five versions every time we upgrade so far. Right now we're moving the entire infrastructure onto Kubernetes, to ease the process of migration and all that kind of stuff. We expect to be able to decorrelate more the versions of the different projects we're using.
Yeah, at the end of the day, upgrading an infrastructure that has millions of cores once a year is already a massive challenge. A massive challenge. It's very easy to start a new region with the latest version. That's no problem at all. No problem. If one day we end up having a business, which we don't have for the moment, of offering a private cloud region to a specific customer, who is basically buying hosts and running it on his own infrastructure, yeah, it's gonna be super easy to have the latest version running very fast. But so far we have to upgrade millions of servers, and that takes time anyway, because you basically need to move the data, you basically need to move the workload. We allow interruptions in the APIs, but of course we do not allow interruptions to the workloads of the customers. So yeah, that's something that takes time. So a one-year release cadence would be great for you? Of course, yeah. Got it. I've always kind of voiced my opinion on this, and I think most people can guess where I land, but the way I see it, it's kind of a double-edged sword, right? Because if you do releases less often, then those releases will be far more featureful, with much bigger deltas, right, in terms of checkpoints. So, you know, a year's worth of features, changes and whatnot can really become a bit more painful. For example, sometimes a project will actually split a transition to something over a couple of releases, and then a transition that took a year and a half at six-month cadence becomes a three-year transition. Yeah. Which is why I've kind of been suggesting something like how, for example, the Ubuntu world rolls, which is: you've got your releases every, you know, I think they do six or eight months or whatever.
But then every now and then, every four releases, you have an LTS release that is maintained for a longer period of time, and, you know, testing upgrades from an LTS to an LTS would be supported. And then your little point releases are supported for a shorter period of time. So, you know, if you want to be like us and always have the latest stuff, you can still do that, as long as you commit to continuously upgrading; that way you're not putting a burden on the community to maintain a bunch of releases for a long time. But if you're saying, okay, once a year is as much as I would like to, then you kind of stick to the LTS releases, which would have a longer support period, which means things are cherry-picked to them, things are backported to them. And you end up getting the best of both worlds in that kind of scenario. And yes, that's gonna be a bit more expensive, and potentially a bit more branch maintenance. But if we cut down the life of the non-LTS branches, it's like: if you want long-term, go LTS; if you want to upgrade often and get the latest and greatest, go non-LTS. And I don't know, I think it's a good balance, but obviously there's gotta be some effort involved. It's not that straightforward. Yeah, I don't have a strong opinion on this, actually. I see your point, Mohamed. I think it makes total sense, but I also see the dev side: it's quite difficult to maintain these LTSes, right? And we always complain about this release cycle mainly for Nova and Neutron, not for the small projects, right? Because those are easy to upgrade. I can give an example. We upgraded Glance very recently because we wanted a feature, quotas. And if we had needed to wait one year for the release, that would have created some concerns for us.
You see this duality; that's why I don't have a strong opinion about the cadence of the releases. That's a good point, right? And I feel like that might discourage some upstream contributions. You're like, I want to get this feature in, and you're working on it and whatnot, and it's like, okay, I merged it, see you in a year when this thing becomes a release. Then you'll have a lot more people starting to roll downstream forks. People already do that, but the deltas will be much bigger. Yeah. There are definitely pros and cons to one year, six months, or, if you want to get real crazy, three months or nine or whatever. The question is also about maintaining the different releases. I mean, if you go with the LTS, it's good, because it means you will maintain one specific version for a long time. And it's good for us, for example, because we deploy a really old release, and if we can cherry-pick some commits easily, it's good for us. So yeah, if we release once per year, it could also help maintain some releases longer. Yeah. I mean, right now in OpenStack there are three maintained releases already. So we could even cut that down to two six-month releases and an LTS. But again, that is a year and a half, right? It's not that much for Nova and Neutron. Yeah, I guess that's also a good point. I think mainly my idea is that we have a supported upgrade path. Like, you cannot land something that breaks the upgrade path to go from, let's say, Victoria to Xena or Victoria to Yoga, for example. And that would make life on those one-year upgrades a lot easier, because upstream just has to keep that in mind, I guess. Yeah. So we are almost reaching the end of the episode, so I think we should start with the tough questions. I think we're going to make their life easier and tell them to tell us what they can tell us. Ask whatever you want.
We'll see what we can do. So, OpenStack: for you, what is the biggest pain point? And it cannot be upgrades; you need to come up with something else. So, yeah... I can start if you want. Yeah, sure. I'm sure I have different points. Yeah, I'm pretty sure. The first big pain point we had in the past was Rabbit. We already talked about the Rabbit cluster, keeping Rabbit up and running. It's complex because we are not used to managing Rabbit. When it works, it's nice, but nobody knows exactly what's behind Rabbit or how it works inside, unless you hire someone from... I don't remember the company behind RabbitMQ, but anyway. It's better now, because we upgraded Rabbit to a later release, and we also worked a lot on the policies, on how OpenStack manages queues and how messages go through. And the work you're doing, Jeannie, on Oslo Metrics is also a good thing, I think, that needs to be developed more, because it will definitely help with managing the Rabbit and queue clusters. And yeah, that was one of the main pain points we had in the past. It's not really the case anymore; it works like a charm now. We also had issues with the DB, because with MySQL or Galera clustering, if you are not a DBA, you end up in situations where you don't know what's happening. We also hit bugs in a specific MySQL version, MariaDB actually, and upgrading usually fixes the bugs, but until you've found that you are affected by a specific bug, you still have issues. And what is common between Rabbit and MariaDB is that each is a central point for all OpenStack services. So if you have an issue on one of these two elements, you have an issue on almost all of OpenStack.
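One common mitigation for this single-point-of-failure problem is to give each service its own message-bus cluster, so a broken RabbitMQ only takes down one service instead of all of them. A minimal sketch of generating per-service oslo.messaging transport_url values; the cluster host names and credentials are hypothetical, not OVH's actual setup:

```python
# Hypothetical mapping of each OpenStack service to its own RabbitMQ cluster.
RABBIT_CLUSTERS = {
    "nova":    ["rabbit-nova-1", "rabbit-nova-2", "rabbit-nova-3"],
    "neutron": ["rabbit-neutron-1", "rabbit-neutron-2", "rabbit-neutron-3"],
    "cinder":  ["rabbit-cinder-1", "rabbit-cinder-2", "rabbit-cinder-3"],
}

def transport_url(service: str, user: str = "openstack", password: str = "secret") -> str:
    """Build the [DEFAULT] transport_url value for one service's config file."""
    # oslo.messaging accepts a comma-separated list of broker hosts in one URL.
    hosts = ",".join(f"{user}:{password}@{h}:5672" for h in RABBIT_CLUSTERS[service])
    return f"rabbit://{hosts}/"

# e.g. the value written into nova.conf; neutron.conf gets a different cluster.
print(transport_url("nova"))
```

With this layout, a Rabbit outage in the "neutron" cluster leaves Nova's control plane reachable, which is the isolation the speakers describe working toward.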
So one of the things we are currently working on is to split the Rabbit and MySQL clusters by service. We did not do that in the past, and I think it was an architectural decision that was maybe not a good one, but we are moving to split everything, and I think it's better to manage it that way. And maybe I can add an extra pain point: managing Neutron agents. I'm pretty sure everyone here has already figured out that Neutron agents do a lot of RPC calls, asking a lot of things of the RPC servers, and in the end that consumes a lot of MySQL resources, doing a lot of SQL requests. And when you need to restart every agent in a region, you have to pay attention to what you're doing, because you may end up in a situation where you are basically doing a DDoS attack on your own Rabbit and MySQL. This is something we understood by experience, and it's something you have to take into account. It's one of the pain points we had. Yeah, we even had to develop tools that give us elasticity on the database when we have to restart a whole region, because we need 10 or 20 times more read nodes in the database to handle all the requests. So, okay, we can do it, but it's a pain point. I can add something on my side that is more at the scale level right now. It's like, how are we going to handle this? I told you we have 30 data centers. We have 40-ish regions, a bit more than 40 regions. So we have literally 40 OpenStack control planes, to split our different workloads and compute nodes into groups of resources. Tomorrow we need to find a way to federate all those regions. And there was stuff started in the OpenStack world around that, but it's clearly not, let's say, production-ready, or even operable; we cannot operate it the way it was imagined. We're going to deep dive into that, because we have to.
We're in a position where we have to do it, and we will invest time in this, but this is definitely something where we will have to take the lead in the community and provide solutions for everyone, because it's a problem that we are facing as an enterprise. We want to be transparent for our customers. We want to deliver the latest version, the latest feature, to every single customer as fast as possible. And it ties in with the upgrade issue: with federation of regions, we can have the latest version of everything running in the latest region, open to every single customer, with the same federated identity. And maybe you don't know, but there's a product that is transversal to every single product at OVH, named vRack, that allows a Layer 2 network to connect every single product from a customer in his project. So literally, with this system, as soon as you can connect different resources in different regions, and have the federation of regions on top of that, then you can even upgrade faster, because you can have just a single latest region, available in the data center, that provides the latest features for every single current customer you have. So there are two aspects to this. Well, if you're going to explore that... I've definitely thought about it, never had the time to think more about it, but I think in an ideal world, you know, Keystone doesn't exist, and whatever you use as your internal authentication would generate signed JWT tokens that have the tenant name and the project ID and the roles and everything, and that's passed on to the OpenStack services, and then we can do offline authentication. You can have Google authentication, you can have authentication with whatever you need, as long as you just have a signed token with the right stuff in there. I think that would be really, really neat, but it's a very, very long-reach view.
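The offline-authentication idea sketched in words above can be illustrated with a minimal HS256 JWT implemented from the Python standard library. This is a toy sketch of the concept, not how Keystone actually works; the shared key and the claims are made up, and a real deployment would use an audited JWT library and an asymmetric key pair:

```python
import base64
import hashlib
import hmac
import json

def _b64(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(key: bytes, claims: dict) -> str:
    """Sign the claims once, centrally (the role Keystone plays today)."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64(sig)}"

def verify_token(key: bytes, token: str) -> dict:
    """Any service can validate the signature locally: no auth-service round trip."""
    header, payload, sig = token.split(".")
    expected = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64(expected), sig):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))

key = b"shared-secret"  # hypothetical; per the discussion, roles ride inside the token
token = issue_token(key, {"project_id": "abc123", "roles": ["member"]})
print(verify_token(key, token)["roles"])
```

The point of the sketch is the last line: once the token carries project ID and roles and is signed, each region's services can authorize requests offline, which is what makes the idea attractive for federating many regions.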
Yeah, we're definitely investing time in that, because we have to right now. Yeah. Yeah, I have something to add on RabbitMQ, as it seems to be a pain point for everyone. We are currently developing messaging plugins that do not use RabbitMQ, but they're still in development, so hopefully we'll have a good replacement for RabbitMQ. Awesome. That was awesome. All right. I think we are at the end, though. Yeah. Thank you so much. I think it was great. Thank you for having us. I think it would also be interesting to receive some comments and feedback from the people who watched this episode on whether this is a format they would like us to keep for the next large-scale episodes. Thank you so much. Thank you for having us, and I wanted to appreciate some of the comments that Thomas left in the chat as well. He made some really, really interesting points, so happy to have him participate. Yeah. Thank you for having me, and thanks OVH for providing some great insight for all the operators. Thank you all. Thanks. Thank you. Thank you to all of our awesome speakers today. You're all amazing. Appreciate you joining us, and also the audience, like Mohamed mentioned, for the comments throughout the show. So in two weeks we have another awesome episode lined up that we're really, really excited about. Mark Collier will actually be hosting that episode, and it'll be around discussing the Linux, OpenStack, Kubernetes infrastructure, or as we have dubbed it, LOKI. It will definitely be one not to miss. So make sure you subscribe on your preferred platform; we stream to, what was it, Facebook, YouTube, and LinkedIn. So make sure you mark your calendars for that. I am also very excited to remind you that we are returning to Berlin from June 7th through 9th. And if you're interested in speaking, the CFP deadline is in less than a week, so get your submissions in by February 9th; the URL is cfp.openinfra.dev.
So make your submissions there. Also, as always, we want to hear from you with regards to Open Infra Live episodes. So submit your ideas at ideas.openinfra.live and maybe we'll see you on a future show joining this lovely panel or some other excellent episode. Mark your calendars. Hope you're all able to join us on Thursday, February 17th at 15 UTC. And thanks again to today's guests. See you next time.