Hello everyone, thanks for joining us, and welcome to OpenInfra Live. OpenInfra Live is an interactive show sharing production case studies, open-source demos, industry conversations, and the latest updates from the global open infrastructure community. This show is made possible through the support of our valued members, so thanks to them. My name is Thierry Carrez, and I will be your host for today's show. We are live streaming on YouTube, LinkedIn, and Facebook, and we'll be answering your questions throughout the show, so feel free to drop your questions into the chat section of your streaming platform of choice, and we will answer as many as we can. Some of the most popular episodes of OpenInfra Live are the large-scale OpenStack shows, where operators of large-scale OpenStack deployments come and discuss operational challenges and solutions. Today, the large-scale OpenStack show is back for an ops deep dive into the Schwarz Group's OpenStack deployment. You might better know the Schwarz Group through its popular retail brands like Lidl or Kaufland. They have a large and growing deployment of OpenStack to support those brands. Joining today, we have Belmiro Moreira from CERN, Mohammed Naser from Vexxhost, and Arnaud Morin from OVHcloud, and they are joined by Felix Huettner and Etisham Ul Haq from the Schwarz Group. So to get it started, I'll hand it over to you, Belmiro.

Thank you, Thierry. Welcome everyone, but especially Felix and Etisham. It's great to have you on the show to learn more about open infrastructure at the Schwarz Group. So let's get things started. I think the first question, for both of you, is: tell us more about yourselves and how you got into the open infrastructure field.

Maybe I'll just start. My name is Felix Huettner.
I basically started with, let's say, the OpenStack endeavor back in 2018, when we started STACKIT as a brand, with the whole idea that we would build our own internal, and maybe potentially some kind of public, cloud environment, and I have been following OpenStack and all the associated things since then.

I'm Etisham, and my journey with OpenStack basically started when I was writing my master's thesis with SUSE in 2019. For the past two years, I have been part of STACKIT as well, working on the OpenStack infrastructure.

All right, so both of you work in the IT section of the Schwarz Group. I didn't know the Schwarz Group, but I know some of the brands that belong to it. Actually, I shop every week in one of them; one that is very well known, at least in Europe, is Lidl. But there are so many other brands, right, that this group has. Can you tell us more about the Schwarz Group?

Yes, the Schwarz Group is definitely most known for its retail brands, but we actually started a lot of other businesses. We have water-bottling businesses; we make our own chocolate, ice cream, and some meat products; we have our own recycling companies and similar things like that. And since we have a large IT base, there was also a discussion: what do we do with that? Do we maybe want to open the cloud a bit, for ourselves or for the public?

Right, so at least now my understanding is, of course, Schwarz IT is the IT part, but you are also open to the outside, right? So you have your own infrastructure that can also be used by other customers. And you created your own cloud infrastructure; what do you call it?

It's STACKIT.

STACKIT, yes. And it's a child company of Schwarz, and basically a separate brand from Schwarz IT itself. STACKIT is mainly focused on the cloud platform and on providing services to Schwarz IT as well as public customers.
Right. So, as the large-scale operations group here, we are interested in knowing more about this infrastructure. What can you tell us about having your own infrastructure? Instead of going to a public cloud, you needed to build all of this yourselves, so I believe you see a lot of advantages in that. Can you tell us more about it?

Yeah, the main advantage is the control and self-sufficiency we gain by not relying on some external large provider, where we wouldn't have any possibility to fix things when they go wrong. And on the other hand, the IT part of the Schwarz Group is not that small, so there's definitely a commercial aspect playing into it as well, where we see quite big benefits, which is also why we want to offer it to the public.

And of course you are using OpenStack for this infrastructure, right?

Yeah.

So can you tell us a little bit more about the architecture you are using? How do you deploy it? What is behind your infrastructure?

Yeah, we started back in 2018 with a quite small-scale infrastructure, which had more control nodes than actual compute nodes, which was not that ideal back then, and we scaled it up to something around 400 compute nodes, 60,000 cores, and so on. Right now it's all deployed using our custom-built, self-developed deployment tool named Yaook. We previously used a tool from an external supplier, but the control we have using something self-built is just a lot greater, and that is extremely helpful for us.

I actually had some questions on that as a follow-up. When you were taking the decision to use a deployment tool, obviously there are a couple available upstream, and you decided to pursue your own path. What were your main reasons for building that tool?
We saw a few drawbacks with existing tools, especially regarding automation of, let's say, day-to-day things, where our feeling was that a lot of them are based on at least some manual trigger or some manual starting of some kind of action. What we thought about and built here was basing Yaook on a Kubernetes control plane, and not building it as a deployment tool like Helm, which would just say: this is the infrastructure, I apply it now, and I will pray that it stays correct and healthy for the next years. Instead we use operators, which regularly reconcile state and can fix potential error conditions. And that has paid off quite a few times. I guess everyone knows the quite famous, or infamous, RabbitMQ, which might sometimes not work as well as you would like it to.

It's always a topic, RabbitMQ, in these shows. It's the greatest topic.

But using operators, we actually made it quite easy to rebuild such a cluster, even if it's a highly available cluster. We can just say: we delete the data, we delete the currently running processes, and the operators will fill it back up, including all the users that are in there, within just a few minutes. Of course, this means that you will lose some messages, but you don't care that much about those when operating at large scale. And honestly, most of the messages we lose are from Neutron, because the Neutron RabbitMQ is the one that always dies first, so there was not that much lost there. You always have the issue that you might lose information needed for things like billing, which is something where we needed to build some resend logic around.

Do you still deploy RabbitMQ in cluster mode, or only as a single process?

Yeah, it's an HA cluster, combined with TLS on the cluster protocol layer, which is something that I feel is not that common, because it's a little broken. And we quite regularly had issues until, I think, four or five weeks ago.
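The reconcile-instead-of-apply idea described above can be sketched in a few lines. This is a hedged, minimal illustration of the operator pattern, not Yaook's actual code: `ClusterState`, `reconcile`, and `run_operator` are hypothetical names, and "create one member" is a stand-in for real pod creation.

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    desired_members: int
    running_members: int
    healthy: bool

def reconcile(state: ClusterState) -> ClusterState:
    """One reconciliation pass: move observed state toward the desired spec."""
    if not state.healthy:
        # Analogous to wiping a broken RabbitMQ cluster: drop the bad members
        # and let the loop re-create everything from the desired spec.
        state.running_members = 0
        state.healthy = True
    while state.running_members < state.desired_members:
        state.running_members += 1  # stand-in for "create one member/pod"
    return state

def run_operator(state: ClusterState, max_passes: int = 10) -> ClusterState:
    """Keep reconciling until observed state matches desired state."""
    for _ in range(max_passes):
        state = reconcile(state)
        if state.healthy and state.running_members == state.desired_members:
            return state
    raise RuntimeError("did not converge")

print(run_operator(ClusterState(desired_members=3, running_members=2, healthy=False)))
```

The contrast with a one-shot Helm install is that this loop runs forever: a broken cluster is just another observed state to converge from, not a manual recovery procedure.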
We could basically count on RabbitMQ dying on Friday afternoon. We don't know why, but it always died on Friday afternoon. But then we found an option in RabbitMQ regarding CPU scheduling, which has now brought it to running completely stable for four or five weeks. And we are extremely confused that such an impactful option is so obscure, and that it took us so long to actually find it.

I would say: just be happy that it is stable at this point, yeah. So, I saw the presentation that you gave at the OpenInfra Summit in Berlin this year; it was a great presentation. And you showed some numbers there about the size of your infrastructure. It would be great if you could tell us more about that, because they were relatively large numbers. Are you managing one huge OpenStack cluster, or do you have several clusters and aggregate all those numbers?

At the moment, it's one large OpenStack cluster with all the 400 or so compute nodes in it. We are thinking about splitting that up in the future, because while it can probably scale a little larger, there is definitely at some point a limit that we don't want to go above. And on the other hand, we are thinking about how to map that onto the regions or failure domains that we want to offer to our users. Because the common failure domain of an AZ being some kind of data center, a physical location, is from our perspective not always the most helpful one if you then have one large OpenStack deployment spanning it, because the OpenStack deployment probably dies a lot more often than the physical data center breaks down. So we are actually thinking about redefining one OpenStack cluster as being one availability zone.

Right, so having different OpenStack clusters, and then you define different availability zones on them, right?

Yeah, basically we say this OpenStack cluster is availability zone one, or something like that.

Right, okay.
So you're not going to use AZs; you're going to use regions instead of AZs, right?

Yeah, yeah. Which brings quite a few challenges, like no stretched networks and things like that.

Yeah.

But it's the path we currently believe is the best one; it's also not yet in existence, so maybe we'll still find blockers on the road.

Okay, it's interesting. I think we have a question... no, I don't think so. No, okay. Ah, we actually have one question now, a question from Kurt Garloff. Perfect. So: do you experience scaling challenges with Neutron with more than 400 compute hosts using OVS? That's what Kurt has seen in the past. Do you encounter those limitations?

I would say definitely yes; OVS is definitely a scaling challenge. There's not necessarily a hard limitation on what you can scale up to, but I think a large amount of that scaling challenge comes from the amount of update messages running through RabbitMQ, which just increases to a crazy number. And one thing we found out, which might be quite unique to our environment, is remote security groups. We have a user base that, at least in some part, uses quite large remote security groups, referring to 2,000 and more ports. And that definitely slows down the Neutron Open vSwitch agent when it needs to load this information for the first time.

And just to add to that: with so many compute nodes, you have more customers and more routers. We had some issues with a limited number of gateway nodes, which basically killed our gateway nodes, so we had to extend them and distribute the routers evenly across them, to accommodate all those compute nodes and all the customers' routers.

Are you using the DVR stuff on your side?

No.

Same as in our environment, then. I think as the large-scale OpenStack users that we are, we always reach these kinds of problems, right?
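A rough back-of-envelope calculation shows why remote security groups hurt the Open vSwitch agent: each local port using the group needs rules materialized per remote member address, so the work multiplies. The function and the numbers below are hypothetical, just in the ballpark mentioned in the show.

```python
def rules_to_materialize(ports_on_host: int, remote_group_members: int,
                         rules_per_member: int = 1) -> int:
    """A remote security group is expanded per remote member address:
    every local port using the group needs a rule per member IP."""
    return ports_on_host * remote_group_members * rules_per_member

# Hypothetical: 20 local ports on a compute node, each using a remote
# security group that matches 2,000 remote ports.
print(rules_to_materialize(20, 2000))  # → 40000
```

With a few thousand remote members, a single compute node's first sync can mean tens of thousands of rules, which is consistent with the slow initial load described above.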
One of them is RabbitMQ, which we already touched on, and the other is Neutron scalability. Yeah, and maybe we need to find a way to overcome these issues, because they keep popping up, right, in our discussions.

Well, the cool thing on the RabbitMQ side is that there was a recent effort that surfaced, starting with a discussion on the openstack-discuss mailing list. There is a group of individuals trying to work together to implement a NATS driver. NATS is an alternative messaging queue system; it's a CNCF incubated project, I think. It has gotten a lot of traction behind it, and there are a lot of big users of it. There's an initial draft of the oslo.messaging driver that needs tests, and the spec needs work and everything. But if anyone's watching who is interested, I suggest looking at that effort, because it could definitely be a solution for a lot of these issues that we're seeing.

On the other hand, I don't want to say that I'm skeptical, because I encourage this; I'm for it. But one of the community members once brought this up, I think in a TC meeting, I think it was Dan Smith, and he said: at least with RabbitMQ, we know how it breaks. We have a lot of experience with how it breaks, and with something else, we don't know how it breaks yet. So maybe taking the thing that we know how it breaks is better than taking the thing that we don't know how it breaks yet.

Yeah, which is completely fair, right? It's two hard choices.

Yeah, but it's also a completely different technology; we cannot compare it one-to-one to RabbitMQ. So probably there are use cases of people using RabbitMQ today for which it will not be an option, right? The reliability of message delivery that RabbitMQ offers; on NATS, that will be more difficult. Yeah.
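For context on how such a driver would slot in: oslo.messaging selects its backend from the scheme of the `transport_url` setting. The fragment below is a hedged sketch; the `rabbit://` form is the documented default, while the `nats://` scheme is purely hypothetical, illustrating where a future NATS driver would plug in. Hostnames and credentials are made up.

```ini
[DEFAULT]
# Today: RabbitMQ, selected via the rabbit:// scheme (oslo.messaging's
# default driver).
transport_url = rabbit://openstack:secret@rabbit-host:5672/

# Hypothetical: a NATS driver would register its own scheme, e.g.
# transport_url = nats://openstack:secret@nats-host:4222/
```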
So at least our idea to solve this problem is to basically get rid of RabbitMQ for Neutron and go towards OVN; then, at least with Neutron, we don't have these messaging issues at all. So right now, yeah, we are going in that direction. We are implementing the OVN integration into Yaook, so we can move away from RabbitMQ at least for Neutron, because that's the most painful one. For the others, yeah, we have contributed and are willing to contribute to the NATS driver or another replacement for RabbitMQ. So let's see how it goes.

I think you tried quorum queues on RabbitMQ on your side, right?

We are. We are building up a new environment, because our existing environment is based on the nice and old Queens version, and we would like to get something not as horribly old as Queens. So we are building a second environment, currently based on Yoga, and that will have quorum queues enabled. Unfortunately, we can't enable them in the older environment, just because the support is not there in oslo.messaging. But until now, honestly, we haven't done load tests on it yet. So let's see how well it behaves. If it at least doesn't behave radically worse, that's already good; we hope it's actually the answer.

On the question of OVN, I wanted to bring this up because that's something we've looked at. But I think one of the concerns with OVN is: do you guys feel that it's at feature parity? Is it going to support all your use cases? For example, something that we see often is customers that need VPN-as-a-service, and VPN-as-a-service is not something that's available with OVN right now. That's one example of something that stuck out to me; I don't know if there are other things slowly making their way in to reach that feature parity. But do you feel the disadvantages of losing some of the features of OVS, which has been around for so long, are worth it in return for the scaling advantages of the OVN driver?
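As an aside on the quorum-queues experiment mentioned above: my understanding, which should be checked against the oslo.messaging documentation for the exact release, is that Yoga-era oslo.messaging added a configuration switch for declaring queues as RabbitMQ quorum queues instead of classic mirrored ones, which is why it can't simply be flipped on in a Queens environment.

```ini
[oslo_messaging_rabbit]
# Hedged example: declare RPC queues as RabbitMQ quorum queues rather than
# classic queues. Only available in recent oslo.messaging releases.
rabbit_quorum_queue = true
```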
I think we definitely see this. We are heavy users of the neutron-dynamic-routing plugin to propagate the addresses of our routers using BGP, and that's not yet there for OVN. There's a project that was also shown at the summit, I think in the very last session, which showed the initial version, and we are now trying to make that work for us and potentially contribute there. On the other hand, for the existing logic we have on OVN, we have tried to get it to break in some kind of way, and from our perspective it is significantly harder to break and significantly easier to restore than the OVS one. For the current OVS L3 agent alone, if it needs to start up, single routers might take an hour or so, while with OVN we are at some minutes at most. And I guess there's a significant benefit also in not sending all those messages at all. But definitely it's a big trade-off; we are, for example, not users of the VPN-as-a-service plugin. And what we see as a benefit: since we say we want to have multiple OpenStack clusters in some kind of way, we will have the requirement to connect them in one way or the other, and OVN already has a solution for that, the OVN interconnect project, which allows you to interconnect different OVN environments. So it's a trade-off that might be worth it from our perspective.

Not to change the subject, but I'm actually curious: since you're running Kubernetes and OpenStack together, which is something that we do too, do you put the infrastructure services in Kubernetes as well, things like OVN, OVS, or libvirt, or do you run those directly on the host?
Let me maybe first shortly address that OVN is still a little incomplete: an hour earlier I was literally working on a patch for OVN to get IPv6 on external networks to work correctly, because addresses don't get announced again on failover. But on the overlay it works extremely well; we didn't find anything there.

The previous question was about running libvirt and OVS on the host versus in containers. Basically, with our current OVN integration, all of it will be running inside containers, and libvirt as well; we run it in containers. Whatever Yaook deploys, all of it is in containers. The big benefit we see with that is that our deployment tool just needs to get the node into Kubernetes, and the rest then happens via some kind of magic. What we do, at least for libvirt: the people at OpenStack-Helm had a really cool feature to let the VMs escape out of the pods, so that even if the libvirt pod or container dies, the VMs are fine.

Yeah, that's how we use OpenStack-Helm for that challenge too, so it looks like you're on the same route. While we're on the topic of OVN, we have a question from Florian: you've mentioned that you would consider moving to OVN; have you already worked on a migration path? Are you going to build it from scratch, are you making them coexist, or is there another solution that you've considered?

Okay, so basically there are two things we are bringing in: one is that we want to migrate to the Yoga version, and at the same time we combine that with OVN. The path we are taking here is that we will do an offline migration of all the projects, one by one, from one cloud to the other. So it will not be online, it will be offline, and we will migrate all of them from our current cloud to the new one running Yoga as well as OVN, so we will not use OVS anymore.

I look at that and I'm like, man, it would be nice if we had
that privilege, but I think as public cloud operators we will never have that privilege of just shutting down everything.

It took us some convincing to get it, yeah. The basic rationale being that, with what I think is eight upgrade steps from Queens to Yoga, our assumption is that something will probably mess up somewhere, either we or some kind of bug we hit, and we would probably have a larger downtime upgrading through the individual versions than shutting individual projects down and migrating them.

Yeah. I think it's time to go back to Yaook and try to understand it; you already explained a little bit about the motivation behind it. You developed your own OpenStack lifecycle management system, which is courageous, but then you end up not only managing OpenStack but also managing the Kubernetes clusters that run OpenStack on top. I'm curious about all the layers involved here, because a large-scale OpenStack is challenging to operate, and Kubernetes is as well. What is your day-to-day like, operating these two stacks?

At least from my perspective, a lot of work went into automating things like node deployments: nodes coming up, initially registering themselves in some kind of management system, being deployed by Ironic, getting a gigantically large config drive with Ansible that actually deploys Kubernetes on them and turns them into a cluster. And that's something that works extremely well automatically, apart from the general quirk of "I'm some kind of hardware, I want to behave differently than everyone else"; sometimes they feel a little bit like cats, they just do what they want. But afterwards it's an extremely smooth thing. We have just a general inventory and network tooling, and we just say "this is a compute node", and it gets deployed as a compute node, it acts as a compute node, and the operators take care of the rest: actually deploying either the Neutron Open vSwitch agent or OVN, and then deploying nova-compute and all
that magic, including monitoring and setup like that.

Well, the way you say it, it looks very simple, like everything is always working, right? But in terms of upgrades, you need to upgrade not only OpenStack but also the Kubernetes clusters. I guess that will be a big challenge for you.

Our current idea for addressing this is: we not only have Kubernetes, we also have the host below it, with all its nice and fancy system packages that might have vulnerabilities, be outdated, and things like that. And our idea, admittedly a complex path, is to optimize our node deployment process so that we can basically say: we now take one or two of these 400 compute nodes, wipe the disks, and deploy them completely fresh with the current version of Linux, potentially including a Kubernetes upgrade for that node, as far as possible, and then go on from there. So the idea is not to manage these things individually, but rather to include them in the lifecycle of the node, and to make this lifecycle extremely short.

Okay, I see. So it means you are able to live migrate a lot of instances in order to empty the computes; you are using live migration a lot. And no issues with live migration in daily operations, no ops needed for that?

We had some issues, especially regarding the remote security groups we mentioned earlier, because they caused the Neutron Open vSwitch agent to hang, and that meant that newly live-migrated VMs couldn't get their ports plugged and were offline for 5-10 minutes. But other than that, it works quite well. For Queens we have a lot of, let's say, workaround logic in there to make sure it works smoothly. For example, in libvirt you have the possibility to say how fast you actually live migrate, so we turned that down to an extremely low number, then wait for Nova and Neutron to actually figure out everything on the destination host, automatically check if everything on the destination is fine, like if the ports are
plugged and if the security groups are applied, and only then release the migration speed and let the migration actually run.

So you have some custom patch in Nova for this?

No, it's in our live-migration logic; it's a tool, or a bot of sorts, around Nova that manages that. It's basically the part of Yaook that cares about emptying nodes and ensuring that it can do maintenance on them, and that has grown quite complex over time.

And what about things like hardware incompatibilities, different kernel versions, the other stuff that live migration issues come from? Or haven't you had too many problems with that?

Kernel versions were actually nice to us so far, but we pinned the CPU versions of what we offer to our users, so that we can be sure that what we offer is available on all hosts participating in a given host aggregate, and therefore that we can live migrate all instances around. In some ways that makes it non-ideal, as we might not be able to offer the latest CPU features to our users all the time, or only in specialized aggregates and flavors; on the other hand, it allows us to do this kind of maintenance. But we will, for example, have GPUs coming up quite soon, which makes the live-migration idea no longer that valid, at least for those hosts.

And for your live migrations, are your VMs all using block storage only, or some block storage and some local storage, or all local storage? What's your mix?

The most part is on the Cinder block storage. We have a small amount of things that we run on local storage, especially the load balancers that are used by the Kubernetes service we are offering, because there are a lot of them, and because the image they use is just tiny and not worth the overhead of provisioning on block storage. It's just copying around one gigabyte of disk, which doesn't really change at all, so it's quite easy.

That is easy,
live migration when you have shared storage. In my experience, personally, I love live migration and it works great, but I don't have shared storage, so we are transferring 200-gigabyte images around, and sometimes that doesn't go well.

Yeah. We definitely have the requirement from our users to offer things like local storage in the future, just because of the performance it offers, and we are not yet sure how we will handle that.

Now I think we have a new question. Yes, we have a question from a LinkedIn user who asks: don't you face issues with etcd when using central block storage?

I'm a little bit confused... If I were to guess: you mentioned you have a Kubernetes service, so it's probably about your customers running their Kubernetes clusters, and the etcd for those clusters running on block storage.

That sounds possible. They use quite a lot of IOPS, and we have different qualities of service that we offer to our users, and those are normally in the larger region.

What sort of drives are you using for your cluster, NVMe?

No, it's actually a NetApp-based storage; we are heavily benefiting from the deduplication. We had a Ceph cluster in the past.

And is it using NVMes, or SSDs, or normal drives?

It's a fun combination of NVMes and SSDs.

Right. In our case we haven't seen too many issues, but we run NVMes for our block storage, so maybe that's why. But I think that's a good question; depending on how you use centralized storage, it might be pretty slow.

We definitely needed to figure out a bit which performance class we actually need to give to the etcds, because they definitely write quite a lot sometimes.

A side question, as we're discussing this: how are you building your images? Are you using an existing OpenStack project's images for the containers, or are you building your own?
We are currently building them on our own. We had a fun discussion at the summit about actually combining that with the OpenStack upstream effort, but there's not yet much progress in that direction. A few of them carry one or two custom patches; for example, the Neutron Open vSwitch agent has a patch to expose its own status, so that we have some reflection of that status in Kubernetes, for example if it can't finish its iterations because of some error, because they take forever.

So are you building with virtual environments, or are you installing the packages from the provider you're using? Installed from source, or from distro packages, and how do you manage that?

It's installed from source, basically, and then from the different stable branches, or the EOL branches for some of the Queens things.

Got it. Was there a motivation to not use LOCI? Because I think, Mohammed, you were using LOCI for your images, no?

We used to use LOCI, but then we kind of wrote our own little thing, mainly because we started using a lot of the new features of Docker: things like multi-stage builds, and the new front-end features, so we can git clone directly from Docker. So there's a bunch of stuff that we did around that, but essentially the API of building our images is still the same as LOCI; I can still use the bindep idea to get all the binary dependencies. But yeah, I guess that's a question for you as well: was there a reason for that?

No particular reason. I remember taking a look at LOCI back then, and I'm not exactly sure, but we had issues getting it to fit quite well for us, and we needed to customize the images anyway, because we wanted to put in things like startup scripts. I know that in OpenStack-Helm those are mounted from quite interesting ConfigMaps that contain a large amount of scripts, and we wanted to actually bake them into the image.

That makes sense. And the OpenStack
installation part is honestly quite small in the images then.

So you built the OpenStack images yourselves, and all the manifests are managed by the operator. What about things like the infrastructure tooling? And when I say infrastructure tooling, I mean RabbitMQ, and I guess a Galera cluster or whatever form of database. Did you write all that code yourselves, or did you rely on Helm charts that already do that? How did you end up with that deployment aspect of the work?

We actually started with Helm charts, using just the official RabbitMQ and Galera Helm charts, but we quickly came to the point that they don't necessarily support what we want to do. On the one hand, we wanted to have TLS everywhere, which a lot of Helm charts might support for frontend traffic, but definitely not for replication traffic. And we wanted the possibility to use custom logic during upgrades and during recovery, like recovering the Galera cluster, which is extremely hard with a Helm chart.

Sorry, one moment; this is interesting, because you also manage your databases with your lifecycle tool. At least for me, that was always risky to do. In the way that we manage the databases, they are always kept in a corner: we have very little automation touching schemas and so on, to avoid any issue. It's interesting that you feel comfortable having this in your lifecycle management. I don't know if the others can comment. Arnaud, you are smiling.

No, I just think, like you said, it's kind of risky, yeah. For us, we had bad experiences at the beginning, when we had the Puppet modules always touching the databases; we decided, let's stop this, and we only touch them manually.

We actually had not-that-positive experiences with automated updates on databases as well. But what we have as a benefit is that the operators have an integration test logic, so for each change we do, we now spin up a whole OpenStack
cluster, including Galera clusters and RabbitMQ clusters, and that allows us to catch the issues we would normally face with wrong images or wrong config files and things like that.

All right, interesting. And then what about monitoring? I noticed that in the dependencies list you need Prometheus, so I'm assuming that is part of the monitoring infrastructure. Can you talk a little bit more about how you do monitoring, both internally and inside your operator tool?

Yeah, that's something not completely part of it. We have multiple Prometheus environments running that are federated with each other, or doing remote writes, so that in the end we basically have one large Prometheus environment where we can query metrics from all clusters and all environments, down to the CPU usage of a single core on a given node. I'm not sure how far that will scale; I guess we will find the end of that at some point.

You always find that something scales up to a point; you'll find out when you get there.

Sorry, go ahead. I was just going to ask if you have some custom exporter for OpenStack metrics, or anything fancy on that part, or is it a regular exporter that we can find upstream?

It's an exporter for OpenStack; we query a lot from the database itself, because querying the API for such a large amount of resources is, from our experience, not that stable, and especially not fast.

Yeah, that's what everybody is doing, I think.

Well, something I've actually been thinking about, that maybe one day we can do if we have the time: I think what could be really interesting is if people moved towards using something like uWSGI in a more standard way. There's something like a uWSGI exporter that could probably start to track some interesting metrics: how long requests are taking, how many requests are coming back with a 404 or 400 or 200. That way we can get some measurable data out of it, so
So that's kind of one thing that I think is neat. The other thing is using MySQL exporters to generate stats. For example, something that we have been thinking about is monitoring the instance_faults table in Nova and using that to surface scheduling failures, or failures where a VM didn't come up, and then you can alert on that, because it's much easier to catch in the database by monitoring it. Ideally it would be nice if Nova or something like that exposed this directly, but it seems like a much less intrusive thing to do than going into the code and starting up a service on an arbitrary port on all these extra OpenStack services that are running. So that's something I've been thinking about, and I don't know what everybody else thinks about these ideas. We actually do that, and I think it produces quite valuable things. We mentioned live migration earlier: there's always the fun instance that has a random state and that you can't easily live migrate, and that's the best place for us to figure that out. We get a metric for instance live migration errors and basically take a look at that. Yeah, actually we do the same, and we started this long, long ago. We wrote a little tool that we call the logger, like the database logger, which goes frequently to the database to extract some metrics and some information, not only for statistics but also for alerts for the infrastructure. And we have been running this almost from the beginning, because from the beginning we felt the need to have a tool like this for our operations. It works very well for us, I would say; we have a lot of alerting based on these queries from the database. So we have a query exporter, and then I think we also have some exporters on the gateway nodes for exporting some important routers directly from the gateways. But in combination: we check how it looks in the database and how it is in reality, and based on that we create some alerts and metrics.
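The database-polling approach discussed here can be sketched with a small example. It uses an in-memory SQLite database as a stand-in for Nova's MySQL database, and the table schema is simplified from the real instance_faults table, so take the column names as assumptions:

```python
import sqlite3

# Stand-in for Nova's database; a real deployment would run a similar query
# against the MySQL `instance_faults` table. Schema simplified for the demo.
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE instance_faults (
           instance_uuid TEXT, code INTEGER, message TEXT,
           created_at TEXT, deleted INTEGER)"""
)
db.executemany(
    "INSERT INTO instance_faults VALUES (?, ?, ?, ?, ?)",
    [
        ("vm-1", 500, "NoValidHost", "2022-06-01 10:00:00", 0),
        ("vm-2", 500, "Migration pre-check failed", "2022-06-01 10:05:00", 0),
        ("vm-3", 500, "NoValidHost", "2022-05-01 09:00:00", 1),  # soft-deleted row
    ],
)

def fault_counts_since(conn, since):
    """Count live fault rows per message, e.g. to feed a Prometheus gauge."""
    rows = conn.execute(
        """SELECT message, COUNT(*) FROM instance_faults
           WHERE deleted = 0 AND created_at >= ?
           GROUP BY message""",
        (since,),
    )
    return dict(rows.fetchall())

print(sorted(fault_counts_since(db, "2022-06-01 00:00:00").items()))
# [('Migration pre-check failed', 1), ('NoValidHost', 1)]
```

An exporter would run this on a timer and publish the counts as labeled metrics, which is exactly the "alert on the database, not the API" pattern the panel describes.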
It's interesting that, even in different ways, we are always doing very similar things, right? It's interesting what you said about the routers, that you monitor them because the database does not always reflect what is on the network node, right? It could be that there are some zombie routers, and it could be the other way around as well, so we just want to make sure that both of them are in sync, and if not, let us know. It's kind of similar with Placement. Placement, yeah, we had similar issues there: the resource allocations were not what they normally should be, so we have some alerting for that as well. We don't know why it gets out of sync, but at least we have some monitoring for it. I think you're going to say exactly what I was going to say next: in Yoga, for sure, a lot of Placement issues will be solved, maybe that's what you were going to say. And there is also a tool. Initially we had the same issues, and I wrote a whole script to just go and clean that up, and one of our contributors took that and brought it into the nova-manage tool. So now in nova-manage there's actually a placement audit command, a kind of DB audit, which will go and make sure that Placement and Nova are fully synced up, and potentially clean up anything that shouldn't be there. That sounds like something to check out, yeah, and one more horrible Python script cleaned up, except that maybe you cannot run it on an old OpenStack version; for that you need to run one of the latest. That has not been a thing for us. And just a last question about the routers: are you using HA with keepalived, in order to have active and passive? That works mostly well, unless the network nodes are under a large amount of load. In that situation we had similar issues, exactly because there are a lot of race conditions between Neutron and keepalived, and within keepalived itself, and you can end up with a lot of routers not being correctly configured,
or even being correctly configured but not correctly set in the database, so you think you have an issue but you don't. One of the things we did for a while: we have an exporter that goes and extracts the HA status from the keepalived process, whether it's active or standby, and then we have alerts if there's a router that is standby in keepalived on all of the network nodes. So even though the database may say it's active or standby or whatever, the reality is that keepalived is saying it's standby across all of the network nodes covering it. We would alert on that, because that network does not have a single active router, so the VIP is not anywhere. We had something similar, but the other way around: we had some routers with multiple active instances, and we had alerting for that. I'm wondering how much of that has to do with the fact that, at the end of the day, we are using a bunch of x86 boxes to send a lot of traffic through, and that's historically known to be something they're not great at. Obviously we've got things like DPDK and hardware acceleration available, but I think for the most part a lot of the virtual routers don't really make use of those, and because of that we're seeing a lot of CPU usage and a lot of these throughput issues. I don't know if you guys think that having some hardware acceleration, especially on the network nodes, would actually bring a lot of benefits to this sort of thing. There was an interesting session at the summit, from Canonical and Nvidia I think, regarding the new DPUs Nvidia is planning to offer, which seem to have some kind of, I'm not sure if it's an integrated vswitch directly or if it's just an OVN integration, but that seems to do hardware offloading for, for example, external networks. We're thinking about actually checking that out once they are available; I'm not sure what the status is on that. Yeah, I know the ConnectX-5 and the ConnectX-6 also do hardware offloading.
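The two alert conditions described just above, a router that is standby everywhere, and a router with multiple actives, reduce to a small aggregation step. A minimal sketch, with all router and state names illustrative (a real exporter would scrape the keepalived state per network node, for example from Neutron's per-router HA state files):

```python
from collections import Counter

def router_ha_alerts(states):
    """Given {router_id: [keepalived state on each network node]}, flag
    routers with no active instance or more than one (split brain)."""
    alerts = {}
    for router, node_states in states.items():
        active = Counter(node_states)["master"]
        if active == 0:
            alerts[router] = "no active instance: the VIP is nowhere"
        elif active > 1:
            alerts[router] = "multiple active instances: possible split brain"
    return alerts

# One keepalived state per network node hosting the router (illustrative data).
states = {
    "router-a": ["master", "backup", "backup"],  # healthy
    "router-b": ["backup", "backup", "backup"],  # VIP not hosted anywhere
    "router-c": ["master", "master", "backup"],  # two nodes think they lead
}
alerts = router_ha_alerts(states)
print(alerts)
```

Comparing this per-node view against what the Neutron database claims is exactly the "database versus reality" check the panel describes.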
That's a big thing that will take care of a bunch of it, but I don't know if it's the fix-all, especially on the network nodes, where it has to do NAT and all these other features that have to be implemented at the kernel level, right? There are iptables rules that have to be jumped through to get the NAT to work, and all the other stuff that comes along. So maybe that's something OVN does better; at least it doesn't need to go through iptables for the NAT and things like that, so I guess there's a benefit. And the DPUs that we saw at the summit were actually doing all of this in hardware; how well that actually works, I think, is a question to be answered.

Right, so we heard everything about your architecture, how you deploy things, how good it is. So maybe now, to close the show, it would be very interesting to hear about things that didn't go as planned. So, who deleted the database? Let us know. I think that's why they wrote an operator, so they can say the operator deleted the database; it was the computer, not me. Let's give it a name. But yeah, we had issues when we were migrating our environment from the previous configuration-management tool to Yaook, basically, and one of the bigger ones, I would say, was the migration itself; no surprises there. Basically the problem we had there was with versions: we are running MariaDB with our Yaook operators, and on our old environment we were running an older version of MySQL. When we migrated, there was a flag in MariaDB which was enabled by default, called derived_merge. We have a lot of RBAC rules, and there was one query in Neutron that didn't like this derived_merge flag: the flag was introduced as an optimization, but in our case it did basically the opposite. So that created a big outage, and it took us some time to figure out the actual difference between our old environment and the new environment.
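For reference, optimizer flags like this one can be toggled without patching anything; a minimal config fragment, assuming a MariaDB version recent enough to have derived_merge on by default, might be used as a stopgap until the offending query is fixed upstream:

```ini
# my.cnf fragment: disable derived-table merging server-wide.
# (Can also be set per session: SET SESSION optimizer_switch='derived_merge=off';)
[mysqld]
optimizer_switch = 'derived_merge=off'
```

Whether to disable it globally or only for the affected service's sessions is a judgment call; globally, you give up the optimization for every query that does benefit from it.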
We had to go through all the places and see what was new and what needed to be fixed. So that was one interesting one, I would say. That must have been very difficult to debug; one flag causing issues, and especially in the database, that is challenging, I guess. We were literally going through all the parameters and seeing what the differences were, what was enabled and what was disabled, because other than that our workload was the same and our Neutron versions were the same. Basically the only difference was that the MariaDB and MySQL versions were different, and we were not expecting that there would be a flag enabled by default which, while it makes sense as an optimization, basically didn't work with the way Neutron wrote the query. There is actually an upstream merge request for it, still in review, for fixing this query and making sure that it works in both cases and is optimized. Yeah, we spent the whole night there; it was not so great. I think that is a great lesson: when you change something in your environment, you should do it in steps. If you change everything at once, upgrading and changing the database at the same time, that will be a nightmare to find, whereas otherwise you can narrow it down: this must be the database, something wrong there, and you can start investigating. Basically we did this migration on our staging environments first, but the query was only problematic for our production workload: it only triggers if you have a bunch of rules for the same network but for different projects, and only then does the optimizer actually change something there, and that was really not something we were expecting. But Felix, you said that you have another one? Another one, another one for databases. That's at least, I think, two and a half years old or so, but the issue we were facing there was basically that the database servers were under a large amount of load, basically at the limit of what they could handle, and the database itself was
falling apart randomly, the issue basically being the wsrep, the Galera replication thread, dying: it just didn't get CPU time, because the normal MySQL query threads took all the CPU time. The fix was equally easy after finding it: you can actually set a thread priority for that, and that ensures this replication thread always gets CPU time, preferred over everything else. It also helped us recently: we had a change, I think two days ago, where we changed the MTU of our external network, thereby touching, I think, every single router we have and creating a database load of, I think, 220, as top said, when the server had something like 60 cores. But thanks to that setting, at least the database cluster survived, even though it couldn't answer any queries anymore. All right, I was expecting something dramatic, like an outage for one week, or "we deleted the database". Maybe there was one day we had a fire in the data center? Like, actually, no. You're saying things, and I don't want you to manifest things here; can we manifest something else, like an easy upgrade or something, rather than these?

So thank you so much, guys. Do you have any other question to close the show? All right, I think that's it. Thank you for coming along, guys. Yeah, thank you so much. Yeah, thanks. Thanks very much, that was really a great discussion. I wish we had more time, but we are almost out of time, so I want to thank you all for joining us today; it's always a pleasure. Thanks again to the OpenInfra Foundation members for making this show possible, and thanks to our audience for some great questions during this show. As you may know, OpenStack Zed will be released next week, so we have another great OpenInfra Live episode lined up that I'm super excited about: we'll have members of OpenStack deployment teams present their recent work and the exciting features that shipped with this release. Lots of technical content again, and lots of opportunities for live questions about it, so make sure to subscribe to this channel on your
preferred platform. Also remember that if you have an idea for a future episode, we want to hear from you, so submit your ideas at ideas.openinfra.live, and maybe we'll see you on a future show. So mark your calendars; I hope you're all able to join us on Thursday, October 6th at 15:00 UTC. Thanks again to today's guests, and see you all next week on OpenInfra Live.