in two different ways is because, in the case of RabbitMQ, even if RabbitMQ works well, it doesn't mean that all your OpenStack RPCs and notifications are working fine, so we also started monitoring RPC calls and notifications from the OpenStack side. Below are some of the reasons that may cause an RPC reply timeout. The first one is that the message is lost, or the message is delayed, in the RabbitMQ cluster, which is RabbitMQ's fault; but you can see that if an RPC server hits an exception, or takes a very long time to process a message, that will also cause reply timeouts on the client side, and those latter two are basically not RabbitMQ's issues. That is also the reason we try to monitor RPC calls in two different ways.

For monitoring RabbitMQ we use a RabbitMQ exporter. We're not using the official one but another open source one on GitHub, though there is also an official RabbitMQ exporter you can use. The main metrics we look at are message processing status, things like the number of queues, the number of queued messages and the number of connections to RabbitMQ, and also some node-specific metrics such as file descriptor usage, memory status, partition status, and whether the RabbitMQ node is up and running or not. These are the metrics we currently monitor, and we have also set up alarms on them, so if the memory usage is high it triggers alerts, and if too many messages stack up in RabbitMQ it also triggers alarms, so we can look into the reason.

Next is how we monitor RPC calls from the oslo.messaging side. We have proposed a library called oslo.metrics, which I am currently working on getting upstream; it is basically a Prometheus metrics server that exports RPC metrics from oslo.messaging. The metrics we currently monitor are identical on the server and client side: basically how many RPC calls start and end, how much time it takes to process those RPC calls, and how many exceptions happen across all RPC calls. Some of these are useful when debugging or trying to find out where the bottleneck is, and I'm working hard in my free time to get it upstream.

Here is a quick look at the dashboard we have. We can see which RPC call is being made the most, and we have labels for all RPC calls, so we can easily determine, as you can see in the graph, that there are a lot of Nova RPC calls about updating objects and such. For RPC processing time we can also look at how much time it takes to process certain RPC calls; in this case we can see that Nova takes a lot of time to schedule an instance, to find a place to put an instance.

So this is currently what we monitor for RabbitMQ and RPC calls in our OpenStack, and that's all I would like to share today. Here are some references: talks we gave before at OpenInfra Summits; I also used data from them in these slides, and some of those talks have more details on why we set RabbitMQ up this way and what issues we had before, so if you have time you can take a look at the video recordings. So, any questions?
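To make the alerting described above a bit more concrete, here is a minimal Prometheus alerting-rule sketch for the two alarms mentioned (high memory on a node, messages stacking up in queues). The metric names follow the style of the open source rabbitmq_exporter but can differ depending on which exporter you run, and the thresholds are placeholders rather than LINE's actual values:

    groups:
      - name: rabbitmq
        rules:
          - alert: RabbitMQHighMemory
            # node is approaching its configured memory limit
            expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit > 0.8
            for: 5m
            labels:
              severity: warning
          - alert: RabbitMQMessageBacklog
            # too many messages piling up in a single queue
            expr: sum by (queue) (rabbitmq_queue_messages) > 10000
            for: 10m
            labels:
              severity: warning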
Thank you a lot, Jin, it was really interesting to see how you run it, and to see the oslo.metrics graphs at the end, the other side of the work you've been pushing in Oslo recently. So let's open it up for questions and more general discussion on scaling RabbitMQ clusters. We can start with a question from Ederberg in the chat: do you want to ask the question directly, or do you want me to read it? I'll do it: any hints regarding the threshold to start splitting RabbitMQ clusters per service? So how do you identify the moment when you need to split RabbitMQ clusters per service, versus keeping a single RabbitMQ cluster?

Yeah, as far as I know we separate our RabbitMQ clusters, or scale them out, when the memory usage is high, or the CPU usage is high; usually the memory is the one we look at.

Does anyone else have other hints on the threshold to start splitting RabbitMQ clusters? This is meant to be an open discussion, so I don't want it to be too much of a Q&A; don't hesitate to unmute yourself and join.

On our side we mostly monitor the CPU usage, and it's usually based on that that we take the decision to either scale or change the hardware. So, CPU usage on the RabbitMQ hosts? Yeah.

I can give our example as well. In our case we monitor all these metrics, but we made the decision to split the RabbitMQs per OpenStack service, meaning that for Neutron we have one RabbitMQ, for Cinder we have a completely independent RabbitMQ, and it's exactly the same for every service. This means that the failure domain in our case is per service, not one RabbitMQ shared by all services.

So Dmitry, you are also saying that it depends on whether you want to have HA queues or not, and the general rule of thumb would be that if you do HA queues you need to split earlier, I guess. We can't hear you, Dmitry. It's not working for Dmitry, the mic seems to be malfunctioning; it sounds like a cat now, it's so cute. Oh, okay, I guess you hear me now, right? Yes. Yeah, so as I was trying to raise: with HA queues you kind of duplicate all messages in three, so you will hit the limits way earlier; but without them it might be harder to fail over, because eventually some queues might be missing on other nodes. And for RabbitMQ versions prior to something like 3.8 or 3.7 the failover wasn't good, and you might end up with a stuck RabbitMQ overall.

Okay, that's a good segue to the next question, from Nikita Karpin, about which version of RabbitMQ you're using at LINE, and whether you have any suggestions on the best RabbitMQ versions to pick; I know there are recent improvements in RabbitMQ that you might want to upgrade for. We are currently running 3.6.5, I think, which is fairly old, and I think there are a lot of fixes after 3.8, so it's better to upgrade to 3.8.something, but I haven't dug into it yet; we do have a plan to upgrade.

Anyone else on RabbitMQ versions? Yeah, at OVH we used 3.5 or 3.6 in the past, I don't remember which one, but we recently upgraded to 3.8.4 or 3.8.5, if I remember correctly, and we noticed a really good improvement. During this upgrade we also updated the RabbitMQ policies, so it's not only because we upgraded that we have better performance, but anyway it confirms that the newer versions are better than 3.6.
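For reference, the kind of HA (mirrored queue) policy being discussed here is normally applied per vhost with rabbitmqctl. A minimal sketch might look like the following, where the vhost and policy name are only illustrative and the pattern simply skips the amq.* internals, rather than reflecting any of the speakers' actual setups:

    # mirror every non-internal queue across all nodes and sync new mirrors automatically
    rabbitmqctl set_policy -p /neutron ha-all '^(?!amq\.).*' \
      '{"ha-mode": "all", "ha-sync-mode": "automatic"}' --apply-to queues

As Dmitry notes, mirroring multiplies the message volume by the number of mirrors, which is why clusters running HA queues tend to hit their limits earlier.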
In particular, one thing we had in the past, a bug that was very annoying, at least for orchestration: some queues were not bound to the correct exchange. The binding was deleted, or still present when you list the bindings but not working anymore, and the result was that messages were not routed to the correct queues, and we had no way to monitor that. I found out somewhere, I don't remember exactly where, that it may be related to a bug in the 3.6 version, so that's the main issue that made us upgrade to 3.8, and I must say that it's far better now.

Okay, so 3.6 at LINE, but 3.8 is solving a number of bugs that might justify the upgrade. And I have a question, but maybe you want to finish the chat first. Yeah, I'll go through the chat questions first. Have you tried scaling RabbitMQ to more data nodes, like instead of three, what if you increased it to five? I haven't tried it yet, because the CPU and memory usage look good right now, so we haven't tried it.

And a question from Seung Soo Cho on the version of oslo.messaging needed to use oslo.metrics. Yeah, I'm currently working on a patch upstream, and it kind of needs some work to get it done. So currently oslo.metrics is released, but it requires a patch to oslo.messaging that has not been merged yet; maybe Jin can post the link to the patch in progress in the chat, so that people who want to have a look at it, or maybe try it out on their oslo.messaging setup, can do that, but it's not in oslo.messaging yet.

Next question, from Benjamin Furman: what hardware do you have for the RabbitMQ nodes, is each cluster on dedicated nodes? Yeah, all the clusters are on dedicated physical nodes, and, let me just check, I think we have 128 gigs of RAM and 40 cores for each node, so we're using fairly large physical nodes for every data node. Is it the same for the management nodes as well, or is it specific? I think it's the same. Okay, so you have five big servers. Yes. Okay then, maybe somewhat overkill. Benjamin, are you running those on VMs or physical machines? We run those on virtual machines, and each cluster, for Neutron, because we only have a cluster for Neutron, has three nodes. I'm checking the specs; I think it's 64 gigs that we have, or 32, for each node. Okay, anyone else wants to chime in with their shiny server specs? It's bigger than what I have at home, just saying. So yeah, to finish: we have the nodes running on virtual machines, and these virtual machines have 64 gigs. Okay.

Then the next question: what do you use to deploy these multiple RabbitMQ clusters, and do you co-locate them with other services? So you don't co-locate them with other services, you said they were dedicated nodes, but what are you using to deploy them, are you using some deployment framework that is also deploying OpenStack, or is it completely separate? Yeah, we wrote our own Ansible scripts to deploy our OpenStack and RabbitMQ clusters, so it's basically in-house, but the RabbitMQ Ansible script doesn't do much: it basically just installs the RabbitMQ servers and gets the settings updated. I'm not sure why the video is not... oh, it's fine, interesting, yeah. Okay, so that we can see your face.

What else, anyone else with advice on how to deploy those RabbitMQ clusters, toolkits they are using? On our side at Ubisoft we're using Kolla-Ansible to deploy RabbitMQ, with all the different stuff. Okay, I can give our example as well: at CERN we are using Puppet, basically, to configure all the OpenStack infrastructure, and we use the upstream Puppet modules for basically everything, also for RabbitMQ. Yes, at OVH it's pretty much the same, we use Puppet as well.
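As an illustration of the "just install RabbitMQ and push the settings" style of in-house playbook described above, a minimal Ansible sketch could look like this; it is not LINE's actual playbook, and the package name, template and paths assume a Debian/Ubuntu style host:

    - hosts: rabbitmq
      become: true
      tasks:
        - name: Install RabbitMQ server
          ansible.builtin.apt:
            name: rabbitmq-server
            state: present
        - name: Push RabbitMQ settings
          ansible.builtin.template:
            src: rabbitmq.conf.j2
            dest: /etc/rabbitmq/rabbitmq.conf
          notify: Restart RabbitMQ
      handlers:
        - name: Restart RabbitMQ
          ansible.builtin.service:
            name: rabbitmq-server
            state: restarted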
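Coming back to the stale-binding bug OVH hit on 3.6: there is no built-in alert for it, but one way to spot-check is to list the bindings in the affected vhost and verify that each RPC queue is still bound to the exchange you expect (the vhost name here is only an example):

    rabbitmqctl list_bindings -p /neutron \
      source_name source_kind destination_name destination_kind routing_key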
Okay, next question, from Bren Zang: do you have some suggestions regarding the number of nodes, for example how many RabbitMQ nodes are needed? Maybe Bren can chime in if I misinterpreted the question. So Jin, you're saying you were running clusters with two management nodes and three data nodes, right? Yeah. As for how many nodes to run, as we answered before, we basically just look at the CPU and memory usage of the RabbitMQ nodes. Apart from that, if you find there are a lot of queued messages that are not acknowledged in your RabbitMQ nodes, you may have to take a look at the workers, or the consumers, for those queues; so if there are a lot of queued messages in your Nova queues, you may have to take a look at your nova-conductor and maybe increase the number of workers.

Okay, I got to the bottom of the questions in the chat, so, I don't know, you had another question? Yes, I was wondering how your OpenStack services, nova-compute for example, connect to the cluster: is it by adding a specific cluster, or the three node IPs, in nova.conf, or are you using some kind of HAProxy in the middle, or something like that? We have a hardware load balancer in front of the three data nodes. Okay, so Nova talks to this load balancer and it balances across the three data nodes, right? Yep. How are you doing it at OVH? We are putting the three node IPs in nova.conf for now, but it's not perfect, and I'm wondering if an HAProxy or some hardware load balancer in the middle could maybe help, for example when we need to do an action on one of the cluster nodes, because for now it's not very easy: we have to patch the configuration of Nova each time we need to replace one of the nodes, for example. How do you do that on your side? Exactly the same way you are doing it. Okay, so we all do it the same way, the node IPs in nova.conf.
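To make the two connection styles discussed above concrete, this is roughly what they look like in nova.conf; hostnames and credentials are placeholders, not anyone's real endpoints:

    [DEFAULT]
    # Option A: point the service at a load balancer VIP in front of the data nodes (LINE's approach)
    transport_url = rabbit://nova:SECRET@rabbit-vip.example.net:5672/
    # Option B: list the data nodes directly, as OVH and CERN described
    transport_url = rabbit://nova:SECRET@rabbit1:5672,nova:SECRET@rabbit2:5672,nova:SECRET@rabbit3:5672/

With option B, replacing a node means touching the configuration of every consuming service, which is exactly the pain point raised above.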
Okay, any other questions while we have time on scaling? Yes, you said you have 3,000 computes in your region, is it only one region, or two? The largest region is more than 2,000 compute nodes. Okay, so that's still quite big.

I also have a question. This is related to RabbitMQ, but also to the architecture slide that you showed, with the different services that you are running. You said that you have a RabbitMQ cluster for services like Keystone, so I guess that is more for notifications? I'm sorry, can you repeat that? The connection that you have between Keystone and RabbitMQ, or Glance and RabbitMQ, is only for the notifications? For what, sorry? For the notifications of these services. Oh yeah, right. So how are you then extracting these different notifications from RabbitMQ? Because it looks like everyone is doing something different, with different tools. I'm not familiar with this part, so... Okay, yeah, sorry, the question is not directly for you. I mean, general notifications, so it might even be limited, I guess, right? You need them for telemetry or something like this, so you could just use the noop driver for the notifications themselves, right? I'm not sure I understood what you said. I mean, there are notification queues just as there are RPC queues, and notifications are required mostly for telemetry, right? So you could just...

Well, all of them are interesting. What we are doing is extracting all of them from RabbitMQ and storing everything in Elasticsearch, because it's interesting to see what happened to a resource in OpenStack; but everyone is doing a different thing with them. And then what got me curious is that Jin was also showing that slide with Knative for functions as a service, and for functions as a service to work properly it's great if the notification system is integrated with Knative, or whatever tool for functions as a service you are using. Do you know how that is set up, Jin? I think those are running on top of our Kubernetes clusters, so I think it's kind of separate from our infrastructure-as-a-service there, and it's handled by another team. Okay, so it's just a different component that you run on the Kubernetes clusters for users, but it's not integrated at all with this notification system. Yep, thank you.

Okay, do we have another question on scaling RabbitMQ clusters, or have all the questions been answered? No, I still have one, thanks. You had a slide about your HA policy, but you are not setting anything about message TTL and queue TTL; do you have something related to that, or do you just keep the defaults? We have some changes for it, but I haven't dug that out yet. It's maybe a global question as well, if anyone has a good recommendation about that, because I'm pretty sure we are not doing it correctly. In the meantime I can explain what we do on our side: we set the queue TTL to one day and the message TTL to 12 hours. For me that's a lot, it's much more than what OpenStack expects, because if I'm correct most of the OpenStack timeouts are about minutes, 10 minutes or five minutes or something like that. Well, I was trying to check what the values are, but it's much less than that, definitely, so it's about minutes. Okay, yeah, I think it's five minutes, if I'm not wrong, five minutes for both the TTL of queues and messages. Yes. Okay, it makes sense actually, because if we take Nova as an example, the agent is always expecting to have an answer within the messaging timeout configuration, which is by default, I think, 300 seconds. So yeah, anyone else on the TTL settings? That would be a perfect thing to fill in on the wiki page about RabbitMQ configuration that we created in the large-scale meeting. Yes, it's definitely one question we should add. Can you write it down somewhere so that we don't forget to do that? Yeah, I will do that.

We have other questions in the chat. Pierre is asking: the fact that you use three data nodes and two management nodes for each RabbitMQ cluster, does that mean you need 15 RabbitMQ hosts per region, like five for Nova, five for Neutron, five for other services? Yeah, we currently run it at 15 per region; at least for the largest region we have 15 nodes. I haven't checked, but I think it's identical for all the regions in production. And how many hypervisors is that managing in the largest region? The largest one is more than 2,000. Okay, thank you.

Ederberg also had a question on the tuning of the RPC call time window, but it feels like it's been answered; did you have any other question around that, Ederberg? No, thank you, that's all. Thanks.
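For the TTL discussion above, queue expiry and per-message TTL are usually set through a policy as well, in milliseconds. A sketch with OVH's stated values (one day for queues, 12 hours for messages) would look roughly like this, with the vhost and policy name purely illustrative:

    rabbitmqctl set_policy -p /nova ttl-all '^(?!amq\.).*' \
      '{"expires": 86400000, "message-ttl": 43200000}' --apply-to queues

Note that only one policy applies to a given queue, so in practice TTL settings and HA settings end up in the same policy definition; and, as discussed, values closer to the RPC timeout (minutes rather than hours) may match better what OpenStack actually expects.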
So, another question from Bren Zang: how many nodes in a region? Well, that's what we discussed, 15 nodes. And how much pressure is on oslo.messaging, how many nodes in the region have you tested? That's probably the 2,000-node answer you just gave. Yes.

And finally, a question from Mark Heckman: what about the TCP backlog size on the RabbitMQ hosts? That one would be for us: we did some changes around that at OVH, I'm pretty sure, but I don't remember the numbers; maybe I can extract that later and write it down on the wiki as well, but we did some tuning on the TCP side for sure. I like how we ask a question and everyone starts typing furiously to see if they can access the configuration. Yeah, let's see who wins between Jin and Belmiro, they're all looking. Yeah, we'll definitely work on documenting that on the wiki. And I guess it's a good moment to explain what we do... So, I think I won. Go ahead.

I put in the chat a talk that I gave with Ricardo a few years ago at one of the OpenStack Summits. The talk is about our networking, but most of the slides describe our experience with the RabbitMQ cluster for Neutron and the configuration that we have there. Just looking at those slides, the TCP backlog that we have configured is 4096. The talk goes through, as we were adding nodes and migrating nodes from nova-network to Neutron and growing the number of hypervisors on Neutron, all the issues that we had with the RabbitMQ cluster. Today it's a little bit different, because we split our infrastructure into different regions, basically to avoid most of these issues, but the configuration behind it is exactly the same, just the size of the cluster has changed a little bit.
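For the TCP backlog value CERN mentions (4096), the tuning typically has two halves: the RabbitMQ listener option and the kernel limit that caps it. A sketch, with the value taken from the talk referenced above rather than as a recommendation:

    # /etc/rabbitmq/rabbitmq.conf (new-style config format)
    tcp_listen_options.backlog = 4096

    # the kernel also caps the accept queue, so raise it to match:
    # sysctl -w net.core.somaxconn=4096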
Okay, while people think about other questions, I'll just plug a bit of an advertisement for the Large Scale SIG. As I already explained at the start of the meeting, what the Large Scale SIG does is work to facilitate running OpenStack at large scale, mostly by answering the questions that operators have as they need to scale up and scale out their OpenStack deployments, but also by helping address the limitations that operators encounter in large OpenStack clusters. To do that, we work on documenting the journey from the start of your OpenStack deployment to scaling out to multiple regions or multiple cells, and, like Jin said, we also work on specific additions to OpenStack that help manage that scaling journey; oslo.metrics is a good example. It's slow moving, because there are not a lot of us working on it, but if you're interested it's definitely a good place for operators of OpenStack deployments to get involved with the project.

To give you a quick view of the output of the Large Scale SIG, we have this wiki page here that describes the scaling journey. Basically it goes from configuring the deployment to prepare for scale, to monitoring the addition of load on your cluster to detect the strain and the limits, then scaling up to a certain point, at which point you have to scale out, because you hit limitations in terms of scaling up a single cluster. For example, if you look at scaling out, we answer questions like when to do it, what the different options are, and what the advantages of regions versus cells are, which was the topic of our last video meeting. Obviously not all the questions have answers yet, but the goal is to try to answer them when we can, so that people who go through that journey of scaling up their OpenStack deployments are not alone asking questions that everyone else has already been through. The goal is to put that knowledge out there, because a lot of people have gone through that journey and have scaled their OpenStack deployments to massive scale, but a lot of people are at the start of this journey, and it's very intimidating; knowing that the path was traveled before, and that we have some answers, really helps.

Do we have new questions before we close? Jin is saying they did not modify the TCP backlog option, but it could be a good one to look into. Anyone else has questions? If not, we'll close early. Thank you all for joining us today. In two weeks we'll have a more traditional IRC meeting for the Large Scale SIG to discuss further improvements to those wiki pages, but also topics we could discuss in future editions of this video meeting. I feel like it was well attended again and very informative, so we'll definitely continue to run those in the future. Thank you all, and have a great rest of your Wednesday. Bye bye, thank you, bye, thank you, thank you very much, thank you, bye.