And we also have... so, the Large Scale SIG is described at this URL on the wiki. It's basically a group of operators of medium to large, or very, very large, OpenStack deployments. We are trying to facilitate moving from the initial scaling that you do, from tens of nodes to hundreds of nodes, to the next level: how do you get past those hundreds of nodes to thousands of nodes, and how do we document that so it's easier and less scary for operators to go from those initial tens of nodes to hundreds and thousands of nodes.

Specifically, we have three work streams. One is around pushing the limits of scaling within one cluster. As you add nodes to a single cluster, something will start failing at some point, and the question is when, what will fail first, and what we can do about it: document those limits, and push them back. The second work stream is around documenting large-scale operations. Sometimes our documentation is built around default values that make sense for all-in-ones or small deployments, but that are not suited to deployment at a larger scale. So identifying which configuration options you should tweak, and what kind of values you need at scale, is the goal of this second work stream. Here we also need a lot of input from the experience of other operators, because it's difficult to come up with good values at scale if you're only speaking from one standpoint. And finally, the last and most recent work stream for the SIG is around meaningful monitoring: how do we get to the point where we have good, actionable information being fed back from the system to the operator, so that we identify those scaling issues really early on.

So this session is about collecting stories: in your experience, as you added more nodes, what happened, what failed first, what did you have to tweak to get to the next level, and at which point did you basically have to scale out to multiple clusters because that was the only way to go further. As we collect those stories, we will hopefully see some common patterns, some common things that are likely to fail first. Is it RabbitMQ? Is it Neutron? And what did you do to push that back? If we have enough stories, we can document that.

This is an open session, so if you want to join in and participate, please do, because the goal is really to collect input from new people that are not part of the SIG yet, rather than just relying on the existing SIG members. But to kick off the discussion and try to set the tone, I'll turn it over to Arnaud first: OVH can give us some insight from their experience, as they were in the very early stages of scaling a single cluster, what ended up failing first and what did they do about it?

Yeah, thank you, Thierry. At OVH we have been deploying OpenStack for years now, and what failed first is Neutron. On our side, it's always Neutron. Why Neutron? Perhaps because we are using a custom driver, we are not using the pure upstream Open vSwitch driver, but I'm pretty sure it also affects the Open vSwitch driver. The effect is, for example, that when we decide to restart every Neutron agent on the infrastructure, it overloads the Neutron servers a lot. If we scale out the Neutron servers enough, that part is fine, but then it overloads the database, and if it's not the database, it overloads Nova, et cetera. So in the end it's always because of the Neutron agents.
So what we do to avoid that: first, we monitor very carefully the number of ports in BUILD state, because we saw a relation between the number of ports in BUILD in the database and the load on the Neutron side. If we have a lot of ports in BUILD status, then we will for sure have a very high load on the Neutron servers. We also monitor the load on the Neutron servers themselves and on the databases. Note that at OVH we used to deploy only one MySQL Galera cluster for all OpenStack services, for Nova and Neutron, so that is maybe one point which could be improved on our side to reduce the load, simply by splitting the Galera clusters. But it's not very easy to do when you are running live; in production, you have to work out how to move from one big cluster to separate clusters, and it's not easy. So if you have the choice, it may be a good idea to separate the databases from the beginning.

We usually don't go over 1,000 nodes per region because of this, because of the Neutron scaling issues. What more can I say? For the biggest regions we have, we usually try to have bigger servers for the databases and also a lot more RPC workers for Neutron, in order to avoid this overloading.

There is one more thing I can say about when it goes bad. When we restart every agent and that overloads Neutron, it will never come back up correctly on its own, because most of the time the Neutron agents are asking the Neutron server for information. They are syncing, basically, but they are waiting on the RPC timeout, which is by default, I think, 60 seconds. So we increased that to five minutes or so. But even after five minutes, sometimes the Neutron server is not able to answer the agents correctly, and if an agent does not receive a correct answer, it starts over. So it keeps starting over and over, which keeps the Neutron server overloaded, and we never get out of it; this situation is kind of a nightmare. So what we usually do is stop most of the Neutron agents and start them back very slowly, to make sure the Neutron load does not go above a threshold we decided on and the server stays able to answer the agents within the right time slot.

And since we decided not to go over 1K nodes, we scale out by multiplying the number of regions, which is good because we split the control planes per region, so if one region is down it does not affect the others. But it's bad because it implies a lot of regions to manage, and it's also complicated for customers to understand why they are on one region and not on another when those regions are essentially the same. So it's quite difficult for them to manage all of this. So yeah, I think I said everything.

That's very interesting, thank you. We'll see if that hits a nerve with other operators as well: if you had the same type of issues as Arnaud at OVH, please mention it when it's your turn.
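For reference, the "ports stuck in BUILD" signal Arnaud describes can be checked with a few lines of openstacksdk. This is a minimal sketch, assuming a clouds.yaml entry named "mycloud" and an arbitrary threshold; it is not OVH's actual tooling.

```python
# Minimal sketch: count Neutron ports stuck in BUILD, an early-warning signal
# for neutron-server overload. Cloud name and threshold are illustrative.
import openstack

conn = openstack.connect(cloud="mycloud")  # assumes a matching clouds.yaml entry

build_ports = list(conn.network.ports(status="BUILD"))
print(f"ports in BUILD: {len(build_ports)}")

THRESHOLD = 100  # purely illustrative; pick a value from your own baseline
if len(build_ports) > THRESHOLD:
    print("WARNING: neutron-server is likely falling behind on port binding")
```

A sustained rise in this count tends to precede the API-level symptoms, which is why it is worth alerting on it separately from raw server load.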
Maybe we can switch to Belmiro at CERN.

Hello everyone, can you hear me? Yeah. Yes. Okay, I'm in the office, so I need to wear a mask, sorry about that. So, our deployment... can you hear me, Belmiro? Yes. Yeah, we're good. Awesome. I had a question for Arnaud as well; should I wait for the end or can I ask it now? No, please ask it now.

Okay, so we ran into exactly the same issue with the RPC worker threads and we had to increase the count. But the question I had was, when we encountered this problem, we saw that RabbitMQ was getting really choked; it was continuously trying to close the connection. So my question is: did you have to tune RabbitMQ in any way, or were the only changes you had to make around the RPC timeout and RPC worker thread count, et cetera? Or was there anything around RabbitMQ?

Yeah, we also did some very basic tuning on the RabbitMQ side. We did that at the beginning, when we deployed our first OpenStack region. We also had issues with Rabbit, but not because of OpenStack itself; I think because of RabbitMQ bugs. We had some issues because of an old version of Rabbit, and by upgrading Rabbit to the latest version we managed to get something that works correctly. The thing about Rabbit is that usually, in the team, nobody is an expert on the Rabbit side, so it's usually hard to debug, because we are not used to managing RabbitMQ clusters. When it works, it's amazing, but when something is broken in Rabbit, you are kind of lost: what should I do, what is the starting point for debugging a Rabbit cluster? So year after year we built up some documentation, and we are now able to manage RabbitMQ completely based on the documentation we have internally and the tuning we did, but nothing fancy, just kernel TCP tuning and very basic RabbitMQ settings. And we did one more thing: if you check, there is a recent thread on the mailing list about RabbitMQ tuning, about RabbitMQ policies for example, and HA, how you enable HA only for some of the queues and things like that. We applied that on our side, and for now it's working quite correctly. Okay, yeah, thanks.

Yeah, sorry about that, Belmiro, can you start again? Yeah, it's fine. So, at CERN we started our deployment around eight years ago, when we started playing with OpenStack. From the beginning we believed that having everything in one cluster would not be enough for us, because we wanted to move all our nodes into OpenStack. So we started from the beginning exploring Nova cells, initially only two cells, very big cells, with more than 1,000 nodes per cell. But early on we observed that it was not a very good idea to have so many nodes in those two clusters, those two cells.

We have been learning over time. Initially we also had our control plane running on physical nodes; everything was clustered, everything was replicated. What we have been doing and learning over time is that, at least for our use case, we don't need all of this. We moved the control plane to run on virtual machines. We don't have a central database for all the services: for each service we have a different database server. For example, for Nova each cell has its own database server, because this allows us to have the different databases in different places and gives us isolation, at least per cell. And the same goes for all the other services like Ironic, Cinder, Keystone. As for clusters on the control plane, we don't use them. We only have one RabbitMQ per cell; the only thing that is clustered is the RabbitMQ for the top-level Nova cell, cell0, because it's basically the bottleneck for all the messages from all the cells, so only that one is clustered for Nova. For the other services, we do have a RabbitMQ cluster for Neutron because of the amount of messages, basically to spread the load between the different nodes in the RabbitMQ cluster. Smaller services like Cinder and Ironic don't need a clustered RabbitMQ at all, even if at the beginning, and even until recently, we were running them clustered.
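The mailing-list tuning Arnaud refers to above, enabling HA for only some of the queues, is one way to keep a clustered RabbitMQ (like the Neutron one Belmiro describes) manageable. The sketch below expresses that idea as a policy applied through the RabbitMQ management API; the endpoint, credentials, and the exact pattern (skipping transient reply_ and _fanout_ queues) are illustrative assumptions, not necessarily what OVH or CERN apply.

```python
# Hedged sketch: apply queue mirroring only to durable RPC queues, skipping
# transient reply_ and _fanout_ queues. Endpoint, credentials and ha-params
# are illustrative assumptions.
import requests

RABBIT_API = "http://rabbit.example.com:15672"  # assumed management endpoint
AUTH = ("admin", "secret")                      # assumed credentials
VHOST = "%2F"                                   # default vhost, URL-encoded

policy = {
    "pattern": r"^(?!amq\.)(?!.*_fanout_)(?!reply_).*",
    "definition": {"ha-mode": "exactly", "ha-params": 2,
                   "ha-sync-mode": "automatic"},
    "apply-to": "queues",
    "priority": 0,
}

resp = requests.put(f"{RABBIT_API}/api/policies/{VHOST}/ha-selective",
                    auth=AUTH, json=policy)
resp.raise_for_status()
```

The same policy can be set with rabbitmqctl set_policy; the usual reasoning is that mirroring short-lived reply and fanout queues adds a lot of RabbitMQ work for little benefit.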
So yeah, instead of trying to get these big clusters to work, we are trying to have a lot of cells, but with a very small number of nodes each. Currently we have around 200 compute nodes per cell, and for connections to the database and RabbitMQ at that size we don't observe any issue.

The issue that we observed when we started introducing Neutron into the infrastructure, because we started with nova-network a long time ago, was that we cannot apply this logic of Nova cells to Neutron. So exactly the same issues that OVH sees, we started seeing them. Initially we had only one region, so more than 8,000 compute nodes in one region, and as we were moving them into Neutron, reaching 3,000 or 4,000 compute nodes, we started seeing problems with Neutron. At that moment, what we decided to do was basically to split the infrastructure into regions, basically what OVH has, in this case to scale Neutron, because that way we have one Neutron instance per region. So now we have cells, and on top of the cells we have regions. Currently we have three regions, and the Neutron in each of those regions manages around 3,000 compute nodes. Also, maybe our use case is a little bit different, because we use a very simple Neutron driver, Linux bridge, even though it's very chatty like all the others. And the main issue that we observe is not really on the Neutron server side, but on the RabbitMQ side. When we have any issue with the RabbitMQ cluster, it's an operations nightmare, especially when we have some kind of power cut in one cell and all the agents reconnect at the same time: we get a message overflow on RabbitMQ, and that is very difficult to handle.

So, sorry, Belmiro, if I summarize correctly, you basically had the same type of issues, with Neutron and RabbitMQ being the reason why you pushed aggressively towards using regions as a way to scale further. Is that accurate?

For Neutron, yes. The only way to escape this is to have multiple Neutron installations, and for that you need regions. For other, smaller services, like Glance and Cinder, we only have one deployment that we use across all regions; we don't separate them.

So I think we have a common pattern here, with Neutron probably being the most difficult project to scale currently.

In terms of bare metal, we are trying to move all our nodes into Ironic. Initially, because with cells there was no other way to do it, we needed to have all the Ironic nodes in one cell, and we were observing that it was extremely slow to communicate with Ironic and to actually do operations in Ironic because of this. So we moved to conductor groups, and we applied that in Nova and Ironic. We will have a presentation today about this, and we also wrote a blog post with more information. And about the elephant in the room, which is RabbitMQ and Neutron: we have been playing with the different configuration options that we can set in Rabbit. We also gave a presentation two years ago in Vancouver about this; I put the video link in the etherpad if you're interested in all the configuration that we are currently using for RabbitMQ and Neutron. Thank you.
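For readers unfamiliar with the conductor groups Belmiro mentions, the mechanism is a pair of matching settings in Ironic and Nova. This is a rough sketch with an invented group name ("cell-a") and invented hostnames, written with configparser purely to show the option names; a real deployment tool would template these files instead.

```python
# Hedged sketch of conductor-group sharding: pin ironic-conductors to a group
# and point a set of nova-compute services at only that group.
import configparser

# ironic.conf on the conductors serving this shard
ironic = configparser.ConfigParser(interpolation=None)
ironic.read("/etc/ironic/ironic.conf")
if not ironic.has_section("conductor"):
    ironic.add_section("conductor")
ironic["conductor"]["conductor_group"] = "cell-a"  # illustrative group name
with open("/etc/ironic/ironic.conf", "w") as f:
    ironic.write(f)  # note: rewriting the file this way drops comments

# nova.conf on the nova-compute services driving this shard
nova = configparser.ConfigParser(interpolation=None)
nova.read("/etc/nova/nova.conf")
if not nova.has_section("ironic"):
    nova.add_section("ironic")
nova["ironic"]["partition_key"] = "cell-a"             # must match the group
nova["ironic"]["peer_list"] = "compute-a1,compute-a2"  # computes sharing it
with open("/etc/nova/nova.conf", "w") as f:
    nova.write(f)
```

Bare metal nodes then need to be assigned to the group (openstack baremetal node set --conductor-group cell-a <node>) so that each conductor and each nova-compute only handles its slice of the fleet.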
Is there any quick question for Belmiro before we move to the next one? I have a small question. Have you ever seen any issues with Cinder as well, with a high number of volumes being created, and quotas deadlocking when provisioning new volumes?

I don't remember seeing any issue with Cinder. Arne is connected, I don't know if he wants to comment, but for Cinder I don't remember this particular kind of issue. We had issues with deleting volumes, which is why we introduced the trash functionality into Cinder, where we basically do a quick deletion: we only mark the volume for deletion, and we have something in the background that cleans it up. So deletion was more of an issue; creation, I don't think we have ever seen this. We have used this trash functionality, which is based on the Ceph trash functionality, in production for, I don't know, two or three years, and it's working really well. It basically allows you to delete a volume right away, even a big volume, within a second or so. The quota is given back to the user, and the user can create a new volume while the deletion is happening in the background, so the user doesn't have to wait for the deletion to get the quota back. Cool, sounds good, thank you.

Okay, let's quickly move on to IBM's scaling story. I don't know who put that in. Divya, anyone, for the IBM scaling story?

Hi, sorry, I did not hear that, I'm sorry about that. Yes, so, well, we've been hitting multiple issues. For example, the first set of issues we hit was with Neutron, which is already covered. Along with Neutron, the second set of issues we hit were RabbitMQ related. So we made some tuning parameter changes: specifically, we increased the backlog and the timeouts, the backlog in RabbitMQ and the timeout in the different OpenStack configuration files, along with the RPC worker threads and timeout, et cetera. That kind of stabilized the Neutron situation, but we only hit it after some time running OpenStack. The thing is, we've been using OpenStack for quite a while, and it was with time that we first hit this issue, even though the number of nodes stayed similar. Our customers keep using it, and when they upgrade, we see that they don't have issues with, say, Queens, but they hit the issue as soon as they upgrade to Train.

And now we are testing with Ussuri. We test with 100 concurrent tests and move ahead from there, and we see a lot of issues with Keystone. Well, apart from the Neutron issue and RabbitMQ, the next set, before we even get to Keystone, is with MariaDB. When we run the concurrent tests, we often run into "database is locked" kinds of issues. This happens typically at around 100 concurrent tests or so. So that's when we went back and started looking at the MariaDB documentation and updating a few values; there's a lot of documentation on tuning parameters. We tried that and it seems to be working. But with Ussuri there have been a lot of problems with Keystone. Every other time we run into a "Keystone is temporarily unavailable" kind of error. Keystone internally is configured with memcached, so we often run into memcached socket timeout issues. For that, what we tried was, I think, a cache value in the memcached configuration which we increased, after which we didn't see that issue. But then again, we ran into Keystone issues.
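The memcached-related knobs on the Keystone side live in keystone.conf's [cache] section (oslo.cache). Below is a hedged sketch of the kind of values one might raise; the numbers are illustrative, not the values IBM actually used.

```python
# Hedged sketch: keystone.conf [cache] options related to memcached pooling
# and socket timeouts (oslo.cache). Values are illustrative only.
import configparser

cfg = configparser.ConfigParser(interpolation=None)
cfg.read("/etc/keystone/keystone.conf")
if not cfg.has_section("cache"):
    cfg.add_section("cache")
cfg["cache"]["enabled"] = "true"
cfg["cache"]["backend"] = "oslo_cache.memcache_pool"
cfg["cache"]["memcache_servers"] = "mc1:11211,mc2:11211"  # illustrative hosts
cfg["cache"]["memcache_socket_timeout"] = "5"    # seconds; default is lower
cfg["cache"]["memcache_pool_maxsize"] = "50"     # more pooled connections
cfg["cache"]["memcache_pool_unused_timeout"] = "60"
with open("/etc/keystone/keystone.conf", "w") as f:
    cfg.write(f)  # note: rewriting the file this way drops comments
```

Raising pool sizes here is exactly the kind of change that can push the pressure down a layer, so it pays to watch memcached and MariaDB together.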
We have Keystone running in the Apache web server with the default value, which is basically a single process. Across all previous releases we had not hit an issue with that, but with Ussuri we tried increasing it to 25, and now we are running with 50, with 400 concurrent VM deploys. And once we increased the Keystone processes to 50, we saw that we started hitting the MariaDB lock issue again, and then again we had to go back and fix MariaDB. So it seems to be a cycle: you change a value at one layer, and it starts showing up at the next layer, so everything has to be increased. Similarly, we also had a lot of issues with the open file limit, not limited to a particular service; we see it in RabbitMQ, Nova, et cetera. And then we have to go and change the open file limit at different levels, right from systemd to the actual service file, at various levels, so that it actually takes effect. Yeah, so these are the high-level problems: primarily a lot of problems with RabbitMQ, and now with MariaDB, and specifically in Ussuri with Keystone. And I don't even want to go into things like Nova, et cetera; I'm not sure how many deployments are using it this way, but we see a lot of performance issues with that as well.

Okay, any quick question for Divya? If not, we'll move to LINE. Do we have anyone from LINE to present their scaling story? Yeah, I will. Hey, Gene.

Yeah, so basically what we found out is that API call times gradually become slower, because when you scale out you have more users and more servers. So that's kind of a small issue when you scale out. The second thing we found is that the number of unacknowledged messages on Nova's RabbitMQ cluster becomes larger and larger as you scale out. If you don't change the number of control plane nodes or workers for nova-conductor, you will notice it, because it hits API response performance in a way that is not very nice. For most of the RabbitMQ cluster issues, we are currently using oslo.metrics to monitor RPC call issues, and there will be a talk by my colleagues, Redu-san and Motum-san, later today; if you are interested, please check it out. I'm currently working on upstreaming this part. Another issue that we found happens very frequently is that scheduling batch instance creations becomes very slow and times out. We observed this issue when scheduling around 150 instances onto around 1,300 hypervisors: the API would time out before the instances actually finished scheduling. So we made a small tweak inside our scheduler to improve the performance and temporarily solve this issue. And the last part is to be aware of database settings like max_pool_size or max_overflow, as we were getting complaints from our DBA teams about too many connections as we scaled out. Yeah, Redu-san, do you have anything to add?

Yeah, just like Divya-san said regarding the IBM scaling story, we also had some issues with open file limits and connections on the Neutron server. Because many compute nodes connect to a limited number of Neutron servers, we saw the number of connections and open sockets increase dramatically, and we did face some problems with the number of open files.
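The open-file-limit fixes mentioned by both Divya-san and Redu-san usually have to be applied in more than one place. When services run under systemd, one of those places is a drop-in like the sketch below; the unit name, path, and value are illustrative.

```python
# Hedged sketch: raise the open file limit for neutron-server via a systemd
# drop-in. Unit name and the 65536 value are illustrative; the same idea
# applies to nova, rabbitmq-server, and so on.
from pathlib import Path

dropin_dir = Path("/etc/systemd/system/neutron-server.service.d")
dropin_dir.mkdir(parents=True, exist_ok=True)
(dropin_dir / "limits.conf").write_text(
    "[Service]\n"
    "LimitNOFILE=65536\n"
)
# Afterwards: systemctl daemon-reload && systemctl restart neutron-server
```

Containerized deployments need the equivalent ulimit set on the container runtime instead, which is part of why the limit has to be chased through several layers before it actually takes effect.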
Okay, thank you all. Any questions for Redu or Gene? If not, we'll move on to the StackHPC story, I think it's John.

Hey, can you hear me okay? Yes. Excellent. So this is maybe a slightly different story; it's quite a specific scaling exercise, and I'm going to present on this later in the week, but the crux of it was: there's a big Slurm cluster, well, big-ish cluster, about 1,200 nodes, and the ask really was, how do we reimage all of these as quickly as possible? So rather than the number of nodes being the challenge, it's really how quickly we can rebuild them all. This was built using Kolla Ansible and Kayobe, just using all the defaults to start with, and then seeing what broke.

To start on one side, there's the networking. With Ironic we're using multi-tenant networking, so we wanted to reconfigure the switches. They were Cumulus switches, Mellanox switches running Cumulus, and to start with we were using Ansible Networking, but it was proving a bit slow. We moved to networking-generic-switch, which was better: it went from several minutes to reconfigure... well, we got it down to about 300 seconds to reconfigure the whole switch, which is still pretty slow. And in the end we had a look at batching up the commands to the switch inside NGS; that patch is up for review, well, it's in draft up for review, but it really sped things up. So then the networking wasn't so much of an issue.

Eventually we ended up hitting limits in the Ironic conductor. We were only running three Ironic conductors for this particular test; we really just wanted to see how far those three could go, and each is a single process. But with tuning... effectively I ended up finding out that the HAProxy logs were particularly interesting once you understood the crazy format. If you go and read the docs on the HAProxy logs, it turns out we were actually hitting connection timeout limits. Basically, as I understand it, it's the amount of time it takes from starting the connection and opening the socket to the client actually completing the request, and the Ironic conductor was struggling with this when it was under heavy load, I think simply because event-wise it was just busy doing other things. After the socket was opened, it would eventually get back to trying to write the rest of the request, at which point sometimes HAProxy or MariaDB had decided to kill the connection. So finding that seam of interesting timeouts I'd never discovered before got us around that problem.

I guess there are some interesting bits and pieces about moving to the direct deploy interface rather than the iSCSI one, although the Ironic community removing the iSCSI deploy interface is going to remove that problem for many people, since the decision won't be there anymore, which is all good. Historically we were just using the iSCSI deploy interface, but direct with HTTP seemed to work really well. Forcing raw images is on by default; turning that off made quite a big difference for this particular use case, because the QCOW2 image was much smaller than the raw image. So instead of transferring the raw image to all of the nodes, just transferring the QCOW2 image saved quite a lot of network bandwidth, and that certainly helped. There were some issues that Julia has commented on, but that was the dominant reason for the change. And the Ironic image cache, once the image was in the cache, seemed to work really well with the direct deploy. So yeah, there weren't too many tweaks, although I'd recommend learning the HAProxy log format if you start seeing timeouts, because that was really enlightening.
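The two Ironic changes John describes map to a couple of ironic.conf options. This is a hedged sketch of what they might look like; it is not a copy of StackHPC's configuration, and image conversion behaviour differs between releases, so check the docs for yours before copying it.

```python
# Hedged sketch: prefer the direct deploy interface and stop forcing images to
# raw, so the much smaller QCOW2 is what travels over the network.
import configparser

cfg = configparser.ConfigParser(interpolation=None)
cfg.read("/etc/ironic/ironic.conf")
cfg["DEFAULT"]["enabled_deploy_interfaces"] = "direct,iscsi"
cfg["DEFAULT"]["default_deploy_interface"] = "direct"
cfg["DEFAULT"]["force_raw_images"] = "false"  # keep QCOW2, convert on the node
with open("/etc/ironic/ironic.conf", "w") as f:
    cfg.write(f)  # note: rewriting the file this way drops comments
```

The trade-off is that with force_raw_images turned off, the node itself ends up doing the image conversion, so it needs enough memory or disk for that step; that is roughly the class of issues John alludes to.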
We've got Prometheus monitoring on, which told us what was happening to a certain extent, with node_exporter and cAdvisor, just all the standard Kolla Ansible stuff, but I think that was the breakthrough moment for me. Yeah, open to questions.

Any quick question for John? Here's a general one: is there anything you wish you had known before, some kind of revelation that you wish you had had before you started that scaling journey, that you now know and that seems pretty obvious?

I suppose the first thing I'd say I was glad I knew is that people like CERN and others were running Ironic at a larger scale than I was trying to, so I figured there was success at the end of the journey. That gave me confidence to keep plugging along, and I'd say that was very, very useful. I should also say I made a bit of use of the guru meditation report stuff, GMR; Ironic's got really nice docs on that, for what it's worth. It tells you all the events and threads that are happening in the process, which is sort of interesting, although in reality, once I read the logs, they told me what was happening. But yeah, it helped knowing that other people had trodden this path before, and I did also have a chat with Arne about some of these problems as we hit them, which is probably when I discovered we were doing more... I don't know, maybe more image deploys at one time than you were probably trying to do, but even so, it was good to have that.

And a last question to everyone that has talked: is there anything you wish you had known before starting, or a pro tip that you can pass on to the rest of the group? And otherwise, is there anyone else that has a story to share? There are like 53 people in the room, I bet there are other OpenStack operators; you don't have to put anything in the etherpad, just speak up.

I can just chime in on some Neutron stuff. I don't know if a lot of people have this use case, but we have a lot of VMs going up in parallel together, and we see a lot of issues, for example with Neutron being unable to allocate IPs, because it ends up deadlocked somewhere due to so many ports going up at once inside the same subnet. And we're talking about potentially 200 to 300 ports being created in parallel. It could be some weird thing that we're running into; I tried to investigate a little bit, and in some of the code that handles retries and things like that, it seems like Neutron has its own retry code for deadlocks, not using the one provided by oslo.db. That seems to be a historical thing, and I think that's part of why it's not handling this well. I pushed up a POC patch, but I never got around to pushing it through. I don't know if anybody else has any similar stories.

I did see that as well, actually, when I was doing this rebuild work, because obviously you create the provisioning ports at the same time. But that was only a hundred or so ports at a time, because we were throttling it. So I'd definitely be interested to see what happens. For what it's worth, the way I worked around it was horrendous: I increased the size of the subnet, it was a /16, and I increased the allocation pool. Yeah, which is something I ran into as well.
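For what it's worth, the "widen the allocation pool" workaround mentioned above can be done in place on an existing subnet. This is a hedged sketch with openstacksdk; the subnet name and ranges are illustrative, and it only buys headroom, it does not fix the underlying retry behaviour.

```python
# Hedged sketch: grow a subnet's allocation pool so highly parallel port
# creation has more room before IPAM contention bites. Names and ranges are
# illustrative and must stay within the subnet's CIDR.
import openstack

conn = openstack.connect(cloud="mycloud")  # assumes a clouds.yaml entry
subnet = conn.network.find_subnet("provisioning-subnet")
conn.network.update_subnet(
    subnet,
    allocation_pools=[{"start": "10.0.0.10", "end": "10.0.255.250"}],
)
```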
Okay, so we have five minutes left, and since the next session starts in just five minutes, we'll try to end early. I wanted to point to next steps.

We'll have a meeting during the PTG next week for the Large Scale SIG. So if you're interested in helping with the goals I explained earlier, for those who just joined: scaling within one cluster, which is why we're collecting those scaling stories; documenting large-scale operations, basically having a handbook of configuration values that make sense at larger scale; and starting to think about how we can improve meaningful monitoring, because one of the constants in all those stories is that it's not always obvious which parameter to watch for early failures. We want actionable, simpler monitoring rather than the wealth of things you could watch, trying to narrow it down to the golden signals we should track and how we can improve on that. All of that to make it easier for our users to get to a larger scale. If you're interested in helping with any of those, I encourage you to come and join us at the PTG next week. We have two sessions on Wednesday: one at 7 UTC, essentially for the Asia-Pacific and European time zones, and one at 16 UTC, which is more Europe- and US-friendly. So wherever you're located, you should find a time that actually works for you, and I encourage you to join us there so we can continue the discussion we started today. We'll also have regular SIG meetings that will resume in November, once the PTG and summit sessions are over. So that will be the next step, and I put a few links in the etherpad if you're interested in seeing where to go next. Please come next week at the PTG if you're interested in helping with the Large Scale SIG, or if you have a story to tell, and we'll continue the discussion there. Any other last-minute mention?

Hey, Thierry. Hello, Rico. Hey, I just have one last-minute question. I was just wondering, is anyone at large scale using Keystone federation, like in a multi-region setup? And is it good, or does it cause any problems? Can anyone comment? Using what, can you repeat? Keystone federation, federated Keystone. Yes. Not at OVH. Okay. I think CERN is using it, right? Not for the regions; we just have the one Keystone, not one per region. Okay. Okay, but if anyone is using it, I'm going to leave a message in the etherpad, so if anyone has any comments or feedback, please help us; we're just trying to figure out whether it works well. Yeah, I have been using it, but not at a massive scale. Keystone federation, that is. Yeah. I just wanted some answers. Maybe it will turn out to be like what LINE is facing, the large-scale user performance issue, but I'm not sure; I don't have a large-scale deployment to actually test it out. So, well, thanks.

Okay, so that concludes our session. There are two forum sessions starting in one minute, one on the OpenStackClient feature gap and the other on operational concerns for Manila, so please join those if you're interested in discussing them, and I'll see you around because I'll join one of them. Thanks everyone for sharing, and I hope I see a lot of you at the PTG next week. Okay, see you. Bye-bye.