Hello, good afternoon. My name is Gosha; don't bother trying to read my long last name. Today I would like to present the work which we did at Mirantis over the last two years, when we were trying to push the limits of OpenStack scale to some specific numbers: first it was around 400 nodes, then 1,000 nodes, and 5,000 nodes now.

At the beginning, at the time of the OpenStack Kilo release, we thought it would be relatively easy to do: just take whatever we have in OpenStack, allocate proper hardware resources, and it should just work. It looked easy, but it was very hard to do, and when we started we began to see a lot of issues which we had never thought about and never heard about. We created a scale lab with 300 nodes and tried to install OpenStack according to our Mirantis OpenStack reference architecture, and it didn't work. It didn't work due to various problems, various reasons most of you are probably already familiar with. One of the biggest problems, and one of the biggest challenges we had, was the fact that there were rumors, stories, fairy tales that somebody somewhere had done this, so it should work; but unfortunately there was no documented solution describing how it was done and when it happened. So we decided to create a team that would focus on pushing those scalability limits for OpenStack.

At the beginning it was just a Mirantis lab with 300 physical nodes and five engineers. Then Intel came in and offered hardware, actually two labs, which we could use to run our scale tests and do our work there. So thank you, Intel. We also decided to fix the initial problem we had: we decided to document everything we are doing and everything we did before. At the bottom there is a link to the official OpenStack documentation site, where you can find a lot of different information about scale, performance and reliability. All the work which we did is available for everybody.

So let's quickly talk about our lab: what do we have and how does it work? We have two different kinds of servers; it's all Intel-based hardware, so you can read the specs (probably not for Lenovo, sorry for that). These are just standard commodity servers which you can buy on the market, nothing special about them except some specific hardware parameters which we were unable to secure here. At the lab we use a standard spine-leaf network topology with Arista switches; the model numbers are here (not actual part numbers, but the models), so you can find them on the internet and find information about them. It's a pretty good and reliable network with pretty significant bandwidth and throughput available, so we could run not only scale testing but also data plane testing and find the performance characteristics of the data plane side of OpenStack: what happens if you use Neutron with OVS, what happens when you use
Neutron with OVS and DPDK, how scalable DVR is, and how well it works when you actually start sending traffic from the nodes.

As I mentioned, we had the so-called Mirantis OpenStack reference architecture, which was a pretty standard topology that we used for our deployments. It is basically three physical nodes which host all OpenStack services, the control plane parts, and we use additional components like Pacemaker and Corosync to make sure it works in an HA fashion, and so on and so forth. But for bigger scale, when we need thousands of nodes, we don't have the hardware for that: as you saw, the maximum size of our hardware lab was 500 physical nodes. So we decided to virtualize this lab. For that we use containers which run a hypervisor inside, and inside we run a full-blown, standard configuration of nova-compute with the Neutron OVS agent. It is a fully working OpenStack cloud, except for the fact that the compute nodes are actually virtual machines with nested virtualization. And all OpenStack services we put into containers, so that we can individually change the number of instances of each service we run, to accommodate the specific load on our cluster. I will go into details later, but that's just a high-level overview of what we have in our lab and how it works.

Let me step back and go through the history. This was our first attempt (actually the second one, but the first successful one) to run OpenStack at scale. Initially it was a relatively small scale, about 400 nodes, which you can still find in some specific companies that host big private clouds. What we did was simply use our MOS reference architecture and run specific Rally tests, which are, again, available in the upstream community, to simulate OpenStack user activity on this cloud. And again we faced specific problems. It was the time of the Kilo release, so Neutron was our first problem, and RabbitMQ, the usual suspect in all bottleneck stories. Fortunately for us we had several community leaders working at Mirantis: Davanum Srinivas (dims), whom you probably know very well, Jay Pipes, and Kevin Benton. They took those results from our lab, started to figure out what the problem was and where the bottleneck was, pushed those activities into the upstream community, and worked on finding and fixing the issues. Most of those issues were fixed, and that is why I'm giving this shout-out.

Once those problems were fixed, around the time of the Mitaka release, we were able to successfully run an OpenStack cloud at 400 nodes, and we started collecting the first numerical values which we could later compare between different runs. We started to build a portfolio of performance, scale and reliability metrics for each OpenStack service. You can actually rerun those tests, because they were done in the form of Rally scenarios: you can go to the performance documentation page, find all the links, reproduce those tests, and compare your numbers with our numbers to figure out whether there is a difference and where the problem might be if you have one.

Even in the Mitaka time frame, Neutron was the biggest problem for running OpenStack at scale.
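As an aside, the Rally scenarios we ran live in the upstream repositories, but conceptually they boil down to driving the OpenStack APIs the way a real user would: boot servers, wait for them, delete them, over and over. Here is a minimal sketch of that kind of load written directly against openstacksdk rather than Rally; the cloud name and the image, flavor and network IDs are placeholders, and real Rally scenarios add concurrency control, SLA checks and reporting on top of this.

```python
# Illustrative boot-and-delete loop: roughly what a Rally boot-and-delete
# scenario exercises, stripped of Rally's runners and reporting.
import openstack

conn = openstack.connect(cloud="scale-lab")   # placeholder clouds.yaml entry

IMAGE_ID = "REPLACE-WITH-IMAGE-UUID"          # placeholders, not real values
FLAVOR_ID = "REPLACE-WITH-FLAVOR-UUID"
NETWORK_ID = "REPLACE-WITH-NETWORK-UUID"

def boot_and_delete(name):
    # Ask Nova to create a server, wait until it goes ACTIVE, then delete it.
    server = conn.compute.create_server(
        name=name,
        image_id=IMAGE_ID,
        flavor_id=FLAVOR_ID,
        networks=[{"uuid": NETWORK_ID}],
    )
    server = conn.compute.wait_for_server(server)
    conn.compute.delete_server(server)
    conn.compute.wait_for_delete(server)

for i in range(10):
    boot_and_delete(f"scale-test-{i}")
```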
That was one of the reasons why a lot of hardware vendors and SDN vendors were very successful at customer sites pushing their products instead of the standard Neutron with OVS and DVR. There were specific problems inside the Neutron architecture which had to be fixed and changed, and those fixes were done upstream. Those problems no longer exist, and you can find the history of what the problems were and how they were fixed in the OpenStack review queue; you have a link here.

Okay, so after we did this 400-node run we figured out that yes, we technically can push the boundaries of OpenStack and start moving to bigger scale. At that time it was again the Mirantis OpenStack reference architecture, which was good for small scale but never worked at 1,000-node scale, and the biggest problem we faced was the monolithic architecture of our controller nodes. When you go to bigger scale it is simply not possible to fit all those OpenStack components onto a single piece of hardware. You need to figure out how to place them properly in order to handle all the payload which is generated by OpenStack components like the nova-compute nodes and the Neutron agents; those guys are pretty chatty, and they generate a lot of traffic just to sustain the operational state of an OpenStack cloud.

At this time you probably saw those interesting projects being built around containers. Kolla-Mesos was our first attempt to figure out how we could reuse whatever containers were available at the time in OpenStack, and how we could use them effectively in our lab to quickly change the configuration and layout of services. We created this Kolla-Mesos project, which was moderately successful, but it was never intended for production use; it was a science project to make sure we had some tools in our lab to manage OpenStack services individually. Later, Fuel CCP created a containerized control plane which tried to address the problems and the limits we had faced with the Mesos framework. So we switched again to upstream and created this Fuel CCP project to build containers for the control plane and place them properly, this time on top of Kubernetes. And MCP, which is our current Mirantis product, also explores and uses the same approach of flexibly placing OpenStack components across the hardware nodes, although it uses a different implementation.

Once we had this framework and tooling for service placement, the next problem appeared: we didn't know how to properly place the services. There were no capacity rules, and we had to do everything from scratch. Frankly, it was try, fail, learn, and then get to success. We settled on two specific approaches. First of all, we tried to create hardware resource profiles for each OpenStack service, so that you don't need to run a cloud in order to plan your capacity or design the architecture for your cloud.
As soon as you know how much hardware (specifically CPU and memory) a given OpenStack service will use, you can do simple math and figure out how you will distribute the services. For that, thanks go to Clint Byrum, who came to our Performance Working Group IRC channel and suggested using the fake Nova driver, so that we can emulate most of the OpenStack control plane activity without using a lot of hardware resources to run VMs with workloads. For control plane measurements you don't need to run actual VMs, but you do need to simulate all the activity and traffic patterns that appear when you schedule a VM to be created on a nova-compute node.

Here are brief highlights of what those profiles look like. This was the Liberty time frame, when we were running a 1,000-node cluster in the form of containers on top of Kubernetes. What we did was actually a very simple test; this is by no stretch a production layout, it was a very simplistic way to measure the hardware resource usage patterns of each OpenStack component. We took exactly one instance of each service, put them on a dedicated physical machine, ran 1,000 fake compute nodes, and started our Rally tests to load the OpenStack cloud with workloads, generating a lot of requests to create VMs. By the end we had simulated the situation where you have an OpenStack cloud with 20,000 VMs running on top of it. Obviously, in this test those VMs were fake (they didn't exist), but all the mechanics that happen when you create a VM were very well preserved, and we were able to measure all that resource usage.

You might ask about Nova scheduler: why do we have eight instances of it? Because it turns out that Nova scheduler is a bottleneck, such a bottleneck that you cannot saturate a single Nova API service when you have only one Nova scheduler. The scheduler works so slowly that Nova API just sits and waits while the scheduler figures out where to place the VM. So we had to increase the number of schedulers just to make sure we could push more load onto the other OpenStack services.

Okay, and now we get to the most interesting part: is it possible to push the limits to 5,000 nodes? We had done this exercise at 1,000 nodes, so we knew we could do it and we knew how we would do it. But this time we decided to move away from those fake environments where we didn't have actual VMs, to a more realistic scenario where you have actual nova-compute nodes which are able to run some VMs.
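Two practical notes on reproducing this. First, the fake compute nodes come from Nova's own fake virt driver (compute_driver = fake.FakeDriver in nova.conf), which keeps all the RPC and scheduling mechanics but never starts a real hypervisor. Second, once you have per-service resource profiles, the "simple math" really is simple; here is a sketch of it in Python, with placeholder numbers rather than our measured profiles.

```python
# Back-of-envelope controller sizing from per-service resource profiles.
# Every number below is a placeholder; substitute the profiles you measure
# in your own lab (see the performance documentation linked earlier).

SERVICE_PROFILES = {
    # service: (cpu_cores_per_instance, memory_gb_per_instance, instances)
    "nova-api":       (2.0, 4.0, 1),
    "nova-conductor": (2.0, 2.0, 1),
    "nova-scheduler": (1.0, 1.0, 8),   # the bottleneck, hence several copies
    "neutron-server": (4.0, 8.0, 1),
    "rabbitmq":       (8.0, 16.0, 1),
    "mysql":          (8.0, 16.0, 1),
}

def control_plane_totals(profiles):
    # Sum CPU and memory across all instances of all services.
    cpu = sum(cores * count for cores, _, count in profiles.values())
    mem = sum(gb * count for _, gb, count in profiles.values())
    return cpu, mem

cpu_needed, mem_needed = control_plane_totals(SERVICE_PROFILES)
print(f"control plane needs roughly {cpu_needed} cores and {mem_needed} GB RAM")
# From here you decide which services fit together on which physical node,
# instead of dumping everything onto one controller and hoping it fits.
```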
We used nested virtualization for that, so it's probably not a real production environment, but it was close enough that we could emulate all the behaviors you can meet in your production environment. And again, a Kubernetes environment was used to control all those containers with nova-compute and with the OpenStack services.

Because all those nova-computes are now containers which are part of a replication controller in Kubernetes, we have a lot of flexibility in our lab: we can easily change the number of computes without redeploying the whole cloud. So we specifically ran with several different numbers of compute nodes to figure out the behavior of the services and their resource usage as the number of compute nodes changes.

As you can see, memory consumption is one of the biggest problems for our well-known RabbitMQ service. RabbitMQ is really greedy for hardware resources, both memory and CPU, and at some specific number of compute nodes you will see that your RabbitMQ instance no longer fits within the available capacity of the physical server which hosts it. So there are definite scale boundaries which we can now predict based on our results: when we have specific hardware allocated for our components, we can deduce the optimal, or the maximum, number of compute nodes that can be supported by that hardware.

And bottlenecks: yes. The first suspect, and the first problem which you will definitely face when you run OpenStack at scale, is RabbitMQ. There are multiple problems with the way RabbitMQ is used by OpenStack. First of all, all nova-computes use the message queue to communicate with nova-conductor and other OpenStack components. As a result, the more compute nodes you have, the more connections you have to RabbitMQ, and when we ran the 5,000-node cluster we had about 40,000 connections to RabbitMQ. RabbitMQ does not handle that many connections very well, because those components send a lot of traffic. And it's not only about the number of connections, it's also about the size of the messages, because RabbitMQ has to store messages on the hard drive for reliability purposes, and if you set the TTL (time to live) on your queues very high, you will end up with a server which just sits there managing files on its hard drives, saving all those messages.

So there are a few tricks which we had to apply in order to have a working OpenStack cloud with over 5,000 nodes. Yes, we were able to run it successfully, so that we could run Rally tests which simulate user activity: creating VMs, creating Cinder volumes, uploading images. You can do this at 5,000 nodes, but you have to tweak several components.
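That 40,000-connection figure is easy to sanity-check: every compute node runs nova-compute plus one or more Neutron agents, and each of those processes holds several AMQP connections. The per-process counts in the sketch below are assumptions for illustration, not our measured values; on a real deployment you would count them with something like rabbitmqctl list_connections.

```python
# Rough estimate of how many AMQP connections RabbitMQ must hold at scale.
# The per-process connection counts are illustrative assumptions; measure
# your own deployment before relying on any of these numbers.

CONNECTIONS_PER_COMPUTE_NODE = {
    "nova-compute":      3,  # assumed: RPC client, RPC server, notifications
    "neutron-ovs-agent": 3,  # assumed
    "neutron-l3-agent":  2,  # assumed; only on nodes that run it (e.g. DVR)
}

def estimated_connections(compute_nodes, control_plane_overhead=500):
    # Connections from compute nodes plus a flat allowance for controllers.
    per_node = sum(CONNECTIONS_PER_COMPUTE_NODE.values())
    return compute_nodes * per_node + control_plane_overhead

for nodes in (1000, 5000, 10000):
    print(f"{nodes:>6} compute nodes -> ~{estimated_connections(nodes)} connections")
```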
Here are some details of what we did. We used a RabbitMQ cluster with a load balancer in front of it, so that we can distribute all those connections across multiple physical nodes. We decreased the message TTL to make sure we do not choke RabbitMQ with a large number of messages that it would otherwise save and keep on its hard drives. And at some point we had to set up two dedicated RabbitMQ clusters, one specifically for Neutron and one for Nova, just to make sure those two services do not affect each other.

The database was the second problem, and the second bottleneck which we found at this scale. If you run a lot of OpenStack components, and specifically for Nova, there is the notion of workers: you have a service instance, but this instance actually spawns a specific number of workers, and each worker has its own dedicated database connection pool. As a result, if you are not careful with those configuration parameters (the number of workers and the size of the database pools), you will end up in a situation where your database has literally tens of thousands of connections, and MySQL specifically does not handle that situation very well. You will see a lot of delays, a lot of connection problems, and a lot of lock contention, so the overall performance of the OpenStack services will degrade, because they will just sit and wait for database operations. So please be careful with those numbers.

And last but not least, Nova scheduler: it still doesn't work well, unfortunately. I know there are several initiatives running right now in OpenStack Nova to fix this specific problem with scheduling. If you run a single instance of the scheduler (which is, unfortunately, the preferred way to run it), you will not be able to successfully spawn VMs on 5,000 nodes. It will be so slow that Nova API will start retrying operations, because the previous one will not finish within three minutes, which is the standard timeout in Nova API. The workaround is to run multiple schedulers, but because of the Nova scheduler architecture there may be issues with specific placement rules and specific filters which rely on knowing the exact situation and the exact resource allocation on your cluster. That does not hold when you have multiple schedulers, because they each look at the OpenStack resource pools individually, without actually communicating with each other.
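To make the database point concrete before I get to a story about the scheduler: the total number of MySQL connections is roughly the number of services, times the workers per service, times the size each worker's SQLAlchemy pool can grow to. The worker counts and pool sizes below are examples only (check your own osapi_compute_workers and [database] max_pool_size / max_overflow settings); the multiplication itself is the point.

```python
# How worker counts multiply into MySQL connections.
# Example values only; not recommendations and not our lab configuration.

def worst_case_connections(workers, max_pool_size, max_overflow):
    # Each worker keeps its own SQLAlchemy pool, which under load can grow
    # up to max_pool_size + max_overflow connections.
    return workers * (max_pool_size + max_overflow)

services = {
    # service: (workers, max_pool_size, max_overflow)
    "nova-api":       (32, 10, 20),
    "nova-conductor": (32, 10, 20),
    "neutron-server": (32, 10, 20),
}

total = sum(worst_case_connections(*cfg) for cfg in services.values())
print(f"worst case MySQL connections: {total}")   # 2880 with these numbers
```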
We had a very nasty problem in one of our customer accounts, with a specific use case where we had to place huge VMs on specific, dedicated hardware. We ended up in a situation where multiple schedulers were trying to put a VM onto the same host, and those placement attempts kept failing. It was a Heat template which was trying to place 300 VMs on 300 physical nodes: the first VM was created, and all the other VMs were not, because each scheduler, not knowing that a specific hardware node had already been allocated by another scheduler, just sat there trying to reschedule its VM onto an already-allocated physical host. So that can definitely be a problem when you run multiple schedulers: in specific configurations and specific use cases it will not work.

So, as a summary (and we have ten minutes left before the end, that's good), I can say that yes, you can technically run an OpenStack cluster with 5,000 nodes. It will work with standard, vanilla OpenStack components, including Neutron with DVR, which is, I would say, the only option for your networking at this scale. Nova will work, but you need to figure out how to work around the problem with the scheduler. Shared services, again, you can work around and scale individually, or split between OpenStack services and run multiple RabbitMQ clusters or databases. So it is all possible: you can successfully run OpenStack at 5,000 nodes, and we did it. Now we are looking at the next step, pushing those boundaries to ten thousand nodes and figuring out whether the OpenStack architecture will actually support such scale or not. So probably see you next year with new data and new numbers. Thank you.

Questions, please.

Yes, if you are talking about the sessions this morning, I don't know. Yes, so I actually published a link: on docs.openstack.org there is the performance team documentation. We have specific numbers for different sizes of clouds, so yes, there are numbers for 150, for 200, for 350; there are different numbers which you can use, and there is much more detailed information about the tests and results, different usage patterns and data for MySQL and RabbitMQ and how they behave. So yes, you can find them online. I don't think we had a session in the previous year about this.

Yes, it's all there. If you go there you will find a very detailed report and you can work it out, because as for generic guidelines: yes, you need to figure out your usage pattern, how many VMs you create and how frequently you create them, and based on that you can figure out the rate of requests to your APIs. Based on our resource usage profiles you can then say: okay, if I have Nova API under the pressure of user requests at a rate of two requests per second, here is the response time from this API and here is the resource usage. From that you figure out which hardware to allocate for each OpenStack service and how to place them across different hardware. The biggest advice is to get rid of the monolithic architecture where you just dump everything onto one host and then try to figure out why it doesn't work.

Yes. Yes, we tried that. You can run RabbitMQ as a cluster, and especially if you use an active-active configuration, where you have a load balancer in front of it and spread different connections to different physical servers, yes, it will work. But there are problems with RabbitMQ itself, especially with clustering.
What we saw at 5,000 nodes is that you have a cluster which is under high pressure from incoming messages; it has to handle a lot of messages. First of all, if you use mirrored queues, you have to replicate that traffic across multiple RabbitMQ servers, so you will have networking problems, and you need to design this carefully. Another problem: if a single node fails, your cluster will fail, and you will not be able to restore it, simply because of the amount of payload. RabbitMQ will probably be able to recover within a few hours, but your Neutron server will die, will lose all its agents, and will not work. So you will have a disaster situation. It's not even an HA problem at that point, it's a capacity problem: your components will not be able to handle the situation of losing one node. There is an effort in the OpenStack community to replace RabbitMQ with ZeroMQ, but that has its own problems at scale.

Sorry, were you using the filter scheduler or the caching scheduler? Say it again. We were using the filter scheduler, yes; we used filters. Did you try the caching scheduler? I don't think so.

Okay, and what was the latest release that you had test results on? So the 5,000-node run was actually on Ocata; it was done a few months ago.

Okay. Have you started looking at Cells V2 to shard the MQ and database issues? Could you please repeat? Well, I guess we don't have multiple cells yet, but yes, Cells V2 is this thing; you can shard the deployment. I would say, frankly, if you're going to run a five-thousand-node cluster, I would go with the cells approach and run cells instead of having a single control plane. We're actually looking, yes, to do scale testing for Pike. I would say that will be our next step: take ten thousand nodes, run Cells V2, and figure out how it works. Sure, tell Jay. Yes, yes, we're working with Jay every day.

I wonder, first, I run the Mirantis distribution, and I've seen that when you guys moved from 8.0 to 9.0, the RabbitMQ messages, I mean the notifications, became much less detailed.
Is that because of what you just showed? And if it is, my question is this: the notifications are important because the environment is dynamic and you need to know that someone just changed the name of an instance, right? So at which point do you recommend the detail of the notifications to be reduced?

Yes, so right now we are talking about a very specific component, which is called Ceilometer. The reason those notifications were changed was the fact that Ceilometer is not able to handle them. We ran our tests in our lab and figured out the optimal parameters and the optimal sizes for the very specific use case of running Ceilometer in your environment. With Ceilometer, at least in the configuration which we used, we were unable to use it effectively even on 200 nodes. So you will see those different options and configurations which you can tweak, and it will depend on your node count: if you rely on Ceilometer specifically, 50 nodes is actually your most efficient environment.

When you say different configurations, do you mean in Fuel, when you choose Ceilometer, you will have different configuration options? So there are Fuel plugins; we replaced some of the components of Ceilometer. There is a so-called telemetry plugin, which is a replacement of Ceilometer, and we have two versions of that plugin: one plugin uses MongoDB as the back-end storage, and the new one uses Kafka plus something, I don't remember, there is a new storage back-end in Ceilometer, Panko, something like that. Panko, yes. So based on whatever you have, you will be able to tweak them. I would say, carefully, so don't take my words right now as a given: you should be able to run the telemetry plugin with Kafka and Panko at 1,000-node scale. Based on our estimations and simulations, that architecture should be able to sustain this. We never tested it in a real, production-like environment, so that is probably the next step in our work.

Do we have a next speaker in this room? Another question?

Thanks for the presentation. I see this as a very academic exercise, but do you realistically think that people should run 5,000 nodes with this architecture, or is it just to find out the results? Right now, 5,000 nodes behind only one control plane looks very academic to me.

In some ways, yes, it is very academic, because this is a test lab and a simulation, so it is a science project. But MCP, which is our product, can actually deploy the same architecture: you can use Kubernetes, place OpenStack controllers as pods in Kubernetes, and individually scale them. How practical that is, I don't know; it will depend on the use case and on how you plan to use your production environment. If you don't need to run a specific number of computes, like a thousand nodes, in one data center, then you probably don't need this architecture: you can run multiple small data centers and use federation. But there are specific use cases where our customers asked to run a 1,000-node cluster with a single control plane.

Great presentation, and I have a question about Ceilometer. Since Ceilometer has the capability to disable some of its metering elements, has there been any consideration of disabling some of those unused meters, so that the performance can be better? It could be. So frankly, I'm not an expert in Ceilometer.
Yes, I heard about that, and my understanding is that yes, it's possible to do something there. I think that is the right direction, to figure out how to efficiently utilize whatever shared services you use. I remember that in previous versions of OpenStack the services were very chatty; they sent a lot of notifications even though nobody actually read them. So yes, if you carefully tune this part and reduce the amount of messages that are generated, then definitely, yes, you can improve the performance and scalability of the environment.

And to extend on that: is this a direction that you guys are trying, or would like to try? We actually did. If you go through this OpenStack documentation site, there are specific tests around Ceilometer and the message queue. We used the oslo.messaging simulator to emulate those workloads on RabbitMQ, so we did this profiling of RabbitMQ so that we know when it starts to fail, and we also ran telemetry and did some tests for it as well. Those results are available there. And we will definitely continue to do so; telemetry is one of the components which is interesting for our customers, as they use it, so we will definitely push this activity forward as well. Thanks, I'll check it out. Thanks. Thank you.

Okay, any other questions? Otherwise, thank you guys. Thank you.