Hello, everybody. Welcome to this session, "Lessons Learned from a Large-Scale OpenStack Deployment with TripleO." Let's jump into the session.

A little bit about myself: I'm Sai Sindhur Malleni. I work as a team lead in the Red Hat Performance and Scale R&D division, focused on private cloud technologies like OpenStack as well as hybrid cloud technologies like OpenShift on bare metal. I'm based out of the greater Boston area in the US, and in my free time I enjoy traveling and hiking. Over to my friend and colleague, Pradipta.

Hi, everyone. I'm Pradipta Kumar Sahoo, working as a senior software engineer on the Performance and Scale team at Red Hat, focusing on technologies like OpenStack compute and NFV. I'm based out of Pune, India, and I love traveling, food, and watching movies.

All right. So, as we mentioned, we are part of the Performance and Scale team. What do we do? We are on a mission to make sure that Red Hat OpenStack Platform is the most scalable and best-performing OpenStack distribution out there. And because Red Hat is an upstream-first company, by doing so we also ensure that all of our code and fixes land upstream and the community benefits from them, so end users overall have a better experience with OpenStack at scale.

Over the past few years, in repeatedly talking to customers, we've seen that OpenStack has matured to a point where it has a lot of functionality, and customers are no longer really worried about functionality. Instead, they're starting to expand their footprints in terms of the number of nodes per cluster and so on, and they're really looking for something that scales. Scalability and performance have repeatedly come up as priorities for our customers.

When we talk about platform scalability, which is OpenStack scalability, the first step to getting there is having an installer that is scalable. I think we can all agree that we can't have a platform that scales without an installer that scales and that is able to deploy a scalable platform in the first place. So over the past few years and releases, we've been laser-focused on making sure that OpenStack scales to node counts that had not been tested before.

Where do we do all of this testing? Initially we were using partner labs to get to these high node counts, because it's not easy to have hundreds and hundreds of bare metal nodes for this testing. But starting in 2016, we realized that to shorten the feedback loop between scale testing and product development itself, we needed to build these capabilities in-house. So we invested time, energy, and money into building what we call the Red Hat Scale Lab: a lab with in excess of 900 bare metal servers, a combination of Dell and Supermicro nodes. This lab supports scale testing of all of Red Hat's products, including but not limited to OpenStack and, of course, OpenShift. The lab has really helped us accelerate our testing and our supported node counts, and it has had a positive impact on both upstream and downstream OpenStack scalability and performance, because this is where we do all of our testing; it's basically our playground.

So, because this talk is about building scalable OpenStack clouds with TripleO, let's take a quick overview of what TripleO is.
And while doing that, let's all acknowledge that OpenStack deployments are not the most straightforward thing to do, because when you deploy OpenStack you're not just installing software; you're deploying and architecting your cloud platform. TripleO is a tool that helps you deploy, update, and manage the lifecycle of your production-grade OpenStack cloud.

There are two important concepts in TripleO. You have something called the undercloud. Think of it like a jump host, but it's really an all-in-one OpenStack installation that has services like Nova, Neutron, and Heat to deploy and manage what we call the overcloud, which is your actual production cloud where you run user workloads. Architecting and deploying the undercloud, and making sure the undercloud scales, is an important part of being able to deploy a scalable overcloud.

Here is a quick snapshot of the TripleO scale testing we've done over the past few releases and years. Up until the Pike release, we hovered pretty much around the 300-node count, and up until Newton we were relying on external partner labs. But you can see that starting with Pike, we moved all of our testing to our internal Scale Lab, and over the past couple of releases we've achieved in excess of 500 compute nodes per cluster. Specifically with Train, which is the upstream release behind OpenStack Platform 16.1, a long-life release, we've achieved in excess of 700 compute nodes, all part of the same cluster.

So let's look at the scale testing we've done for our most recent release, Train. I'll give a quick overview of the deployment before passing it off to Pradipta to take us forward. The deployment methodology, as we've been discussing, is TripleO; we're using TripleO as the installer. We have one bare metal undercloud node. We have three monolithic controllers that run all of OpenStack's control plane services, such as the APIs and the Nova scheduler, in an HA fashion. We have three Ceph storage nodes with four disks each, and 712 compute nodes overall. Obviously we didn't deploy all 712 compute nodes at once; we slowly scaled up to 712 compute nodes over a period of a couple of weeks. The Ceph storage cluster itself is pretty small because we're not really trying to prove Ceph scale here; we're more focused on the Ceph client side, which is the compute node, and on how many Ceph clients this small Ceph cluster can handle.

The deployment was through Ironic and PXE. We had 14 composable roles overall: one for the controllers, one for the Ceph storage nodes, and 12 different composable roles for the compute nodes, because those compute nodes came from different manufacturers with different network configurations, so we really needed composable roles. In a way we were even scale testing the number of composable roles used in a deployment. Like I said, at the end of this exercise we had about 712 Ceph clients.

Looking at the undercloud and controller specs, these are Skylake machines with 32 cores and hyper-threading, so 64 logical cores. The controllers had a little more memory than the undercloud, but all of these were pretty beefy machines. Looking at the Ceph storage specs, we have the same Skylake machines, but in this case with four 3TB NVMe drives for the OSDs. I will hand it off to Pradipta to take us forward from here. Thank you, Sai.
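To make the composable-roles idea a bit more concrete, here is a minimal, hypothetical sketch of how per-hardware compute roles can be generated and wired into the deploy command. The role names, paths, and environment files are illustrative rather than the exact ones used in this deployment, and it assumes the custom role definitions were first created by copying the stock Compute role definition under new names in the roles directory.

```bash
# Hypothetical sketch only: role names and paths are illustrative.
# Assumes custom role files (e.g. ComputeDell.yaml, ComputeSupermicro.yaml)
# were created by copying the stock Compute role definition.
openstack overcloud roles generate \
  --roles-path ~/templates/roles \
  -o ~/templates/roles_data.yaml \
  Controller CephStorage ComputeDell ComputeSupermicro

# The generated roles file is then passed to the overcloud deploy command
# along with environment files that set the node count per role.
openstack overcloud deploy --templates \
  -r ~/templates/roles_data.yaml \
  -e ~/templates/node-counts.yaml \
  -e ~/templates/network-environment.yaml
```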
Yeah, I think Sai has covered the deployment topology. Coming to the software specification: the OpenStack version we are using is the Train release, and the deployment methodology is TripleO, based on the Ansible-driven config-download method. For OpenStack performance testing we use Browbeat and Rally, which have various scenarios for measuring OpenStack tenant workloads. For end-to-end resource monitoring of the undercloud and overcloud, we are using Grafana and collectd, installed on the undercloud and overcloud nodes. For the OpenStack Train installation, the base operating system is Red Hat Enterprise Linux 8.2. The networking backend in Train is OVN, which is the default backend. For the OpenStack control plane, we are using a monolithic control plane of three controllers with high availability. On both the undercloud and the overcloud, all of the OpenStack services are containerized in the Train release, and all of the containers are managed by Podman. Next slide, please.

Coming to the deployment workflow: as Sai already highlighted, TripleO is built on the basic principle of OpenStack on OpenStack. The first OpenStack is installed as the installer, and it is called the undercloud. In this scale environment we used a bare metal node for the undercloud, with the control plane IP range sized for our 712 compute nodes plus the controller and Ceph nodes. We did not use any advanced features like undercloud minions or routed provisioning networks here; all provisioning requests were a flat layer 2 scenario.

Second, we applied the OVN patches we were aware of from our OVN team, which potentially fix scale issues and also give us stability in the scale environment. As a prerequisite on the TripleO side, we created the Heat templates for the composable roles for our controller, Ceph, and compute nodes across the various hardware types, and we also created profiles and flavors for the different hardware types for the deployment. Once we had all of the templates, profiles, and flavors, we tried a basic overcloud deployment with a minimal node count of three controllers, three Ceph nodes, and one compute node, and we deployed that successfully. After the basic overcloud deployment, we continued to register the remaining nodes available in the lab allocation and kept scaling out the available compute nodes. In this way we reached 700-plus compute nodes using the 12 composable roles.

After the scale deployment, we also did some exercises on control plane scalability for the key services Nova, Neutron, and Cinder. Throughout these exercises we monitored the environment through Grafana, we captured and documented the steps required for scale testing, we captured the debug methodology, and we also filed multiple bugs for further enhancements and fixes upstream.
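As a rough illustration of the profile-and-flavor prerequisite step just described, here is a hedged sketch. The profile name and flavor properties are illustrative and vary by release, so check the director documentation for your version.

```bash
# Hypothetical sketch: names and values are illustrative, not the exact
# profiles used in this deployment.
# Create an undercloud flavor for one hardware type and tie it to a profile.
openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-dell
openstack flavor set \
  --property "capabilities:boot_option"="local" \
  --property "capabilities:profile"="compute-dell" \
  compute-dell

# Tag a registered Ironic node with the matching profile so that the
# corresponding composable role only lands on this hardware type.
openstack baremetal node set <node-uuid> \
  --property "capabilities=profile:compute-dell,boot_option:local"
```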
Coming to the highlights: there is no regression in scalability in the Train release compared to OpenStack Queens. TripleO has introduced the Ansible-based config-download model, which significantly reduces undercloud resource utilization; because of that, we did not need to bump up any worker counts or default timeout values for key undercloud services such as Keystone, Mistral, and Zaqar. The introduction of the Ansible --limit feature also helps us minimize deployment time, because we can selectively choose which nodes need the overcloud services configured.

The next highlight is that heat-engine consumes a lower memory footprint compared to previous OpenStack releases; heat-engine has been significantly optimized since the Pike release, and it now uses less memory in OpenStack Train. After the deployment, we did some exercises on control plane scalability and performance, such as creating 2,000 VMs with volume attachments, and we did not observe any regression there either. The next highlight is that RabbitMQ is very stable in the Train release: we did not observe any regression, and we did not find any exponential performance spikes in our monitoring. OVN is also maturing well, scaling to 700-plus compute nodes as the default Neutron backend. And a major improvement is that the Ansible config-download is faster compared to previous OpenStack releases like Queens; for example, using the --limit option, a 100-node deployment took less than two hours, which is quite fast compared to previous releases.

Here is one snapshot of heat-engine resource consumption at 500 compute nodes. The Y axis is memory consumption and the X axis is the time over which heat-engine was used for the scale activity. You can see that heat-engine gradually increases its memory consumption as we add batches of compute nodes during scale-out. At 500 compute nodes we observed around 8 GB of memory utilization, with CPU utilization of roughly 25 cores, which is less than previous OpenStack releases and a good positive for the Train release.

Coming to the 700-node deployment: we observed that heat-engine, configured with 24 worker threads, consumed close to 50 GB of RSS memory. For further validation we restarted heat-engine, which brought the memory down to about 3 GB, and from there it grew in a roughly linear way as we did the scale-out operations for the remaining nodes. In the entire validation we did not observe any memory leak on the Heat side.
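Coming back to the --limit highlight for a moment, here is a hedged sketch of what a scale-out run looks like with it. The environment files and hostnames are illustrative, and the host pattern follows normal Ansible limit syntax.

```bash
# Hypothetical sketch: file names and host patterns are illustrative.
# After raising the compute count in the node-counts environment file,
# restrict the Ansible run to just the newly added nodes.
openstack overcloud deploy --templates \
  -r ~/templates/roles_data.yaml \
  -e ~/templates/node-counts.yaml \
  -e ~/templates/network-environment.yaml \
  --limit "overcloud-computedell-100,overcloud-computedell-101"
```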
We also did extensive control plane performance and scale testing after the scale deployment to 700 nodes. We used Rally scenarios to execute around 2,000 VM requests across the compute nodes, with various scenarios such as booting a Nova server, creating a volume, attaching the volume, listing the servers and volumes, and attaching a network port. Across all of these operations, if you look at the graph, there is no exponential spike anywhere; all of the resource utilization grows in a linear way. And the best part is the Nova boot time: it took very little time, around 13 seconds, to boot an instance, which is better than previous OpenStack releases. Next slide, please.

Of course, there were some challenges. I want to highlight that all of the issues we reported have already been fixed upstream and documented, but when we did this testing the end-user experience was a little rough from the deployment perspective.

The first challenge we faced was that the overcloud deploy command with --limit did not reduce the deployment time for ceph-ansible, because the ceph-ansible run kicked off by config-download was not limited. We came up with a manual workaround to run the ceph-ansible playbook with the limit parameter for the specific nodes, so that we could keep getting faster deployments and reduce the deployment time.

The second challenge was that a yaql expression consumed too much memory when we increased the node count past 200, and the Heat stack failed with that error because of a newly introduced yaql expression. For that we had to increase the yaql memory quota, and we raised an upstream bug to optimize the expression so that the default quota continues to work when you scale past 200 nodes.

The next challenge was around DeploymentServerBlacklist, the TripleO parameter that is usually used to blacklist specific or faulty nodes in the overcloud environment. We tried that scenario together with the Ansible --limit feature, but we observed an impact where the nova-compute services went down because of some additional Ansible playbooks that still ran. Our deployment framework team has addressed this issue and it has already been fixed upstream, and we documented the steps so that operators can use this parameter properly in large scale environments. For the remaining challenges, I will hand over to Sai. Thank you.

Yeah, so I'll continue with the challenges here. One of the things we observed was that the default Ansible fork count set by the config-download playbooks was 10 times the number of processors on the machine, that is, the undercloud. In our case it was a 64-core machine, so the fork count was being set to 640, which is not a great idea because you are obviously going to be bound by memory with that many forks. So we raised an issue and worked with upstream to lower this fork count. We also did extensive profiling of Ansible tasks while the installer was running, so we were able to identify tasks that were taking more CPU or memory, and based on that some refactoring of those tasks was done.
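For illustration, here is a hedged sketch of the kind of Ansible settings involved in the fork count and task profiling points above. The ansible.cfg path varies by release and may be regenerated between runs, so treat this purely as an illustration of the knobs rather than a supported procedure; newer releases already cap the fork count for you.

```bash
# Hypothetical sketch only: the path and values are illustrative.
CFG=/var/lib/mistral/overcloud/ansible.cfg

# Cap the fork count instead of the 10x-CPU default (640 on a 64-core undercloud).
sudo crudini --set "$CFG" defaults forks 64

# Emit per-task timing so CPU- or time-heavy tasks stand out, similar in
# spirit to the profiling we did.
sudo crudini --set "$CFG" defaults callback_whitelist profile_tasks
```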
Also, when it comes to OVN and scalability, we identified that the Pacemaker-managed OVN database bundle was not replying to the liveness probe when it was really busy and its processing loop was taking a long time. So we bumped up the default probe interval so that Pacemaker does not fail over when it's not needed. We also had to bump up a couple of timeouts, the OVN remote probe interval and the OVN OpenFlow probe interval, on the compute nodes to give ovn-controller enough breathing room when there is a lot of churn in the environment. For example, when you boot 2,000 VMs at once, there is a lot of churn and ovn-controller gets really busy. We identified these tunings and documented them.

Coming to the outcomes of this scale testing: like any scale testing activity, we want to identify issues before customers run into them and get them fixed. So we filed bugs for the issues we found, and we also came up with tunings that work at scale; in cases where it made sense, we pushed those tunings upstream as defaults, so you get them out of the box, for free, without having to do any performance engineering. We also feel confident that we've been able to provide a better end-user experience at scale as a result of this testing. As I mentioned previously, we did extensive profiling of Ansible tasks during installation, which resulted in some architectural changes as well as refactoring of existing Ansible tasks, to make sure Ansible consumes a smaller footprint on the undercloud and runs faster. As Pradipta covered, and as I did in my previous slide, we filed close to 15 bugs. These are the high-level issues; I'm not going to go into detail on each of them, but I just wanted you to know that we filed bugs, worked closely with developers, and got these fixed so that customers have a better experience at the end of the day.

In the last part of this presentation, I want to walk you through some of the best practices and lessons learned from doing this multiple times over several years and several releases: what are the best practices and recommendations when you have to do a TripleO deployment at scale?

One of the first things I want to highlight is: always try to use a bare metal node for the undercloud. As tempting as it is to use a VM, keep in mind that the undercloud is an integral component of TripleO; you're not done with it as soon as you finish your deployment, because you also need it for upgrades and updates. The undercloud is a really crucial component, and it's best to use bare metal so that you don't run into problems later on, even if your initial stack deploy is fine. I would also recommend using a 10 Gb provisioning network so that provisioning is faster and you're not bottlenecked by a 1 Gb card.

You also have to plan your undercloud for scale right at the moment you deploy it. Things like the control plane subnet range are hard to change once the initial deployment has been done, so always deploy and configure for scale. It's also recommended to use memcached for caching with Keystone and Heat on the undercloud for better performance; you should get this as a default now, because as a result of our testing we pushed it as a default upstream. If you're not using telemetry on the undercloud, it's better to set the notification driver of the undercloud services to noop, so that it eases some pressure on the undercloud RabbitMQ.
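To make the "plan the undercloud for scale" point above a bit more concrete, here is a hedged sketch of an undercloud.conf sized generously; the interface name and addresses are illustrative, not the values from this deployment.

```bash
# Hypothetical sketch: interface names and addresses are illustrative.
# The point is simply to pick a control plane subnet and DHCP/inspection
# ranges large enough for the node count you eventually want to reach.
cat > ~/undercloud.conf <<'EOF'
[DEFAULT]
local_interface = eth1
local_ip = 192.168.16.1/20

[ctlplane-subnet]
cidr = 192.168.16.0/20
dhcp_start = 192.168.16.10
dhcp_end = 192.168.30.200
inspection_iprange = 192.168.31.10,192.168.31.200
gateway = 192.168.16.1
EOF
```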
Usually, the first bottleneck people might hit when scaling past 250 nodes per cluster is Heat stack update failures, although you don't see them as often now because a lot of improvements have been made. They can be caused by insufficient Keystone or heat-engine worker counts, so watch out for that; but again, out of the box the defaults now give you better worker counts.

Then, Ironic has a lot of bare metal nodes to manage, in this case 700-plus compute nodes, and the conductor can get a little busy in terms of CPU just polling power state. We've seen that bumping up the power state sync interval, which is a value in seconds, to a higher number helps reduce those CPU spikes.

Also, when you add a few compute nodes to an already existing large overcloud cluster, it's highly recommended to use either the skip-deploy-identifier flag, in the case of older releases that use Heat and os-collect-config for configuration, or the --limit flag, in the case of newer releases that use Ansible config-download, to cut down the deployment time. If you're not making any other changes and are basically just adding a few compute nodes, this is the best way to cut down the deployment time and make sure that Ansible, or Puppet, doesn't run on all of the existing nodes but only on the new ones, to save time.

I also recommend scaling up compute nodes in batches if possible, because that will help you identify and root-cause issues much faster. Based on our experience, we can also say that the monolithic control plane can handle both the control plane load and the scale of the cluster if the controller nodes are sized properly, so sizing becomes really crucial here.

And with that, we're almost at the end of this presentation. I just want to leave you with a couple of links to the blogs we've written about scale testing OpenStack Platform 13 and 16.1, which are Queens and Train upstream, respectively. With that, thank you for watching our presentation. Thank you.