Okay, so I'm Simon Lee. I work on the Red Hat Performance and Scale R&D team, mostly on OpenStack and also on integrated products, like OpenStack deployments that support OpenShift, where we make sure OpenShift scales on OpenStack. So if you have any OpenStack performance or scale questions, we can talk about them right now. I wanted this to be a very free-form discussion, so I'll just go over a few points, and if you have other questions I'll be happy to answer them here or elsewhere.

We cover two aspects: performance and scale. Performance is when you want your API times to be faster, or you want your fio numbers, your disks, to be faster. And we also cover scale, in the TripleO sense: how many compute nodes can a particular region, or a single cloud, have? Right now, in internal Red Hat testing, we've gotten to about 300 compute nodes with a monolithic controller. How many of you have used TripleO? Okay, so you've used TripleO. In TripleO we can now do composable roles: earlier we would put Nova, Neutron, Swift, everything on the same node, but now you can split some services out onto their own bare metal nodes. With the monolithic controller architecture we've done about 300 compute nodes, but there are customers and partners, like Fujitsu, doing 2,000-compute-node deployments in their regions.

So I'll cover a few TripleO best practices, because we all know TripleO can be a beast, right? If this is your first time deploying it, it's not so straightforward. One thing for sure: we've seen a lot of customers and partners with very small root disks, or spinning disks as root disks. We definitely recommend SSDs or NVMe for the root disk, and with plenty of space, especially with services like Swift slamming the disks; Swift is co-located on the controllers the way things are laid out right now. If you only have spinning disks, you'll see MySQL commits taking a long time; enable the slow query log and you can see commits taking 25 seconds, because Swift is already slamming the disks.

You can pass parameters to your overcloud deploy command to put Swift on its own disk. If you have multiple disks on your controller node, we always recommend putting Swift on its own disk; otherwise every other service using the root disk gets slammed by Swift as well, even idle Swift, not doing anything, no telemetry. Telemetry is the Gnocchi and Ceilometer services in OpenStack that collect metrics from your VMs and your networks; if you want to bill a customer who's using your cloud, you want to know what their usage is, right? So telemetry collects a lot of metrics, and it can use Swift as a backend. If you use Swift as the telemetry backend, it can basically take your cloud down. So pretty much always put Swift on its own disk. Even without telemetry you can see something like 10% disk I/O from Swift, and if you add that 10% onto the root disk it just makes things worse, with a cascading effect. And if you're not using telemetry, advise customers and partners, or yourself, to just disable it.
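A rough sketch of what that can look like as an extra environment file for the deploy command, assuming the controller has a spare disk named sdb; SwiftRawDisks and the disable-telemetry service mappings are the tripleo-heat-templates knobs I believe cover this in the OSP 13 timeframe, so check the exact names against your release:

    # swift-and-telemetry.yaml
    parameter_defaults:
      # give Swift its own raw disk instead of sharing the root disk
      SwiftRawDisks:
        sdb: {}
    resource_registry:
      # if telemetry is not needed, turn the services off entirely
      # (only two examples shown; the full list of service names varies by release)
      OS::TripleO::Services::CeilometerAgentCentral: OS::Heat::None
      OS::TripleO::Services::GnocchiApi: OS::Heat::None

    # pass it along with your usual environment files:
    openstack overcloud deploy --templates -e swift-and-telemetry.yaml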
If you don't have a valid use case for telemetry, or you're simply not using it, disable it, because whatever storage backend you use, Swift or Ceph, you pay a certain penalty, and you have to tune telemetry; the way it works out of the box, it does not scale well.

Thinking about other deployment best practices: we always set a root password on our overcloud image when we deploy, so that even if network connectivity fails, even if your deployment fails before step one, when the basic control plane networking is set up, you can go in through IPMI, log in as root, and debug it. The way I usually debug is to look at os-collect-config; I run journalctl against os-collect-config. os-collect-config is what configures the node to be whatever it's going to be. All the overcloud images are the same, right? They have the same packages whether the node ends up a compute node or a controller; the magic happens when Puppet configures it as one or the other. So I use os-collect-config to see what's happening. The network environment files, which are a pain to write, are also a common source of problems: sometimes you didn't provide a bridge name, and the deploy command won't tell you that you forgot it. You can just cat /etc/os-net-config/config.json on the node and see how the network environment file you gave the deploy command got translated and applied to that specific node. If you see a bridge name that's null, you obviously did something wrong and didn't provide the correct bridge name.

Some other things, especially since Queens, which is OSP 13: all the bare metal scheduling is based on resource classes. Earlier, the way Nova on the undercloud picked overcloud nodes was that you put some flavor requirements, 4096 MB of RAM, one vCPU, so much disk, and pretty much any bare metal node that matched got scheduled. Now the way scheduling works is that you have to have a resource class: set RAM, vCPUs, disk, everything to zero on the flavor, and put a custom resource class on it, something like CUSTOM_BAREMETAL; there is a naming convention you have to follow. We've seen a lot of customers complain, "I'm seeing scheduling errors," and it's that most famous error, right? "No valid host was found," which everybody hates because it gives no proper reason. There was an April Fool's Day commit I remember from two years ago, somebody adding "no valid host found" and all that. Anyway, where was I? Yeah: always set a resource class.
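A minimal sketch of that resource-class setup with the Ironic and flavor commands; the class name "baremetal" and the node UUID are placeholders, adjust them for your environment:

    # tag each bare metal node with a resource class
    openstack baremetal node set <node-uuid> --resource-class baremetal

    # zero out the old flavor properties and schedule purely on the custom class;
    # note the naming convention: resource class "baremetal" becomes CUSTOM_BAREMETAL
    openstack flavor set baremetal \
      --property resources:CUSTOM_BAREMETAL=1 \
      --property resources:VCPU=0 \
      --property resources:MEMORY_MB=0 \
      --property resources:DISK_GB=0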
And I encourage you as fellow developers: a lot of the bugs we see are configuration errors rather than actual bugs, so whenever you see something that could have been caught with a validation, contribute it upstream to tripleo-validations. That's what I've been doing. For example, one of my colleagues used br-int as a bridge name for TripleO, and we all know br-int is the integration bridge on the compute nodes, right? OVS is not too happy; you get all sorts of weird behavior when there are two bridges with the same name. A simple network validation that checks the network environment file for any bridge named br-int and fails otherwise, simple things like that can be done.

The other issues I've seen are with Ironic especially. How many of you get overcloud deployments working on your first attempt, straight away, bang? Very few, right? It takes a little while. So your overcloud deployment fails, there's some node you suspect has a problem, and you put it into maintenance. Earlier, when you put a node into maintenance, whatever power state Ironic last knew about was the one that stayed there: if it was power on, it stayed power on. Now it changes to None, which is more representative. So when you put a node into maintenance and the power state says None, please actually go and verify that the node really is powered down, because otherwise you get all sorts of weird IP conflicts, where that node is still using an IP from a previous deployment and os-collect-config, instead of configuring one node, goes and configures the other one. Things like that are worth keeping in mind. And I'm just trying to think what your scale or performance issues are, anything specific you want to talk about, anything you want me to cover, your scale issues?

[Audience question about introspection, partly inaudible.] Yeah. Yeah, that is the right value. So with instackenv.json, everybody has the misconception that they have to provide all of those things: the MAC addresses for introspection, the RAM, the CPUs. Ironic doesn't even care about your MAC address, as long as the nodes are on the same provisioning network where it will try to collect the data, and the RAM and CPU entries are just placeholder values. You don't even need to provide them in your instackenv.json and it will work totally fine. Just put the IPMI address, user, password, and power management type; those are the main things you need in instackenv.json, and Ironic will do the rest for you.

Since you brought up introspection, I'll also cover one more best practice, for Ceph. We all know that when you deploy with Ceph, you have to tell it which disks are your OSDs and where you want the Ceph journals. Earlier, people used /dev/sda, /dev/sdb, /dev/sdc. But that is very funky, right? The order in which the kernel enumerates your devices depends on your board, and sometimes it just changes randomly, for reasons I do not even understand. So the performance and scale team worked with the Ceph developers to add the capability to Ironic to collect the disk by-path identifiers, which never change, and now ceph-ansible can use by-path to deploy. So always use disk by-path; give the direct path to your disk, never /dev/sda. Otherwise you'll say "use /dev/sdb for Ceph," but then what was /dev/sdb comes up as /dev/sda, and you put Ceph on your root disk and it just overwrites your root disk.
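A sketch of what that looks like with the ceph-ansible integration, using the parameter name and keys as I recall them from the Queens / OSP 13 templates; the by-path strings below are made-up placeholders, pull the real ones from your nodes or your introspection data:

    # ceph-disks.yaml
    parameter_defaults:
      CephAnsibleDisksConfig:
        osd_scenario: non-collocated
        devices:
          # OSD data disks, addressed by stable by-path names, never /dev/sdX
          - /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0
          - /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:2:0
        dedicated_devices:
          # journal device, listed once per OSD data disk above
          - /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0
          - /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0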
And while we're on Ceph: there were some problems with Ansible memory usage. With TripleO we use ceph-ansible to deploy Ceph, and the way it worked, it would use the lower of the total node count or 100 as the Ansible fork count. So with 85 nodes, it would run 85 Ansible forks on your undercloud and try to configure things, and we all know Ansible has some memory usage problems that are only gradually getting better with each release. We saw cases where Ansible memory usage on the undercloud ballooned up to something like 50 GB RSS. Over time that is being fixed in Ansible itself, but the immediate solution was to reduce the Ansible fork count, so we've changed the default upstream to 25; 25 is now the Ansible fork count on your undercloud.

Anybody using OpenDaylight as the SDN for their deployments? Okay, because that introduces a whole new complexity. Right now we're using ML2/OVS; if you put OpenDaylight in, there are some extra tweaks you have to do. Since nobody's using it, let's not worry about it. Anything else you want to talk about: Swift, Nova, Neutron, performance, scale?

[Audience question about hyper-converged deployments.] Yeah, we have done hyper-converged. The biggest hyper-converged deployment we've done in the Red Hat scale lab is around 35 nodes, I think; we haven't really pushed it to a big number. There are a lot of hyper-converged questions on the internal lists, I think; I don't know if you've asked there. But there are people on our team, like Ben England, who I can put you in contact with, who do hyper-converged and all of that. The usual recommendation, I think, is per OSD: if you're putting Ceph on your compute nodes, you have to account for Ceph, and I think it's about three gigs of memory per OSD that you have to reserve. I'm not entirely sure, I can look up the values, but there is a certain memory penalty you pay; there's a rough sketch of the math a little further down. How many minutes do I have? So yeah, we've done 35 nodes, but we definitely have more nodes we could try.

Any other questions, NFV or anything? One more best practice we follow: I don't know how you deal with customers, because we mostly don't do customer deployments, we just deploy in our own lab, but traditionally we've tried to deploy and then scale up in batches of 30 or 40 nodes, because if you're deploying 100 nodes and the deploy fails after three and a half hours, you have to redo the whole thing, right? So we typically scale up in batches of compute nodes. That said, OSP 10 was the first release where I tried deploying 100 compute nodes at once and it worked, and since then TripleO has been pretty decent.

Since we have time, I'll talk about one more interesting bug I ran into before the OSP 13 release, when we were testing OSP 13 at scale. Earlier we had 300-node deployments without problems, but on OSP 13, however much I tried, I could never get past about 125 nodes, and I was seeing weird Docker errors: cannot mount volume at /etc/ssh/ssh_known_hosts because it's not a file, it's a directory. On further debugging: after the libvirt CVE last year, we're supposed to have an /etc/ssh/ssh_known_hosts file on every compute node, really on every node TripleO deploys, and that file was being generated by a bash script. The IPs or host names of all the nodes were being passed as arguments to that script, so we were basically hitting the OS limit on the number of arguments you can pass to a command. These are the kinds of funny problems you run into when you deploy TripleO at scale; they're not even performance issues, you're actually hitting OS-level limits on how many arguments a command can take. That one got worked around. So yeah, we have these kinds of issues, and I'm not an expert on everything, but I'll try to help.

[Audience question.] Yeah, you're asking the right person.
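Going back to the hyper-converged memory question, a rough back-of-the-envelope sketch using the three gigs per OSD rule of thumb from above; that number, and NovaReservedHostMemory / reserved_host_memory_mb as the knob to apply it with, are my assumptions here, not an official formula:

    # hyper-converged compute node with, say, 12 OSDs:
    #   12 OSDs * 3 GB per OSD = 36 GB kept back for Ceph
    # tell Nova not to hand that memory to VMs, e.g. via the TripleO parameter:
    parameter_defaults:
      NovaReservedHostMemory: 36864   # in MB; ends up in nova.conf as reserved_host_memory_mb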
I was the one who did the performance testing for that, and I have a talk at 1:45 where I'll go through the configuration files and how changing simple values in them gets you good performance. I'm not trying to advertise it, but I have a slide on exactly this. Since you brought it up: the Open vSwitch based, flow and conntrack based firewall driver has been there since OSP 10, but OSP 12 is when it went into full support, and I actually have numbers to show that the OVS firewall driver is much, much better than the iptables one, because you're pretty much removing the Linux bridge from your data path, right? I proposed a patch upstream to TripleO in the Queens cycle to change the default from the iptables driver, with its Linux bridges, to OVS, but it was too late in the cycle, so they said they'd merge it in the Stein cycle, or however they pronounce it. I've had no issues with it. You can change a TripleO parameter, NeutronOVSFirewallDriver or something like that, a Heat template parameter, pass it to the deploy command, and you only need it on the compute nodes. I've seen much, much better performance, so if you're dealing with customers, I encourage you to ask them to use this. The migration path has also, luckily, been built into Neutron itself: Neutron will realize it now needs to use the Open vSwitch based firewall rules, so it's not much of a problem migration-wise either, and you don't need to build any migration path into TripleO. I'll give you an example: plain L2 traffic, VM to VM between compute nodes on a 10 gig link, your throughput goes from something like six gigabits per second to nine gigabits per second. That's a huge boost, right? That's three gigabits per second more.
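A minimal sketch of flipping that on; NeutronOVSFirewallDriver is the Heat template parameter I believe is being referred to here, so double-check the exact name in your tripleo-heat-templates:

    # ovs-firewall.yaml
    parameter_defaults:
      # switch the security-group firewall from the iptables hybrid driver to the
      # OVS/conntrack driver; this ends up in openvswitch_agent.ini as
      # [securitygroup] firewall_driver = openvswitch
      NeutronOVSFirewallDriver: openvswitch

    # pass it along with the rest of your environment files:
    openstack overcloud deploy --templates -e ovs-firewall.yaml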
Anything else? No? Well, unfortunately performance and scale are very subjective compared to functionality, where you can put up a green or red bar and say it works or it doesn't. What is good performance to you might not be good performance to me; what is good scale to me might not be good scale to you. There are a lot of small values that can be tweaked to get good performance out of the box, and that's what we try to do as performance and scale engineers. But sometimes, no offense to developers, developers don't see the whole reasoning behind why we want better performance in a particular place. So what we've been doing is coming out with official docs and blogs, plus TripleO patches that try to change some of the defaults, and for everything that doesn't get changed, we push docs upstream or within Red Hat.

There are a lot of config values; in my talk I'll go over simple ones where you change one line and get a big boost in performance. But you know there are penalties. For example, there are parameters you can change in Nova that say: don't do thin provisioning, do thick provisioning; whenever you boot the VM, give the qcow2 image all of its filesystem blocks up front instead of waiting for the guest to ask for them. Do that and you'll see something like 1500 IOPS with the blocks pre-allocated versus around 900 IOPS without. But people don't want to do it, right? Because if you do, every time somebody boots a VM with a 20-gig disk, Nova has to make sure the full 20 gigs is actually available on the hypervisor, so on a hypervisor with limited capacity your boots will fail. With thin provisioning, everybody boots 20-gig disk VMs but only uses one or two gigs; it shows that 20 gigs were taken, but it's not really 20 gigs of filesystem blocks. So that's one example: developers will say, why would you want to do that, Nova will puke if you do this. But you might want better IOPS; your workload might want better IOPS, or better networking. It's very subjective, and that is the problem with performance and scale. So we try to work with customers and partners, analyze what their requirements are, and make recommendations based on that. That's also why we don't just ship whatever defaults I happen to think are good, because what I think is good might not be good for you. And most of the time, config options exist because the developers couldn't make a decision, so they pass it down to you; timeout values and so on, they don't want to pick one, so it becomes a config parameter, which is more burden on the user.

So yeah, I'll hang around; if you have any questions, feel free to ask me. Thanks, guys.