Let's get into the big topic of open source and the thing we actually love about it: it's an open culture where anyone can jump in and fix and improve things, which is exactly what the Kubernetes ecosystem brings. And welcome to this week's Ask an OpenShift Admin livestream. You'll notice I don't have my usual co-host with me; Johnny is on some well-earned PTO this week. But I do have our guest, Mr. Frank Bodden, joining us today. If you happened to miss the title or subject of this episode, we're going to be talking about the Performance Addon Operator, and we'll get into much more detail there. This one pokes all my nerd buttons; it makes me happy to be able to mess around with it. And when Frank and I were talking yesterday, we agreed this one is a little dangerous, right? You can break things, and when you break things, that's bad. But I'll quit talking about Frank and let Frank introduce himself, if you don't mind.

Hi, everyone. I'm Frank Bodden, based out of France. I'm a technical product manager on OpenShift for telco and networking, focused this year on the 5G core rollout, and I work a lot with the Performance Addon Operator.

Yeah, and I know the technical product manager aspect is really important, because you are one of our more technical product managers for sure, and I think that's really good and really exciting for our audience. You're very modest about all of this; you keep telling me it's probably going to be a short subject, and I'm not so sure, because this is one that's interesting to me. Again, it pokes all of those buttons for me around the stuff I like to tune and adjust to make sure everything is operating the best that it can. We'll get to all of that in a few minutes, but first, let's do our normal top-of-mind topics.

First and foremost, this is one of our office hours series of livestreams here with Red Hat livestreaming, which means that, just like a manager or professor who held office hours, we're here to help answer your questions, whatever is on your minds. If you have a question about OpenShift, about how it works, about something that's broken, or maybe just what the best way to do something is, feel free to ask in chat. It doesn't matter which platform you happen to be watching on, whether it's one of the YouTube channels or Twitch; all of that chat gets aggregated and rebroadcast across the others. So by all means send us those messages and we'll address them as they come in.

With that said, let's talk about the top-of-mind topics. As a reminder, these are things that Johnny and I have found since the last livestream that we think might be interesting to you. The first is a blog post that was published on Monday about Ansible and OpenShift integration. Let me grab the link, and I'll put it onto Twitch as soon as I switch over to the livestream there. There we go. This blog post describes how to configure OpenShift to act as an execution engine for, well, Ansible.
The really cool thing, and if we scroll down it walks through all the steps you need to take, is this graphic: I can have an Ansible job inside of Tower, and when it launches, it reaches out to OpenShift and triggers a job inside of OpenShift to do the actual work. What this means is that you're unlocking the scale of Ansible and Ansible Tower: as many jobs as you need to run, as many concurrent jobs as you need, can all be scheduled against a Kubernetes or OpenShift cluster. I thought this was really cool, particularly as we see more and more cross-infrastructure, cross-application integration happening with our customers and their applications. I've worked with a number of customers where the application team needs to deploy something, and some of it is in OpenShift, some of it is in virtual machines, some of it is physical, and they need to reconfigure this, provision storage off of that, and have everything mesh up in the end. So for me this one was really cool and really exciting, but hopefully I won't oversell it.

Let's see, where's my notes document? Oh, the first one I forgot to bring up: a reminder that there will be no Ask an OpenShift Admin livestream next week, February 16th. However, the What's New in OpenShift 4.10 presentation is happening at the same time. If you're not familiar, the Red Hat product management team has two livestreams that they do basically every quarter. One is What's New; it happens with each release of OpenShift and describes all of the new things you'll find in that release. We typically schedule those for 90 minutes; I don't know that one has ever gone less than 90 minutes, and usually it's closer to two hours. They're really thorough, they're really good, and the PM team does a phenomenal job of laying out everything that's happening. Typically, here on the Ask an OpenShift Admin livestream, we'll follow up on that, and I like to pull out a couple of the things that I think are really important to you all and dig into them a little bit. So if you happen to be watching, feel free to send me a message and let me know: hey, can you pull this apart, can you dig into this? Also, while you're watching, there will be several of us in the chat to help answer questions: definitely me, I think Christian usually joins, and Eric, who does the gaming livestream. So if you have any questions or comments as they go through that presentation, be sure to leave them in the livestream chat. We always shuttle those over to the product management team, so anything we can't answer, we can get answers from them, and of course they always like to know your thoughts and opinions too. Okay, that being said, do subscribe to the channel if you haven't already, on whatever platform you happen to be on.
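By the way, for anyone who wants to picture what that Tower-to-OpenShift hand-off looks like, here is a purely illustrative sketch of the kind of short-lived pod a container group spins up per playbook run. The namespace, image, and playbook name are hypothetical placeholders, not taken from the blog post; the point is simply that each job becomes a pod the cluster can schedule wherever it has capacity.

```yaml
# Illustrative only: a job pod an automation controller might launch on OpenShift.
apiVersion: v1
kind: Pod
metadata:
  generateName: automation-job-
  namespace: ansible-automation            # hypothetical namespace
spec:
  restartPolicy: Never                     # one pod per job run, then it exits
  containers:
  - name: runner
    image: quay.io/example/ansible-runner:latest   # hypothetical execution image
    args: ["ansible-runner", "run", "/runner", "-p", "site.yml"]
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
```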
We do have a number of really interesting shows coming up toward the end of the month. I don't have the list in front of me, so I don't remember off the top of my head what they are; I've been completely consumed with one of my tasks this week. It's only Wednesday and I've already spent something like 20 hours on it, so my brain is only half here. But please do get subscribed, and if you don't want to subscribe on YouTube or Twitch, definitely check out red.ht/livestreaming. That's the landing page for Red Hat livestreaming, with the whole calendar of all of the livestreams, and it is a Google calendar: down in the lower right-hand corner there's a little link that says add to Google calendar, so you can subscribe to the calendar if you'd like and get notified if we cancel. For example, if you're subscribed, you would see the canceled episode for next week. As for supplementing your brain with refried turkey, that would not be a good idea, Tiger; the tryptophan would be counterproductive. Oh, Stephanie, thank you, very helpful.

So, hopefully we'll be talking about ODF Essentials; we might be doing a dive into that and what it looks like inside of clusters. I think there are some other 4.10-specific things we wanted to call out as well. One I wanted to mention is layer 3 mode, that is, BGP mode, with MetalLB. NMState, yes, thank you, that's another good one. I have a show scheduled, though I don't have the guests lined up yet, and NMState is going GA, we think, with 4.10, and there's some really cool stuff happening there at the installer level. So stay tuned for that; I'm not going to spoil the PM team's surprise.

All right, moving on. Internally, a lot of the Red Hat folks, particularly those close to the business unit and engineering, had what we call the OpenShift Networking Summit last week. It's an internal event where we talk about everything happening with networking in OpenShift, and of course one of the subjects that came up was OVN-Kubernetes versus OpenShift SDN. It's been discussed ever since we released OVN-Kubernetes as a fully supported SDN option back in 4.6, I think, 4.6 or 4.7: which one of these do I use? Which should I choose? Really it comes down to this: the default continues to be OpenShift SDN. There is a longer-term goal to change that default. If you remember, about a year ago we had Mark Curry on the livestream, and Mark said the intent was to switch the default to OVN-Kubernetes in 4.9, but that's been pushed out for a number of reasons, and the default continues to be OpenShift SDN. My personal opinion is to continue using the default unless you have a reason not to. If we share my screen here in the documentation, there's a comparison of the features available, and you'll notice that OVN-Kubernetes is capable of a number of things that OpenShift SDN isn't, for example IPsec encryption and IPv6. If you don't need any of those features, OpenShift SDN as the default makes perfect sense. I'll post this link into the chat here.
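For reference, picking the SDN is a day-zero choice made in install-config.yaml; here's a minimal sketch of the relevant fragment, assuming the rest of the file is a normal install config. OpenShiftSDN is what you get if you don't set anything.

```yaml
# install-config.yaml (fragment): networkType selects the cluster network provider.
networking:
  networkType: OVNKubernetes     # or OpenShiftSDN, the current default
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
```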
And the good news is you can migrate. If you deploy OpenShift SDN today and later discover that you need OVN-Kubernetes, you can migrate between them; you can see in the documentation that the very next page covers migrating. Today, the only thing that comes to mind that would prevent you from doing this migration is if you need features that can only be turned on at install time, and specifically what I'm thinking of is the hybrid networking option that's required for Windows nodes and Windows containers. That has to be turned on as an option when you do the install; it can't be enabled on day two. However, I think I heard a rumor that changing that might be on the roadmap, so that would be one less reason not to stick with the default today unless you already know you want OVN-Kubernetes. And if the need arises in the future, you can always make that switch.

Catching up on chat here. Please post the URL of the blog you're showing: yep, I think we did that, but I'll post it again just to be sure; it's that Ansible.com link I just posted into chat. And aside from that, what about Tripti? Tripti, I wish; that would be nice. I keep trying to convince my wife that we should go to that part of the world, and my wife is not a flyer. While I have been to the APAC region multiple times, and I've been to Australia a few times and absolutely love it there, my wife is not on board with the 15-hour plane ride to get there.

Do you recommend using another CNI like Calico or Cilium? That again goes to your needs and requirements. Calico: Tigera is the company, their community edition is Calico open source, and then there's Calico Enterprise. If you need those features, if that fits your model and the things you're doing, then by all means use Calico, Cilium, whichever one meets your needs; I have no preference between them. I know Calico offers a lot of really cool things: they can do an all layer 3 SDN, I think that's one of their options, so everything is routed and you're not stacking an SDN on top of another SDN, which helps if you happen to be using another SDN at the hypervisor level, things like that. They also have some pretty powerful network policy capabilities. We talked about this two or three weeks ago when we had the VMware folks on, about the possibilities there with Antrea and with NSX; same sort of thing. You want to use those capabilities? Great, that's wonderful. It will absolutely work with OpenShift, it's certified, and I think we have a whole certification program for that. No harm in doing it.

The CNCF community had a webinar last night on using Submariner to make services available across clusters. Yep. Submariner, as far as I know, and I'll be clear that I'm not the expert here, is decoupled from the specific SDN. Essentially it creates a layer 3 overlay network between the clusters so that they can communicate SDN to SDN without having to route outside of the cluster, if that makes sense; it's not going through the egress and ingress path, I'll put it that way. Submariner is definitely a thing, and it's integrated with ACM now.
So if you create a cluster pool inside of ACM, you can tick the box that says deploy Submariner, and it will deploy it across those clusters. I'll see if I can get some more details there. One of the things we had talked about for the not-too-distant future was service mesh, and I think that would be a great opportunity to talk about Submariner and some other networking-related topics; depending on the scope of that, or how large the topic is, we may even have a dedicated stream for it. So keep an eye out for that one, and of course I always try to remind people of our upcoming topics, when I'm paying attention to my own schedule ahead of time. Yeah, eBPF: there's lots of really interesting stuff happening in that area too. We should definitely have Mark Curry on; I'm going to take a note for that.

All right. The last thing I wanted to talk about real quick is OpenShift Commons. If you're not familiar, OpenShift Commons is kind of our community group, and the Commons group regularly has these Commons gatherings where they talk about a bunch of different topics. This one popped up because Christian is a friend of mine and he's always telling me about GitOps things. If you have some time, you can see it starts right when our stream ends, so you can hop over to Hopin, no pun intended, and watch the event that's going on there. I'll go ahead and post the link; I'm sure there will be lots of good information, though I actually haven't looked at the schedule myself. Christian is participating, which is probably why he was harassing me about it. Christian is the one whose upcoming GitOps book I was just helping review a chapter for. There is just a ton of stuff there, and if you're not familiar with GitOps, it's definitely something I encourage everybody to get familiar with now. Remember last week when we had the validated patterns team on: all of their stuff is driven through GitOps. It's just becoming more and more prominent. I've also been doing a lot of work this week with ACM, where we can leverage GitOps to deploy applications, and not just applications but features of OpenShift, across clusters as they join. It's really interesting, there's a lot of power behind it, and it's only getting better as time goes on. So I guess this time I have to send Christian the check, because I'm the fanboy, as it were. Tiger says he's a flyer. Yeah, I know, Tiger; you were here in Raleigh for a while, but you actually live on the other side of the world.

Any other comments here? I don't think so, so I think that's all of the top-of-mind topics. Oh no, I missed one. There was a question that came in internally: is it possible to disable the metrics service? The person was specifically asking about Prometheus, but really it's the metrics service broadly. And the answer to that is: sort of. I brought up the documentation page here, and the important thing is that we can effectively set the operator to be unmanaged and then basically have it uninstall all of those components. It's not a supported option; it's not a supported action inside of the cluster.
And the reason for that, in my view, is that a number of features and capabilities rely on metrics. The horizontal pod autoscaler, the vertical pod autoscaler, all of those rely on metrics data in order to function, so if you disable it, those things won't work. So yes, it's technically possible, but it's not supported and it's not encouraged. I think what this person was doing was some testing: they found that the metrics service was collecting data, doing its thing, and adding a little bit of overhead while they were trying to maximize performance, and they wanted to disable it to see what would happen. The workaround in that case was simply to move all of the metrics resources, so Prometheus and Thanos and everything that goes with them, onto infrastructure nodes. At that point it gets out of the way, and it's just the metrics endpoints that are still doing their thing. Yes, Alertmanager too, very much so; it kind of needs to collect metrics to know when something is out of line.

Nate asks whether choosing a third-party SDN is an install-time decision and how much complexity it adds to the install. Good question. Generally, yes, it is an install-time decision. I don't know of any off the top of my head where you can migrate after the fact, but considering we have a migration from OpenShift SDN to OVN-Kubernetes, it wouldn't surprise me if it's possible. The complexity is really going to vary. At a minimum it's going to be something like: run openshift-install to generate the manifests, then drop whatever YAML files deploy the SDN into the manifests folder, and then continue with a normal installation from there. I know that's how it works with NSX, the VMware NSX container plugin, and I believe that's also how it works with Calico. So the complexity will vary. I don't know if Flannel is a supported SDN, but it's one of the quote-unquote simpler ones; maybe it's a little bit easier, maybe there's just less happening under the covers, I'm not quite sure. But it shouldn't be too hard compared to many things, just an extra step at the beginning.

All right, I'm out of top-of-mind topics. I don't think I posted this link into the chat, so let me go ahead and do that real quick. Okay, so if anybody wants to take a look at that, that's how you can forcefully disable the metrics service inside of the cluster.

All right. I built this up a little at the beginning of our session, talking about the Performance Addon Operator. The name is really intriguing, and it sounds like a magic tool, like TuneD, right? I can use a TuneD profile, set all these things inside of my cluster, and magically my performance is improved. But the Performance Addon Operator really takes it a step or three further, in that you can modify effectively any and all of the low-level Linux and CoreOS settings and configurations inside of your cluster. So while it's incredibly powerful and you can do a lot with it, it's also just a teensy bit dangerous, right? We all like to live a little bit on the edge. So, Frank, I'm curious, and thank you very much for joining me today. This one, I know we've been planning it for a little bit, and I'm excited to learn more about it.
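One concrete reference for that workaround: moving the monitoring stack onto infrastructure nodes is done through the cluster-monitoring-config ConfigMap. A minimal sketch, assuming you already have nodes labeled (and optionally tainted) as infra nodes; only a couple of components are shown, and the same pattern applies to Alertmanager, Thanos Querier, and the rest.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Pin monitoring components to infra nodes so they stay out of the way
    # of application workloads.
    prometheusK8s:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra   # match however your infra nodes are tainted
        operator: Exists
        effect: NoSchedule
    prometheusOperator:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
```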
So can we start with: what's the use case? As the product manager for the Performance Addon Operator, when should customers use this? When is it "hey, I need to use this" versus "I want to use it" versus "you probably shouldn't use it"?

Okay. So when do you want to use the Performance Addon Operator? Not necessarily for performance; it's really for very specific tuning that you need. I'm a telco product manager, and telco means OpenShift on bare metal, where you need very low-level tuning to make sure you have the proper configuration end to end so you can run what we call a CNF, a cloud-native network function. Basically, you want to make sure that you really dedicate some resources to some containers, so that your container will never be preempted by the scheduler or by an interrupt. This is where you want to use the Performance Addon Operator: fine tuning to make sure you can truly dedicate resources, not only from a Kubernetes perspective but from an operating system perspective as well. That's number one. And on top of that, because the Performance Addon Operator is an orchestrator of many underlying components, you get some interesting use cases. For example, you want to configure a kubelet parameter: you can change any kubelet parameter on your cluster with the Performance Addon Operator. You can use other tools to do it as well, but as soon as you start configuring the kubelet, you want to make sure you have one source of kubelet configuration, not two components colliding and competing so that you end up with something incoherent. But anyway, let's say you think you need a TuneD profile (so first you need to know what a TuneD profile is), you need to do some interrupt rebalancing, you care about the NUMA locality of your resources, and finally you potentially need a specific boot parameter on some worker nodes: this is where you should think about PAO. And once you start using PAO, the other tools you could potentially use to configure the kubelet, you should forget about them; PAO becomes your front end for the whole configuration, because if you configure one component in two places, guess what happens?

Yeah, I'm really glad you brought that up, because it was a question I had in my mind as well. There's the Node Tuning Operator (not a TuneD operator, sorry), there's the Machine Config Operator, and so there are multiple ways of setting these things. And very much to your point, and I think it's a really good one, you don't want conflicts. It's hard to keep track of "I'm setting these things over here, and these things over here, and these things over there"; consolidate, standardize on one location, and use it.

And it's even worse than that. You also want to configure CRI-O parameters, very low-level CRI-O parameters, and you want to be coherent between them: say you want a specific CPU set in the CRI-O configuration, you want the same in the boot parameters, and you want the same CPU list in the TuneD profile. If you do it manually you may succeed, but you will basically need to rewrite PAO. So why not use the tool that's there?

Yeah, that's an interesting one.
A lot of folks, and I see chat about this all the time internally, say "I need to make this low-level configuration change," and the default answer has always been machine configs: create a MachineConfig, have it apply to the nodes, and basically you're laying a file down in the standard Linux filesystem to do that configuration. And is that always the best option? Maybe not.

Right. If you're creating something new, and by the way it isn't a low-level configuration change, then maybe. But if you want to change the CRI-O configuration, you don't want to go via MCO, and if you want to change the kubelet configuration, you don't want to go via MCO. MCO is something for adding things rather than tuning, I would say; otherwise you need to be very aware of all the consequences of what you're doing. So, first things first: can you use PAO and MCO together? Yes, of course, we do that all the time. To set the same parameter? No. This is where you need a good understanding of which parameter you want to set where.

Yeah. Sorry to interrupt, but Nate brings up a good point here, which is that it sounds like the Performance Addon Operator is a good fit for specialized worker nodes. He lists nodes with GPUs and special local storage; I think we can include SR-IOV or DPDK network adapters, and maybe special use-case nodes, I'm thinking vRAN or something like that, where the real-time kernel is used.

Yes, indeed. The Performance Addon Operator also helps configure real time: installing the real-time kernel is one thing, but then you need all of the associated boot parameters and the TuneD profile, because when you install real time you want real determinism, and you need a CRI-O config with or without hyperthreading, because you can be more or less strict about your real-time requirements. The best practice for hard real time is to disable hyperthreading, but when you want more throughput, you enable hyperthreading, and PAO can do that for you. One thing we didn't talk about: if you're thinking about disabling hyperthreading, PAO can be a good fit, and some applications are faster when hyperthreading is disabled. That's something the application developer should know, because their application, for instance, requires a big cache, right? When you have hyperthreading on, you basically divide your cache by two.

Yeah. I'm an old-school virtualization admin, and when we used to do sizing exercises, a hyperthread only counted as half a CPU. If I have four cores with hyperthreading turned on, that's eight logical cores, but from a capacity perspective I don't want to count it as eight; I count it as six, the four full cores plus two for the hyperthreads. Which makes me curious, with Intel's latest architectures coming out with performance cores and efficiency cores, if that makes it into the server hardware market, what that would look like. That could make things really interesting.

Yeah, because on your server you will have a pool of faster CPUs and a pool of the others. And this is something we're working on upstream with Intel to enable.
So you would have some kind of first-class workload, very demanding, that says: okay, give me those CPUs. But if you want to consume them that way, you probably have a very specific use case. It could be telco. Telco can be the radio access network, as you said, and the most stringent part of telco is the vDU, as we call it, which is really close to the antenna; then you have the CU, then you have the core, and the requirements relax as you go. So real time is really for the part close to the tower; the other use cases so far don't use real time, but they do use fine tuning. And with PAO, if we step back for a second: Kubernetes is about sharing and cloud native, and telco is about determinism, so they're kind of opposites; either you dedicate or you share. The idea is to find the recipe, the trade-off, to be able to co-locate isolated workloads and over-committed workloads so that we're globally efficient. You have some IT-like pods that can be over-committed, running on hyperthreads, no problem, and you have others that cannot suffer a single extra cache miss, on the same platform. This is something that's going to go one level further with Intel's upcoming CPUs: do you want those CPUs to run more over-committed workload or dedicated workload? I would say that's up to your application, and when you start to think about it, it's really application-specific. FSI, HPC: it really depends on the workload. The workload designer should know. If the workload designer is profiling the application with perf, the kernel perf tool, or Intel's profiling tools, and looking at CPU cycles, you may want to use PAO. It really depends.

I can imagine AI and ML type workloads, all of that. So there are a couple of questions here, and I'm actually going to take them in reverse order, because, Nate, I'm going to use your question to ask Frank to do a demo. PNUW asks: when doing .NET app modernization, with Windows Server and Windows containers in the cluster, does the Performance Addon Operator work with those Windows nodes?

Nope.

I suspected as much. Machine configs don't really work with Windows nodes either, but just in case.

No, no. We are Red Hat, and what we provide is very low-level tuning of Red Hat Enterprise Linux, which can be packaged as CoreOS and shipped with OpenShift or OpenStack or other products, or as plain RHEL. That's where our knowledge is, and that's where we have a team pushing patches into the scheduler to make sure nothing disturbs your workload, because you always have pieces of the kernel that want to run on every core at any time to gather statistics, to do some LRU housekeeping, that kind of thing. That's what we want to tune. It's largely about isolation, making sure your application doesn't get any CPU cycles stolen.

Yeah. So, as I said, I'm going to use Nate's question to ask you to launch into the demo. Nate's asking: with the Performance Addon Operator, can you use labels and node selectors to limit which nodes the profiles are applied to? I suspect that if we look at how to deploy and apply the Performance Addon Operator to a cluster, that will answer a lot of those questions.

Yes. Unfortunately, on my cluster I only have two physical worker nodes, so they are both in the same machine config pool, the same MCP.
But if you have multiple MCPs, you can basically have one performance profile per MCP, as many as you want. You can even do more complicated things on top of that. We can go through my setup and the parameter values if you want; the deployment of PAO itself is well documented. What I'd like to talk about more, if it's of interest to the audience, is the use cases and what these parameters actually mean. Because you have two kinds of users. You have users who come from a very low-level background and are discovering Kubernetes: they know exactly what they want to do, and they ask, okay, how do I translate my Linux configuration into an OpenShift configuration? I can help those people understand how to translate their servers, because there are a lot of things in between that you need to configure and that you aren't aware of when you join the Kubernetes ecosystem, like CRI-O and so on. And then you have others, more cloud-native people, who are trying to discover the low-level stuff. So this is going to be an introduction, because at some point you need to read the programming guide; it's 900 pages, and you need to read it. Sorry.

A little bit earlier you said that you can have a performance, I'm going to call it a profile, I don't know if that's the right term, per machine config pool. That makes sense, because you would assume all of those nodes have the same configuration, the same hardware, the same set of things; to me that's logical. I also like how you describe those two groups of people, because I probably fall into the "I have an application, now I need to tune it and figure it out" group, whereas, to your point earlier, in theory the developer, the application team, should know exactly what they need, so they should be communicating that and working from the bottom up.
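To make the walkthrough that follows easier to picture, and to answer Nate's node-selector question, here is a minimal sketch of what a performance profile of this general shape looks like. The CPU lists, hugepage counts, and node label are illustrative values, not Frank's exact configuration.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: dpdk-ready
spec:
  # Profiles are scoped by node selector, which in practice lines up with a
  # machine config pool: one profile per MCP.
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  cpu:
    # Housekeeping cores: kernel threads, interrupts, system daemons.
    reserved: "0-1,36-37"
    # Candidates for isolation; they still run regular work until a pod
    # explicitly requests dedicated CPUs.
    isolated: "2-35,38-71"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - size: 1G
      count: 32
      node: 0
    - size: 1G
      count: 32
      node: 1
  numa:
    # Keep a pod's CPUs, memory, and devices on the same NUMA node.
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: false
  net:
    # Trim NIC interrupt queues down to the reserved cores.
    userLevelNetworking: true
  # additionalKernelArgs:
  # - nosmt    # one way to disable hyperthreading, as discussed earlier
```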
So let me share my screen. As you can see here, this is a cluster with three masters and two worker nodes. These worker nodes are Skylake machines, 72 CPUs in total; you divide by two to get the number of physical cores, and divide by two again to get the count per NUMA node. Let's go onto one of these compute nodes. Just for the audience, these are reference machines from Intel in my lab, and the code name was Wildcat Pass. I've got three of these physical machines, and one of them hosts my three virtual masters. The other two, named cat and pass, are connected back to back with cables, so I can do SR-IOV and have a traffic generator on cat and a CNF running on pass, push millions of packets per second, and check that no cycles are being stolen because no packets are being dropped. Okay, I don't have numactl on this one. What's important on this machine, and I'll be brief, is that as you can see I've got a lot of CPUs, I've got hyperthreading, I've got two NUMA nodes, and plenty of memory. What I want to show you is the kernel command line; these are my boot parameters. As you can see, I've got TuneD running with a very specific core mask, I've got huge pages, I've got the systemd CPU affinity set, and I've got the IOMMU enabled. And if I go to the default interrupt affinity, the default SMP affinity is not ffff, because I've already got pods running. This is something I will explain in a bit: if I tear down all of my pods, you will see that the mask of the cores available to process interrupts goes back to ffff, and as soon as I start an isolated pod, its CPUs are not available for interrupts anymore; they are pulled away. That's thanks to the TuneD profile that is applied. So that's my physical worker node.

Now let's have a look at my performance profile. I can do many things here; I'll be quick, and then we can come back to it. My performance profile is named dpdk-ready, as you can see. I want to configure some sysctls which are unsafe, because my CNF wants to be able to reject IPv6 router advertisements, and I can do that with PAO using this syntax. It's called experimental because we hope that at some point we'll have a proper hierarchical tree for the kubelet configuration, but that's a longer-term thing, and even though it's labeled experimental, it's fully supported by our teams for our customers. Then this is my CPU topology. I've got a dual-socket system: these CPUs are on socket 0, these are on socket 1, and the siblings pair up this way, 0 with 36, so that one is the hyperthread sibling of that one; I've got hyperthreading on. And I split my CPUs between reserved and isolated. For reserved, I've chosen to leave two CPUs per NUMA node; that's a lot, usually you leave just one, but this is my lab and it's for R&D. And those others are isolated. One thing: when you look at "isolated" you think, okay, they're isolated, so I cannot use them, and this is where the labeling is sometimes misleading. These CPUs are candidates to be isolated, but until a pod requests isolation, they're going to run whatever needs to be run. That's what's a little tricky: isolated is not really isolated yet, and that's Kubernetes terminology, not PAO terminology. Then I configure the huge pages, 64 in total, so 32 and 32, split across the two NUMA nodes; that's how it works. One-gig huge pages; I don't like two-meg huge pages. Why do we use huge pages, by the way, do you know?

If you're asking me, it would be purely a guess, and that is because we want
to avoid additional allocations in the memory map, so it's one big chunk instead of a bunch of little chunks.

Almost. It's because you want to avoid the cost of a TLB miss. With huge pages you don't take TLB misses, and a TLB miss is one of the most expensive things that can happen to you in terms of CPU cycles, because you have to walk the page tables; the worst case is something like four fetches, so you can have four cache misses in a row.

Is that TLB related to NUMA, or is that something else?

It's per CPU. Each and every core, if you want, has its own TLB, which is a little cache of the resolution from virtual to physical addresses. When you have one-gigabyte pages, the resolution is already done for the whole gigabyte; you don't need to do any further virtual-to-physical resolution.

And just out of curiosity, because I'm learning things: is there rhyme or reason to sizing those huge pages?

That's a hardware constraint. The Intel CPUs I'm running on offer two-megabyte or one-gigabyte huge pages.

Got it.

And two megabytes just isn't enough for the size of the applications we usually run, because for a CNF it's very common to have something like a 32-gigabyte process, and you want to lock all of your memory into huge pages so you have zero misses, because again, in CNF and telco land, the unit of account is CPU cycles per packet: to process one packet, how many CPU cycles am I going to consume?

And Thomas asks: TLB, tango lima bravo, is that the correct acronym?

Yes, that's the translation lookaside buffer. It goes back to the concept of physical memory and virtual memory. Your CPU manages virtual memory, but it's backed by physical memory, and you need to do the mapping, right? To do the mapping you have page tables, and every time you resolve a virtual address to a physical address, you may find a page that is missing, so you allocate the memory on the fly. With one-gig huge pages, your gigabyte is set and locked in memory; it's not going to be swapped out, it stays in memory, so basically you're super fast, but you don't share.

Got it. This is all interesting to me. I've had some awareness of NUMA, and again, I'm an old virtualization administrator, so we would have workloads that needed NUMA awareness, that needed to be pinned to a specific NUMA node for performance, but learning the details is super interesting. I also think it helps to understand when it is and isn't applicable, because there is such a thing as over-tuning too, right?

Oh yes, yes. I try to avoid that: when there's a parameter I don't understand, I just don't set it.

I've always told people, and I know you're setting a whole bunch of things in here, change one thing at a time, so that if something changes, positive or negative, you know what caused it.

Yes, and every time you change a PAO parameter, you often need to reboot your worker node, because a lot of these parameters translate to boot parameters plus other settings, so a reboot is necessary and it can be time consuming.

And Nate, you're correct there: huge pages avoid TLB misses, exactly, and TLB misses are what you want to avoid, because they're one of the most costly
events you can have: you're executing code, then you access memory, and boom, you have to walk the page tables, and you take four or five cache misses in the worst case. And that's without virtualization; with virtualization it's even worse, you can have up to something like 16. So you want huge pages. Now, about single NUMA: Kubernetes gives you various policies with the Topology Manager, and you can ask the Topology Manager to make sure your pod has all of its resources on the same NUMA node. What are the resources? Memory, devices, CPUs. Why? Because every time you cross NUMA nodes, your performance suffers, and it suffers big time. If your memory is on NUMA node 0 and your CPU is on NUMA node 1, you need to cross the bus between the two sockets to access your memory, and the same goes for your devices. Generation after generation the chip vendors have done great work and the throughput is okay, but you still have a latency penalty, and latency matters. It's not only how many packets per second I can get in and out of my network application; it's also the latency, in particular when we think about 5G, where we have a very limited time budget to get through the whole RAN to the core. A packet needs to be processed and sent back out on the wire as soon as possible, because the budget is tiny compared to 4G. So single NUMA is really about the locality between memory, CPUs, and devices, the device in this case being primarily a network interface, but it can be a GPU, or a look-aside accelerator like a Fourier transform accelerator or a crypto engine.

I know even some high-performance databases benefit from NUMA awareness and all of that. And PNUW, thank you for linking the documentation there, that's very helpful. So Frank, I feel like I've distracted you long enough; I want to make sure you have time to finish.

That's fine. You have all the documentation, and the last part is these comments, which are my own comments related to my platform, because next week I'm on PTO, and when I come back from PTO I want to remember why I put a given parameter there. One thing which is very important: in PAO, you want to make sure that all CPUs are listed here, because if you forget one CPU, Kubernetes is going to schedule workloads on it with absolutely no tuning applied, and Kubernetes may schedule an isolated pod, with containers that require isolation, onto it, and it's not going to be isolated. So you want to make sure all of your cores are either isolated or reserved: no CPU left behind. And again, isolated does not mean isolated; it means the core is a candidate to run isolated workloads. As I'll show you by killing my pod in a moment, my interrupt mask will grow back to cover all of the CPUs as soon as my isolated pod is destroyed; it's really dynamic. Basically, with PAO you have specific boot parameters, but from a Kubernetes perspective, until you start a pod with dedicated resources, you can use the node as a regular over-committed node.

A quick follow-on question here from Nate: TLBs, are they a CPU feature or a kernel feature?

It's a CPU feature; it's built into the silicon.

And finally, there's something I'm going to show you on my worker node related to interrupts. So why am I talking so much about interrupts?
Interrupts are asynchronous events that trigger kernel activity, typically to process packets, or to do I/O when the interrupt relates to a disk, and every time you run the kernel, well, you don't run your application, so you steal cycles from an application that needs 100% of the available CPU. That's why you want to get your interrupts out of the way. And Nate, this is something interesting: by default, when you start a Linux machine, a lot of drivers create one interrupt per available CPU. In my case I've got 72 CPUs, so for each and every network card I've got something like 72 interrupt vectors, and that's a lot. The point is, if you want to avoid interrupts on an isolated CPU, and this is a little complex, you have a hardware constraint. You want to pin those interrupts to CPUs which are not isolated, but on a given CPU you cannot have more than 224 interrupts, because that's a hardware limit, part of the silicon: there's a limit of 256 vectors, but 32 are already taken, so 224 remain. So if you end up with 600 interrupts on your system, it means you need at least three non-isolated CPUs to handle all of them; if not, the hardware is not going to tell you, and one of the other CPUs is going to end up processing interrupt vectors without you seeing it. That's interrupt overspill: you want to pack your interrupts onto two CPUs, but you have more interrupts than those two CPUs can accept in their interrupt tables. That's why we have this option in the Performance Addon Operator to say, okay, I'm going to have only as many interrupt vectors as I have reserved cores. The rationale is that the reserved cores are there to do the non-container work, running systemd, OVS and so on, and those can be interrupted. You also don't want only one interrupt per NIC, because you want some multi-queue behavior: if at some point you get a burst of network traffic, you want to be able to spread the load across multiple CPUs. So basically, with userLevelNetworking enabled, you spread your interrupts across the reserved CPUs at a minimum, and in practice across all the CPUs that are not currently running isolated pods. Okay, I'm going fast; it's complicated.

Well, and just to be clear, without having those CPUs reserved for interrupts, it has two effects, right? One, it literally interrupts the processing of the application, which potentially slows the application down; and two, because you're relying on the CPU scheduler across all of those already busy CPUs, it can slow down things like network throughput. If you're pushing 100 gigabit or something like that and your CPUs are running near capacity, suddenly you're not able to hit the throughput you'd like, and you get something lower instead.

Exactly. So, very briefly, that was my performance profile; now let's have a look at my pod descriptor. In telco we have a very popular application, a DPDK test program named testpmd, and I've got a lot of annotations here, but what matters in this container is the dedicated resources, right? So you have east and west... sorry, what's running is not what you see; I've got too many machines.

I have that issue as well.

So this is the right one. What's interesting here, for instance: this is the sysctl that you saw earlier, and I'm
setting the value to zero, and I can only do that because I allowed it in PAO, so it's safe here. I'm requesting four CPUs, two pages of one-gig huge pages, and one gig of regular memory, because you always need some regular memory. That's basically what I have, plus two SR-IOV interfaces, and I've got some extra annotations here. Very important: you need to set the runtime class, whose name comes from the performance profile; you can get it by following the documentation. Then, what do I say? Interrupt load balancing disabled: what I'm saying is, when you start this pod, please move all interrupts away from my CPUs. This pod has four CPUs, and there will be no interrupts on those four CPUs. In former versions of OpenShift, 4.7, that was a global option by default, but now it's per pod: each and every pod has to opt out of interrupt load balancing and say, okay, I want the interrupts moved away. Then there's this one: we have a bug in CRI-O that is going to be fixed upstream at some point, but until it's fixed you need to set the CPU quota annotation to disabled, because even if you request whole CPUs, CRI-O is going to CPU-throttle your workload, and that's bad, so you need this option. And finally, you can also unplug CFS, the Linux fair scheduler. Here I've got four CPUs, two cores plus their hyperthreads, and when I start an application, by default the Linux scheduler is going to try to reschedule it across those four CPUs and do its best to spread the load. We want to disable that behavior most of the time, because we want to avoid extra kernel activity. So with this annotation you disable the fair scheduler for the pod: once a task is started on one CPU, it stays on that CPU until it dies. The application on top can still pin itself to another CPU via taskset; the Linux scheduler won't move the application to another CPU, but the application can move itself if it wants to, and it can start and pin its own threads. This is the behavior people from the old days had with isolcpus as a boot parameter, which disabled CFS on those cores, so this is a way to get that behavior back.

So, once I've got my testpmd pod running, it's running on my machine pass, sorry, let's go onto pass. If I go onto my machine pass, you will see that some CPUs, exactly four (you can see the c in the mask, and c is 1100 in binary), have been taken out of the interrupt mask. I'll be brief, but what I'm going to do to show you this is delete the testpmd pod on pass. The pod is no longer there, and as you can see, any CPU can take interrupts again. Every time you run an isolated pod with the proper annotations, this mask gets updated, and all interrupts on the system are moved off the CPUs that have been allocated to that pod. So those CPUs really are isolated from the kernel, and of course they are isolated from other pods, from Kubernetes. This is really the low-level side of things that people don't usually think about: okay, I've got a kernel running, and the kernel can preempt my userland.

I'm just catching up, because I was responding to a question in chat from Merafet: I don't know whether or not Alertmanager supports SNMP traps as a destination, so I'll have to check with one of our SMEs and get back to you on that. You can tune in in two weeks and we'll talk about it at the top of the stream during top of mind, or feel free to send me an email at andrew.sullivan@redhat.com, and as soon as I can get that answer out to you, I'll also include it in the blog post that follows, so keep an eye on cloud.redhat.com/blog.
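Pulling together the pod settings Frank just walked through, here is a rough sketch of what such a pod spec can look like. The image, SR-IOV network names, and exact resource counts are illustrative; the CRI-O annotations, the runtime class, and the guaranteed (requests equal limits) resources are the parts that matter.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpmd
  annotations:
    # Attach two SR-IOV interfaces (network attachment names are hypothetical).
    k8s.v1.cni.cncf.io/networks: sriov-net-east,sriov-net-west
    # Move device interrupts off this pod's dedicated CPUs.
    irq-load-balancing.crio.io: "disable"
    # Avoid CFS quota throttling on whole-CPU pods.
    cpu-quota.crio.io: "disable"
    # Don't let the fair scheduler rebalance the pod's tasks across its CPUs.
    cpu-load-balancing.crio.io: "disable"
spec:
  # RuntimeClass generated from the performance profile; check the exact name
  # in the cluster (oc get runtimeclass) or in the documentation.
  runtimeClassName: performance-dpdk-ready
  securityContext:
    sysctls:
    - name: net.ipv6.conf.all.accept_ra   # the "unsafe" sysctl allowed via PAO
      value: "0"
  containers:
  - name: testpmd
    image: quay.io/example/testpmd:latest      # hypothetical image
    resources:
      requests:
        cpu: "4"
        memory: 1Gi
        hugepages-1Gi: 2Gi
      limits:
        cpu: "4"
        memory: 1Gi
        hugepages-1Gi: 2Gi
```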
Mark is curious to know about hyperthreading for an OpenShift cluster running in virtual machines; it seems to make sense to set the VMs to use two threads per virtual CPU. That's an interesting question, in that everything you've talked about here is effectively OpenShift running on bare metal, with direct, native access to control everything about the system. Does this still apply if we're running in virtual machines? Do you need to configure things at the hypervisor level to pass things through, or...?

No, no. When you run in a virtual machine, you run on a virtual server, and a server is a server: any server exposes hyperthreads or not, and it depends on how Linux and the hypervisor are configured to show you CPUs numbered 0, 1, 2, 3, 4 or something a little fancier that exposes a NUMA topology and hyperthreading. Basically, you know what a VM is: you enable VT-x and you run the same thing. And by the way, on OpenStack we have a complete end-to-end tuning for telco, so it applies there too. I did a demo of this before COVID in Denver; if you search for my name and the Open Infrastructure Summit in Denver, we did a live demo on stage of Kubernetes on OpenStack, end to end, with all of the tuning applied manually, without PAO, and I'm glad we have PAO now.

Speaking of the OpenInfra Summit, I think we missed the session submission deadline; I think it was yesterday at 5 p.m. Central, so unfortunately, if anybody wanted to submit a session, we missed that one. Mark, in RHV I noticed that the recommended setting for x86 architecture is one thread per core/CPU. That's going to vary. I think what you're asking is: if the physical server has hyperthreading enabled and the hypervisor is taking advantage of it, should you use those hyperthreaded cores for your OpenShift nodes? And the answer is: maybe. It'll depend on a bunch of different things in your application. I always remind people that oversubscription at the hypervisor level can significantly affect performance at the OpenShift level. In VMware there's the latency sensitivity setting, which basically reserves all the resources; in RHV there are I/O threads, CPU reservations, and the threads-per-CPU setting you're calling out, all of which can affect performance. So if you are seeing performance degradation in OpenShift as a whole, or especially on the control plane nodes, use those hypervisor features to dedicate resources to it or give it more priority. If you built your cluster before the RHV integrations, I don't think the Red Hat Virtualization VMs are using anything special, so please reach out to me; the RHV engineering team is really good about answering those questions and they're really curious about this stuff, and I know the product manager would be very interested in any use cases or concerns you have. So, Mark, feel free to send me an email at andrew.sullivan@redhat.com; I posted it just before your chats there. And thank you, Stephanie, for putting my contact information up there, that works too. Excuse me. So, I know we're about six minutes over according to my clock here, and I don't want to intrude into your evening; I
know that over in France it's getting later in the day, and I definitely don't want to cut into your off time, your family time, Frank. So if anybody has any last questions, please go ahead and submit them into the chat and we'll address them as they come in. Is there anything else you want to cover or talk about? I'm digging through our shared document here to see if there's anything left.

I recommend people read the PAO documentation. The page is really long; it reminds me of the TCP code in the BSD stack, which is one file with the whole TCP stack in it, and it's about the same size, but everything is in there. And at the bottom you'll find something important. People talk a lot about DevOps, but in the end the best practice is that when you do something, you want to test that what you got is what you expected. So when people use PAO, you need to think about, okay, how am I going to make sure I've got the isolation I want? On that page you'll find some handy tools that explain how to check: if something is wrong, you run a tool and you will see, for instance, "I see CPU preemption, that's bad," if that's not what you wanted. So it's not only about understanding it; you need to think about how you're going to test it, and we have great tools linked from the product page.

All right, well, thank you so much for joining today. Like I said, this has been a really interesting one, and I've learned a lot about the lower-level performance tuning and performance capabilities, stuff that I've had, I'm going to say, the luxury of not having to pay attention to over the last few years. But it's still really interesting to me; again, it punches all those buttons for me, the nerd knobs, so to speak. For our audience, thank you for joining us today, we really do appreciate it. Again, if you have any questions, or if there's anything that comes to mind after you've watched the stream, or if you're watching it not live, feel free to reach out: you can find me on social media at practicalAndrew on Twitter, or via email at andrew.sullivan@redhat.com, and of course I'm more than happy to bring Frank into the conversation if needed to help answer those questions. One last reminder: we won't have a stream next week because of the What's New presentation, so be sure to tune into that. In two weeks we'll be back, Johnny will be back from PTO, so it'll be the whole crew here. Stephanie, in the background, thank you so much for all of your help today, posting links and keeping us on point. And with that, I hope everybody has a great rest of their week, a great weekend, and stay safe out there. Thank you.