Let's get into the big topic of open source, something that we actually have in front of us. This is so awesome. We are an open culture that is actually able to fix and improve the process, something that a developer, or let's say the community's ecosystem, really brings. Welcome to this week's episode of the Ask an OpenShift Administrator Office Hours livestream. I am your host, Andrew Sullivan, technical marketing manager with the OpenShift business unit, and I am joined, as always, by my lovely co-host, Mr. Johnny Ricard. Hey, how's it going? It's going great. It's another beautiful Red Hat Wednesday. And now I'm hoping that my stream here doesn't crash, because I'm getting an error in the interface. What's the worst that happens, right? I saw that too, so yeah. Okay, so long as it's both of us, right? We'll all go down at the same time. So, hello, everyone. Welcome to the stream this week. I hope that it has been an exciting and productive week since the last time we talked. This week, we are joined by me, myself, and Johnny, so we'll be running the show again this week. We don't have a guest, but we have a pretty interesting topic. This is something we were both kind of scratching our heads about after the stream last week, thinking, what are we going to talk about? And I don't know if it was a bolt of lightning, I don't even remember what caused me to think of this, but it was like, wait, we can just talk to the field, right? So we talked a little bit about some topics, and I ended up sending some emails out to various large distribution lists inside of Red Hat, and asked some questions, in particular: what are some things that either you recommend, or you see customers doing, basically universally? Sometimes you'll hear these called best practices, but I think we shy away from using the term best practices. And the reason for that is that best practices are really a majority thing, right? It's where at least 51% of people maybe should do something like this. And so they can be confusing, especially if you're doing something that is perfectly valid. There's nothing wrong with it, it's a very good, optimal thing to do, it's just not for everyone. So I've tried to stay away from the term best practices and instead call them recommendations, or common things that we see happening. It was really hard to come up with a terse title for this stream. But yeah, Johnny, you were a consultant for a long time, right? I don't know about your experiences with the term best practices. Yeah. So for a while, I think initially it wasn't a big deal, and then it became taboo to talk about it. And so then it was "industry standard," or whatever kind of common name we could give it where it was clear without having to explain, okay, it's a best practice. Yeah. And I'm always hesitant to convey or construe that anything is mandatory. I shouldn't say that, right? What's the thing about, you know, never use always and never? So of course it's mandatory to have things like etcd; you can't replace it with something else inside of OpenShift. But there are very, very few hard-and-fast, you-must-do-this type of things. Beyond that, it's all situational.
It's all depending on what you're trying to do, what you want to do with your applications and your users, so on and so forth. Yeah, I would wrap that up with my customers with: I'm not here to tell you what to do, but I am here to strongly discourage you from doing something like that. And it's ultimately up to them to make that decision, because it's their cluster and they have to live with it. Yeah. So, hello to everyone in the chat, Chandler, Tiger, Korak, our hope 9. It's good to see everybody today. I love that our chat has been much more interactive, basically in the new year, right? It kind of picked up in 2022, and it's really refreshing to see everybody interacting and chatting, so thank you all. Before we get lost in chit chat, let's talk about some of our top of mind topics. I accidentally came up with kind of a long list today, and I don't want to spend too much time because I know we want to get to our topic, so let's do these a little bit rapid fire. First, a couple of logistical things. Reminder that tomorrow is the What's Next presentation from the product management team. The What's Next presentation is the roadmap presentation, so you'll see what's going to be coming in 4.11, 4.12, 4.13, but just remember, the further out we get, the more likely it is to change. They make their best effort to predict what's going to happen, but we never actually know for sure. That is tomorrow, April 14th, at 10 o'clock in the morning Eastern time, so please adjust for your time zone appropriately. It'll be recorded, so you can find it in perpetuity here on YouTube. We also have a nice little quick link, it's openshift.com slash next and new, all one word, no spaces, and that takes you directly to the landing page that has the current versions of the What's New and What's Next presentations. So Johnny, I think you'll be here? I'll be here for sure, answering any questions, stuff like that. Usually Mike Foster from the ACS team is here, and I think Christian will be joining us as well. So if you have any questions, or if you just want to chat about the roadmap while it's going on, we'll be here. Yep, for sure. Also, just a reminder that coming up in less than a month now is Red Hat Summit. Summit is a hybrid event this year, so there is some in-person and some virtual; I would say the vast majority of people will probably be virtual again this year. They'll be talking about a lot of the stuff that's coming up in Red Hat, not just OpenShift but Red Hat as a whole, and it's the usual Summit keynote presentations and all the exciting stuff going on there. So if you are interested and you haven't registered yet, definitely go ahead and do that. Pretty simple: redhat.com slash summit. All right, let's talk about some actual tangible things instead of logistical things. Let me share my screen here. So the first one I want to talk about is the etcd data inconsistency issue we've been talking about for, I don't know, two or three weeks now, Johnny. Yep. Again, Red Hat has become aware of a potential data inconsistency issue present in etcd versions 3.5.0 through 3.5.2, which is used in OpenShift 4.9 and 4.10. As of right now, upgrades are still blocked between 4.8 and 4.9. You can still use the fast channel to update from 4.9 to 4.10.
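If you haven't changed channels before, it's just one field on the ClusterVersion resource. Here's a rough sketch, shown trimmed; the channel name is only an example, so match it to your own minor version:

    # a minimal sketch of moving a cluster onto the fast channel;
    # shown trimmed, and the channel name is an example
    apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    metadata:
      name: version        # there is exactly one of these, always named "version"
    spec:
      channel: fast-4.10   # candidate-4.10, fast-4.10, or stable-4.10

In practice you would edit just that one field, something like oc patch clusterversion version --type merge -p '{"spec":{"channel":"fast-4.10"}}', and then oc adm upgrade will show you which update edges are available.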
That path is not yet in the stable channel, but it is not blocked. The good news here is that I was looking through various BZs and other things this morning, and it looks like there is a fix that might be shipping in 4.9.28. So let me grab this link from over here. This is the errata that was shipped; you can see it's a bug fix advisory, and it went out today, the 13th. If we scroll on down through here, OpenShift Container Platform 4.9.28, and as we go down towards the bottom, we have this fix here for the potential etcd inconsistency between revision and data. So 4.9.28 is in fast, which means that it is a generally available, fully supported release, and you can do that upgrade today. I don't think they have unblocked the 4.8 to 4.9 path, though, so even if you do upgrade today, you still won't be able to do that update, and I don't know when that will be coming for 4.8. But yeah, it looks like we're on the cusp of getting some fixes here, assuming everything works as they expect it to, as they hope it to. So good news for everyone. Yeah, nice. Data inconsistency, you know, data loss is, as a former storage guy, terrifying. We avoid that like the plague. Oh yeah. Any highlights on Red Hat OpenShift learning subscriptions? If you can provide some clarification on that, because I'm not quite sure what you might be referring to with learning subscriptions. You can get a developer entitlement to OpenShift just by having a developer account, right? It's the standard 16-node thing. I have not checked recently, or followed up, to see if they have unblocked the issues that were preventing you from assigning it to a self-supported cluster in there and getting away from the eval designation. But as far as I know, there's nothing wrong with leaving it as an eval. You'll still be able to connect and get updates and stuff like that if you're using an eval license; it's just that if you look in the console.redhat.com interface, it's angry at you. So long as you're sticking to the developer entitlement restrictions, you know, single user, non-production use cases, I would say, and this is Andrew, not a lawyer, not an official representation, but it feels like we wouldn't be too angry with you about that. Any updates or plans for supporting hierarchical namespaces? I don't know the answer to that. We can poke the PM team and see if they'll talk about it tomorrow. I've heard some rumblings about it, I've seen some things talked about, but I don't know if there are any concrete plans with regard to it. Actually, that brings up a good point. I don't know how many people know this, but you can actually go to the Red Hat Jira. Let me make sure I have the URL right: it's issues.redhat.com, and you can go to browse projects here. And, you see, I'm not logged in at all. I hit the wrong button. Somehow I just logged in, and all I did was click the side button on my mouse by accident. What the heck just happened? Oh, I'm logged in as my regular user account, so this is my developer account. So if we go to browse projects here, you can see there's a whole list of projects across Red Hat and various other things. You can do things like click on OpenShift Container Platform in the left-hand menu, and it'll just be OpenShift related things. But for example, I can come in here and see, oh, etcd, what's going on with etcd, and click on this Jira project.
And after giving Jira a moment to think about it, there is a way to look in here and see all of the stuff that's going on. Yes. Okay. Thank you, Jira. And see all of the stuff that's going on inside of here, browsing the site with this account. So you can get an idea of all the stuff that's happening across the various releases. There's, oh, here's issues. So here are all of the etcd issues; here's an enhancement, IPI scale down support. And again, these are all public, so anybody can come and look. I don't know if you can make comments or not. But it's a great way to look and see what's coming, or get an idea of what's happening inside of OpenShift and the various components of OpenShift. I will point out that, if you go back to that browse issues thing, let's see, we'll go back to here, I have learned over time that Jira search is a little strange. So if you take, for example, the OpenShift installer, so this project, CORS, you can see the key here. You can go to search and search for, like, CORS IBM Cloud IPI, and it'll search just within that particular project, and we can see here that IBM Cloud IPI is done, it was added in 4.10. So, a little trick if you're not familiar with it. And if you can make comments, I'm sure the PMs would love to hear from you; you can come down here and make a comment to say, hey, I'm customer such and such, if you're okay sharing that information, and we would like to see this feature added. Yeah, Tiger just called out that if the project is set to public, you should be able to comment on it. Okay. Yeah, I thought so, but I wasn't 100% sure. They've spent the last year, year and a half, going through and making as many of these public as possible, and both the PMs and the field have been told: if there's private information, customer information, make sure you mark that comment or that issue as private. Otherwise it's open by default, kind of like Bugzilla has been. Why can't a machine be part of two different machine config pools? Essentially because, and this is my understanding, you would have a potential for conflict between the two pools. Pick a simple, silly file, right? I'm going to use the SSH host keys, or authorized keys rather. If you have two machine config pools, and one of them allows access with Johnny's key and one of them allows access with my key, they'd basically bounce back and forth trying to replace that file. So the right way to do it is effectively to have one inherit from the other. We see this fairly frequently with infrastructure nodes, where the infra machine config pool inherits from the worker machine config pool and then adds some additional configuration on top. Let me know if that didn't answer your question. Johnny, anything to add there? No, I think that's pretty accurate. Okay. Let's see, two more things to go. So somebody asked in one of our internal mailing lists: hey, my customer is using IPI, but their DHCP server is a single point of failure; does Red Hat have a recommendation or a set of guidelines for the DHCP server? The answer to this is: not really. And that's because there are lots of different DHCP servers and services available. So really the guidance, or the request, is: make sure it's a highly available enterprise service.
How you do that is going to depend on which DHCP server you're using. If you're using Windows for your DHCP server, there's a separate set of tooling you can use. If you're using Quay, no, not Quay, what is the name of that thing? So there's the ISC DHCP server, which is kind of the most common one, but they also have Kea. That's what it is. Kea is another one you can use, and I've started experimenting with it; it's super easy, it's like a JSON config. I don't know how many people use it in production or anything like that, but I thought it was interesting that the ISC folks have two DHCP servers. So yeah, whether you use Infoblox, or the regular ISC DHCP, which I think is what most RHEL based setups are going to be using, because you do a dnf install dhcpd and that's what you get, it's vendor independent, and that's why we don't really provide any specific guidance there. vGPU is supported for OpenShift Virtualization based VMs from OpenShift 4.10; it's in tech preview in 4.10, I should say, we should clarify. Does OpenShift Virtualization help with hierarchical namespaces for achieving multi-tenancy on GPUs through vGPUs? That's interesting. I don't know, but I do know who we can ask, and that would also be the PM team. Actually, that one I'll send to the engineering manager. So, Shrikant, if you would like, please send me your question in email, andrew.sullivan at redhat.com, and I will ping the right people inside of Red Hat and we'll get you a response to that question so that we can find out. And I'll make sure your other question, about hierarchical namespaces, gets answered too, and I'll do my best to remember to post that in the chat tomorrow during the What's Next session. What's the difference between a machine set and a machine config pool, from Alan. So, a machine set is what defines a machine. I happen to have a cluster here where we can look at this, so I come down to Compute and go to machine sets. This is an AWS cluster, an AWS IPI cluster, and we see we have three machine sets inside of my default deployment. If I look inside of these, I can see things like, in the spec here, what availability zone and region I want to use; as we come down here, what block devices to assign, so every time it creates a new machine based off the AMI, it will assign it a gp3 120 gigabyte disk; here's the AMI ID to use, so on and so forth; here's the subnet to connect it to. So this defines, effectively, the template used whenever it needs to create a new machine to scale up the size of the cluster. Machine sets define the template; machines are the actual provisioned nodes from those machine sets. We're just looking at 2a here: if we look, we have this us-east-2a worker, and it should match what we saw inside of that machine set.
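To give you a feel for it, here's a heavily trimmed sketch of what an AWS machine set looks like; the names, AMI ID, and subnet tag are all placeholders, and a real one carries quite a few more fields:

    # a heavily trimmed sketch of an AWS machine set; the names, AMI ID,
    # and subnet tag are placeholders, a real one has many more fields
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    metadata:
      name: mycluster-abc12-worker-us-east-2a
      namespace: openshift-machine-api
    spec:
      replicas: 1
      selector:
        matchLabels:
          machine.openshift.io/cluster-api-machineset: mycluster-abc12-worker-us-east-2a
      template:
        metadata:
          labels:
            machine.openshift.io/cluster-api-machineset: mycluster-abc12-worker-us-east-2a
        spec:
          providerSpec:
            value:
              apiVersion: awsproviderconfig.openshift.io/v1beta1
              kind: AWSMachineProviderConfig
              ami:
                id: ami-0123456789abcdef0   # the RHCOS image for this region
              instanceType: m5.xlarge
              blockDevices:
                - ebs:
                    volumeSize: 120         # the gp3 120 GB disk mentioned above
                    volumeType: gp3
              placement:
                availabilityZone: us-east-2a
                region: us-east-2
              subnet:
                filters:
                  - name: tag:Name
                    values:
                      - mycluster-abc12-private-us-east-2a

Scale the set and the machine API stamps out new machines from that template, for example with oc scale machineset mycluster-abc12-worker-us-east-2a -n openshift-machine-api --replicas=2.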
A machine config pool, on the other hand, is a collection of machine configs. So we see here we have machine configs that go through, and here's the base level 00-worker, here's the worker container runtime, here's the worker kubelet, here's the worker SSH config, so on and so forth. What the machine config pools do is effectively collect, by label, a set of machine configs, and they generate, and we see here this rendered-master and this rendered-worker, the rendered machine configs. Those rendered configs contain all of the matching, labeled machine configs merged together, and the pool then applies that to the set of machines carrying an equivalent label. So if we look at our worker config pool, in the spec here, the configuration name is rendered-worker, and we see all of the source machine configs that it's matching. Down here is the machine config selector: anytime a machine config has a label of machineconfiguration.openshift.io/role: worker, and anytime a node has a label of worker, it will bring those two things together. So that's the three minute version. Yeah, this question gets asked quite a bit, actually, because I think the machine config pool is the one outlier that seems like it would be obvious, because you see the pool and you're like, oh, it's just a pool of machine configs, but at the same time you're kind of like, but what does it do? Why is it there? Why are we actually defining this thing? Yeah. And Alan asks, are there any templates for bare metal, is bare metal considered a provider? Yes, with a caveat: so long as you are doing bare metal IPI. It uses what's called Metal3, "metal cubed," which is a flavor of Ironic, to manage those machines. If we do a quick search here for metal3, we end up at this page. This is the upstream project for interacting with those machines, and effectively, we see here the bare metal operator and all of that. So if you follow the documentation, we go to the docs, go down to 4.10, go to the installing section, as soon as I can find it, and we look for deploying installer-provisioned clusters on bare metal. This walks through all the steps, all of the config that you need in order to use Metal3 to deploy and manage those machines. It connects over IPMI, or it'll use, you can see the out-of-band management network here, it connects to those nodes and uses the standard mechanisms for doing hardware management, Redfish, so on and so forth. And if you want to set it up in a lab, there's actually a really cool project called sushy that can emulate that; I'll have to dig up the link for it. All right, real quick, the last of the top of mind topics. If you didn't happen to notice, and I didn't see this before our stream last week, the Kubernetes folks published their changes in Kubernetes 1.24. So, always keeping up with what's happening inside of Kubernetes, right? I don't know for sure if 4.11 will be 1.24 based, but I would say it's a safe assumption; if it's not 4.11, it will definitely be 4.12. Remember, back in the 4.9 timeframe, we were talking about the things that were removed: how do I know about that ahead of time, how can I get as far ahead of it as possible? This is as far ahead as possible without going into GitHub and actually looking at the changes there. As we scroll down here, you see there's a whole section for API removals and deprecations, things like dynamic kubelet configuration, and the migration from the in-tree provisioners to CSI and away from the in-tree stuff. Yeah, I thought this was an interesting one, I called this out to you the other day, Johnny: we are finally moving away, as part of the whole, gosh, what is that called now, the language initiative, friendly language or non-obtrusive language, I can't remember the name of the project now and this is really embarrassing.
Anyway, so we're finally moving away from some of those long-term things, so you'll see the master label is no longer present, and it's instead moving to control plane. So if you happen to have any automation that is triggered off of that particular label, and I'll say custom machine config pools, for example, that are triggering off of it, just be aware that it is probably going to change at some point in the not too distant future, and be using control plane instead. Let's see, from Alan: so I can create the pool and select by node? Just to make sure I'm not crazy. We're all crazy, it's okay. I'll add a label to the node and select it with my machine config pool. Yes. One thing to be aware of is that you generally don't want to have a small number of nodes, and by small number I mean especially one, in a machine config pool. And the reason for that is, if we come over here and look at our machine config pool, and it's not in here, but we should be able to look at, I'm going to see if I can cheat, custom resource definitions, machine config pool, I want to do this because I want to use this thing over here. If we look in the spec and go to maxUnavailable: it's not defined in most of the default machine config pools, because the default is one, but this is what controls the number of nodes that will be affected simultaneously by machine config pool actions. For example, let's say we update OpenShift, which results in a new version of CoreOS being pushed out to the nodes. This value determines how many of those nodes are affected simultaneously. By default it's one, so in a machine config pool where maybe I have 10 machines, it would go through them one at a time. That may be slow, but it's reliable. You can bump that up, and you can use either a number, so 2, or a percentage, 20%, and it'll affect two nodes at the same time, or however many nodes you specify in there. So the reason I say having a machine config pool with one node inside of it is a bad idea is this: let's say that inside of my cluster, right now, I've got three worker nodes, and I've got three machine config pools, one for each of those worker nodes, because I'm doing something special, configuring each of them uniquely. Well, when I go to do an update of my cluster, what could happen is, effectively, all three of those machine config pools say: I can take down up to one of my nodes; I've got one node, I'm going to do my thing with it. And now all three of my worker nodes just went down, or just had some major action happen on them, and the only thing protecting your application at that point is something like a pod disruption budget. Instead of making sure that there are other nodes in the cluster still running, still capable of hosting application workload, that protection wouldn't be there.
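Just to make the shape of this concrete, here's a rough sketch of the custom pool pattern we keep referring to, an infra pool that inherits the worker configs, with maxUnavailable bumped up; the pool name and the values are only examples:

    # a rough sketch of a custom pool inheriting worker machine configs;
    # the name and the maxUnavailable value are just examples
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: infra
    spec:
      maxUnavailable: 2                  # default is 1 when unset; "20%" also works
      machineConfigSelector:
        matchExpressions:
          - key: machineconfiguration.openshift.io/role
            operator: In
            values: [worker, infra]      # render worker configs plus infra-only ones
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/infra: ""   # applies to nodes carrying this label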
So it's one of those things. Usually, when folks ask about having a machine config pool per worker, or per node, it's about static IPs and stuff like that, and I have always suggested doing that outside of the machine config paradigm. Whether that's at install time, for example using kernel parameters, or booting to the live ISO and using nmcli, or using NMState, now, with OpenShift 4.10 and bare metal. So Alan, I feel like you might be leaning that way; you just said, "I was going to do that." So if you're using bare metal, whether it's bare metal IPI or bare metal UPI, if you're installing to physical servers, NMState is supported with OpenShift 4.10. We talked about NMState a few streams ago; I'll see if we can find a link. Yeah, I got it. Okay, thank you. NMState effectively gives you a Kubernetes-native way of managing the network configuration of those hosts, so you can do things like go in and set static IPs on secondary and tertiary interfaces, adjust routes if you need to, and all the other networking stuff that you would normally do. All right. Thank you, Johnny. Yep, no problem. I just realized I don't even have a stream up; I've got the interface here, but normally I have Twitter, or Twitch rather, up. And it's the Compassionate Language project, so thank you, Chandler. Yeah, thank you. I just posted that link that Johnny found, for the NMState livestream, into Twitch; it should get rebroadcast across the others. All right. And thank you, Stephanie, also for publishing that.
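While we're on NMState, here's a small sketch of what one of those policies looks like, say a static IP on a secondary interface; the hostname, NIC name, and addressing are all made up for the example, and double-check the apiVersion against your operator version:

    # a small sketch of a kubernetes-nmstate policy; the hostname, NIC name,
    # and addressing are made up for the example
    apiVersion: nmstate.io/v1
    kind: NodeNetworkConfigurationPolicy
    metadata:
      name: eth1-static-ip
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-0    # target a single node by hostname
      desiredState:
        interfaces:
          - name: eth1
            type: ethernet
            state: up
            ipv4:
              enabled: true
              dhcp: false
              address:
                - ip: 192.168.10.20
                  prefix-length: 24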
Okay, so let's talk about today's topic. Again, we touched on this at the start: best practices are not required, and they're not even recommended in every situation. We should still take a critical eye and evaluate whether or not a practice should be applied to each one of our scenarios. When I was a storage admin, and when I worked for a storage vendor, we always used to talk to folks about performance, and it was, "what are the best practices for configuring my storage with VMware," or something like that. And here are 20 things that we see commonly done, but the recommendation was always: do these one at a time, and test before and after, to make sure each one has the intended effect. Because it might not; it might be detrimental, because your workload is different, because your storage access pattern, because of your applications, is doing something else. So, much like that, I think everything we'll talk about today should always be individually considered. But these are very common things that we see applied, things that both of us, me having been in the BU for the last three years, Johnny having been a consultant for a long time, saw a lot of customers doing. And remember, every one of these things is stuff that we solicited, that we got back as responses from our field folks saying: yeah, we see our customers doing this, and in a lot of cases, we encourage our customers to do this. So, all right, any favorite ones you want to start off with, Johnny? Do we just want to go with number one, or do we want to start at the other end, with number 10? I did mention that, didn't I, that we should do a David Letterman top 10 type of thing. But no, we don't have to go linearly, we can jump around a little bit. Yeah, so let's start with the suggestions up here. From an architecture perspective, there are a couple of important things that were brought up. One, it was recommended by several folks not to use stretched clusters. That's something we have talked about here on the stream several times before. With stretch clusters, logically, a lot of people like that concept, that idea, right? I have multiple data centers, and if all of the latency and throughput requirements are met for OpenShift and for etcd, it makes sense: I can lose an entire data center. And if I lose that entire data center, because I still have resources in other data centers, I keep functioning; I keep being able to utilize my cluster. Sorry, I'm multitasking, that's why I'm talking intermittently; I'm trying to bring up the link here. So, Red Hat published this KCS, and you can see it's updated regularly and frequently, to cover recommendations around spanning clusters across multiple sites. Inside of here it doesn't actually say "don't do this"; I would say it recommends against it, but it's still a supported configuration, you technically can do it. So why do we suggest not doing this, why do we recommend against it? In my opinion, and I think in many other folks' opinion, it's because of the increased complexity. We think we're protecting ourselves from data center failure, right? If my whole cluster is in one data center and that data center fails, now my whole application is down, whereas if it's spread across three data centers, then I can lose one and everything continues to function. But the problem is there's a lot of complexity that comes in, because now I have to figure out how to keep the networking, at a minimum probably the LAN, if not the WAN, working across all three sites. If I encounter latency spikes, or storage issues at one site, or any number of other things, suddenly I end up in a scenario where it's not one big event that affects my uptime, it's a thousand little events. I lost five minutes here, I lost two minutes there, I had this spike in latency here, and stuff like that. A long time ago I worked on a military base, and we literally had a tank run over the pipeline that the fiber was running through, and it squished the whole thing. Otherwise everything was perfectly fine; the other data center was like a kilometer away. It's things like that that you can't predict. So while there's nothing inherently wrong with it, and it's not unsupported, it often doesn't achieve the results that you want. I'll pause there. Johnny, thoughts? I completely agree. I think the more layers you add to something that's already complex, the further and further it gets away from you, and then trying to debug something that's going wrong gets harder and takes longer, whereas if you add a little bit more isolation, it narrows the scope a little. I agree. Again, if stretch clusters are something you're interested in, just carefully evaluate what your end goal is, what you're trying to protect against, and whether there are other, better ways; I say "better" strangely, because better is subjective. Usually our most frequent recommendation is multiple clusters, using something like GitOps, or Ansible, or ACM, or whatever your tool of choice happens to be, to replicate or duplicate that application functionality. It really comes down to the level of availability you have to have for that application; that's your starting point, and then you work backwards from there and figure out how to get storage and everything replicated. It's a hard problem to solve. Another one that I thought was interesting, and another one we have talked about here on the stream before, is doing backups of etcd. This is something we hear kind of both ways on, frequently: "oh yeah, we always take backups of etcd," or "we never take backups of etcd." I fall into the camp of: you should have them, but be aware of their best use case.
If you are wanting to fully recover a completely lost cluster using an etcd backup, that's not the ideal use case; you're probably going to have a bad time. On the other hand, if your cluster is very large, think dozens of nodes and thousands of pods, or hundreds of nodes and tens of thousands of pods, and your etcd database ends up being very large, having a backup can make recovering from an etcd node failure dramatically easier. A couple of weeks ago we showed how to do that etcd recovery, despite me borking the DHCP portion of it, and I think I mentioned at the time: if I had an etcd backup, I could restore that backup to the new node, and then the existing nodes would only have to re-sync the data that changed after that backup was taken. If my etcd database is multiple gigabytes in size, replicating all of that data to a new node can tax what might be an already strained storage subsystem or network subsystem. So etcd backups are useful for recovering individual nodes, you can even recover, like, two nodes, but if it's a full cluster loss, then I would say you're better off creating a whole new cluster and recovering the workload on top of it, so redeploying the applications, reintroducing the PVCs, all that other stuff. And for our audience, please don't hesitate to weigh in on any of these; I'm interested to hear your experiences and see if you have any thoughts or things that maybe contradict, or maybe concur with, what we've seen and what we're saying. Yeah, and just to go a little deeper on that: make sure that you're backing up your configs. So if you're creating YAML by hand, right, for your PV or whatever, keep that stored somewhere. Put it in Git, preferably; put everything that you can in Git, and that way you can have some level of automation getting it back, maybe write some playbooks, or use GitOps. Yeah, that's an interesting concept, Johnny; have you ever seen folks doing that, not doing GitOps, but using Git to track the configuration, as in, when we last deployed it, it looked like this? Yup, that's exactly what they'll do: they'll essentially do an oc get -o yaml, dump it down to a file, take out all of the metadata stuff, and then essentially pump that right back into Git, and then pull it and oc apply -f on the directory. Yeah, logically that makes sense to me; it seems like an easy way to protect that information without having to, where's Christian when I need him, without having to make the leap all the way into the GitOps philosophy. Yeah, because GitOps is, it's not new, but it's newer, and I think there are still a lot of people trying to get on the DevOps train, trying to get that whole wagon moving the right way, and now there's GitOps. So I think this is kind of that nice little intermediary, because at the end of the day you just need to have a backup, or, I don't want to say a backup, you need to have a copy of it somewhere, so that you can restore from at least a known-good source instead of trying to hand-jam it out from memory, which, for me at least, is going to be terrible.
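If you've never done Johnny's dump-and-clean routine, the idea is just to delete the server-populated fields before committing. Here's a hypothetical result, sketched on a trimmed Deployment; the app name and image are made up:

    # a sketch of a cleaned-up "oc get -o yaml" dump, ready to commit to Git;
    # the app name and image are hypothetical
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
      namespace: my-project
      labels:
        app: my-app
      # removed before committing: uid, resourceVersion, generation,
      # creationTimestamp, managedFields, and the entire status: block
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: my-app
              image: quay.io/example/my-app:1.2.3

Then an oc apply -f on the directory puts it all back, just like Johnny described.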
Yeah, Git. I will fully admit, and Christian makes fun of me all the time for this: Git is intimidating. I always make fun of Christian, and I tell him Git is what happens when engineers just run amok, right, when they have no product manager or anyone like that to say, "should we really be doing that, that seems like a bad idea." There are, like, 9,000 subcommands and options and all this other stuff. The last time I developed anything, wrote any code, SVN was still state of the art. So instead of having to go through this whole process of adopting GitOps, and having to figure out how to do pull requests, and reviews, and cherry-picking, and all of this other stuff, which is intimidating, especially for us administrators, or for me, I should say, I don't know about everybody else, it can be a slippery slope, if you will. Yeah, no, I totally agree, and outside of a normal Git workflow, the git add, commit, push type thing, it can be black magic to a lot of people. And even where I feel like I'm decent at Git, I'm still constantly asking for help all the time; there's just so much to it. And Chandler made a good point about Kustomize; Kustomize is another great way, very similar to what we're just talking about, right, you can have just a kustomization file with your changes and then apply that. Yeah, so that's a good call as well. So I see there, Chandler, no, Shrikant, the idea of developer control planes. It's interesting you bring that up; I recently saw a presentation about a thing called kcp, let me bring up the GitHub here. So kcp is, I'm trying to think of the best way to describe what it does: it abstracts the Kubernetes cluster and the Kubernetes API from, or let me rephrase that, it abstracts the Kubernetes API and the Kubernetes workload from the underlying Kubernetes cluster. So basically, I create a workspace in kcp, and it is a Kubernetes API endpoint that I, as the user, have full control over. I can do things like deploy my own operators, my own CRDs, inside of there, and all this other stuff, and it offers the ability to basically give each one of those developers, or developer teams, their own control plane to do what they need with. So I thought this was super cool; engineering folks inside of Red Hat are experimenting with it, playing with it, and I thought it was a really cool presentation, seeing what's happening there. Did we post that? Let me post it here so we send that out. It's just something to keep an eye on, right; nothing anytime soon that I would expect to see from Red Hat or from OpenShift on that, but if it's something that's interesting to you, Shrikant, keep an eye on it. "I don't have experience, but I strongly recommend against a total control plane loss." Yes, that, wholeheartedly agree. Our hope 9 asked: "I want to be a Red Hat engineer." Phenomenal. I highly encourage anybody and everyone who is interested in joining Red Hat; we don't have separate internal and external job portals, everything is on jobs.redhat.com. I'll post the link, redhat.com, maybe jobs.redhat.com, there we go. You can tell I haven't been to this site in a while; I'm happy with what I'm doing. So yeah, again, I encourage everybody: if you're interested, come here, look for any roles that are interesting to you, and if you have any questions about a role, like, what
does this mean, or what does this role actually entail, you're welcome to reach out to us to explain terminology and stuff like that; we don't mind at all. Yep. "You can use GitOps in reverse, where you test changes to see the outcome, then update Git as collective memory for recovery purposes." Yes, absolutely. That's what I used to do way back when I was a junior vSphere administrator: we would create PowerShell to make things look like how we wanted, basically implement it, and then take that PowerShell script and put it into an SVN repository. Let's see, working down our list here. So, one of the ones that I thought was super interesting, and it actually came up four or five times from different folks, is the concept of onboarding. It's not an action that we take as administrators, right, it's not "apply this YAML and do this thing"; it's rather helping our developer friends and application teams understand: hey, you're using Kubernetes, do you know what that means? Do you know how to effectively use the features and functions of Kubernetes? And I tried to find out whether there was some collateral inside of Red Hat that we could share, like, here are 10 things you should show your developers to help them understand things like quotas, and we'll talk more about those in a minute, requests and limits for pods, and all of this other stuff, so that we don't accidentally paint ourselves into a corner through inadvertent misconfiguration at the application level. Yeah, I actually had a customer that was onboarding a big monolithic application; they were just trying to get it into OpenShift, and they ended up essentially taking this entire application, all 32 gigs of RAM and 8 CPUs and everything, and just plugging it in and running it. It was insane the way they were doing this, but yeah, the 16-gig, 4-CPU nodes just ate it right away; they took the cluster down almost immediately. Yeah, it's definitely one of those, and here's one of the examples that somebody sent us: if you have a cluster, but users don't understand how to use requests, limits, probes for readiness and liveness, pod disruption budgets, more than one replica, and so on and so forth, then effectively, as an administrator, you're now limited. Because, let's say they created just a standalone pod deployment: there's no scale, there's nothing that recovers it, per se, if something changes. Hey, I need to do an OpenShift update, and it does a cordon and drain of the node, and that pod might not come back, right? We want to use things like deployments and replica sets to ensure that there is always a minimum number running, and that number should be greater than one in many cases. So having that onboarding process, having just a basic Kubernetes 101 course for developers and application teams who are new to Kubernetes, is something that I could definitely see being very, very important.
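Since this keeps coming up, here's a bare-bones sketch of those basics in one place, more than one replica plus readiness and liveness probes; the image, port, and path are placeholders for whatever your app actually exposes:

    # a bare-bones sketch of the onboarding basics: multiple replicas plus
    # readiness and liveness probes; image, port, and path are placeholders
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hello
    spec:
      replicas: 3            # more than one, so a drain or node loss isn't an outage
      selector:
        matchLabels:
          app: hello
      template:
        metadata:
          labels:
            app: hello
        spec:
          containers:
            - name: hello
              image: quay.io/example/hello:latest
              ports:
                - containerPort: 8080
              readinessProbe:    # don't send traffic until the app says it's ready
                httpGet:
                  path: /healthz
                  port: 8080
              livenessProbe:     # restart the container if it wedges
                httpGet:
                  path: /healthz
                  port: 8080
                initialDelaySeconds: 30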
Tiger: "join us, and I might take you axe throwing." So, for anybody who doesn't know, Tiger was our intern back in 2019, before the pandemic and all that happened, and one of the team building exercises we did was to go axe throwing here in downtown Raleigh. I'm terrible at it; as you can tell, I'm super athletic, right, that's my thing. So I was terrible at it, but we had a lot of fun doing it. I was very surprised that the axe throwing place served alcohol; that seemed like a huge liability to me. Yeah. Who, why, hahaha. Our hope 9: "SVN is a slightly better RCS." Yeah, that seems accurate. "The team disassembled?" Yeah, we grew quite substantially, actually; I think we were 8 or 9 at that time, and now the team is 25 or so. Good times, good times, back in the before times, right, when we all got out of the house more often. My wife was making fun of me because we went to Costco this past weekend, and I was like, I don't think I've left the house for more than walking the dogs in the last week, or maybe even two. Yeah, the struggle is real, man, the struggle is real. So, continuing on with our previous onboarding thought process, let's expand on that a little bit: what are some things that we think most, if not all, applications should be using in our pod definitions, our deployment definitions, whatever it happens to be that we're using for our applications? The two that came up most frequently were requests and limits, but there was some controversy here. So, a request is tantamount to a reservation, from the scheduler's perspective. It doesn't actually reserve those resources on the host; it reserves them from the scheduler. Let's say I have a node with 8 gigabytes of RAM, and I create a pod with a request for 4 gigabytes of memory. The scheduler will say 4 gigabytes of memory on this node are consumed, even if that pod is only using 128 megabytes. And we see this fairly often, where I'll have a node that is out of capacity, right, I can't schedule to it because it's out of capacity, but it doesn't look that way when you look at the metrics or something like that. So we can actually look at this: if I go here to the dashboards and look at, for example, my nodes, I think I want this guy, right, my CPU usage is at like 2, my CPU quota down here is all very low, my memory usage is all relatively low. But if I switch over to the CLI and do an oc describe node, this is a worker node, we'll go with that guy, down here at the bottom I've got this requests and limits section, and you can see that even if nothing is going on on this host, 24, 25 percent of the memory is set aside and can't be allocated. So we'll see this where a node is at 95, 98 percent memory requests, even though actual node utilization, if I do an oc adm top nodes, we can see here, actual memory utilization on this node is only at 21 percent, and CPU is at 5 percent, even though 36 percent of the CPU is tied up in those requests. So we have to strike this balance. The feedback we got from the field was: we strongly encourage everyone to use limits and requests. And then we got some more feedback, and it was: limits are great, requests are sometimes okay. And then we got some more, and it was: requests are great, but limits are terrifying. It was a lot of mixed feedback, and the reason for that is that it can have different effects on different applications. So for example, let's say that I'm running a Java application, and in my Java virtual machine I set a minimum memory of 512 megabytes of RAM and a max of 2 gigabytes. In that instance, it probably makes sense to set both a request and a limit to those values, plus or minus some percentage for the various other things that are happening.
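Roughly like this; the numbers here are made up around that hypothetical JVM, requesting near the heap floor and capping memory a bit above the heap ceiling:

    # made-up numbers around that hypothetical JVM: request near the heap floor,
    # cap memory a bit above the -Xmx2g ceiling to leave room for the JVM itself
    apiVersion: v1
    kind: Pod
    metadata:
      name: java-app
    spec:
      containers:
        - name: java-app
          image: quay.io/example/java-app:1.0   # placeholder image
          resources:
            requests:
              cpu: 500m
              memory: 768Mi    # around the -Xms512m floor plus overhead
            limits:
              memory: 2560Mi   # past this, something unexpected is going on
              # no cpu limit here on purpose; see the startup caveat below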
Because if it's going over that limit, there's probably something wrong; something unexpected is happening, or something that we don't want is happening inside of there. But other times, and I'm thinking about you, Python, suddenly the developer does something crazy, they made a mistake or something happened, and we just loaded a 30 gigabyte object into memory, or we unpacked this massive array, or this JSON object came back and it was 600,000 lines instead of 6, something like that. That is also unexpected behavior, and if it unexpectedly consumes dramatically more memory or CPU than it should, do we want to limit that? Sometimes yes, sometimes no. Again, picking on Java: say I limit the CPU, so I put in a limit that says you cannot consume any more than one CPU, 1,000 millicores, but it's a very large Java application. If you don't know, when Java first starts up, it loads all of the classes and all the other stuff that it needs, and it consumes a lot of CPU up front. Maybe with unlimited CPU it takes this long, and with one CPU it takes this long, but there's a threshold; eventually it will fail, it will say "I can't complete this task" and terminate the process. It will just fail to start. So now I am at the point where, because I limited it, my application can no longer start. So it's one of those things where, as a developer, as an app team, as an administrator, we have to be aware of it, and we have to work with each other to understand what those requirements are, and we have to be careful with blanket statements like "use requests and limits," because sometimes they're not good, sometimes they're great. "Just a great link to have bookmarked when you need to troubleshoot an OpenShift 4 issue." I have a feeling, our hope 9, that this is the generic troubleshooting, yep, the consolidated troubleshooting article for OpenShift. That's a great one. The link that our hope 9 just posted takes a bunch of KCS articles, all related to troubleshooting things in OpenShift, and puts them into one place, so you can link out from there. So definitely bookmark that one if you don't already. Now, for all of those same reasons, limit ranges were just as controversial. If you don't know, a limit range will set a default, as well as a minimum and maximum, request and limit for your pods. So even if the developer doesn't know, or doesn't take the time to set, say, a limit, the limit range will automatically enforce one.
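For reference, a limit range is a namespaced object, something like this; the numbers are arbitrary examples:

    # a limit range sketch; the numbers are arbitrary examples
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: resource-defaults
      namespace: my-project
    spec:
      limits:
        - type: Container
          defaultRequest:    # request stamped onto containers that set none
            cpu: 100m
            memory: 256Mi
          default:           # limit stamped onto containers that set none
            cpu: 500m
            memory: 512Mi
          min:
            memory: 64Mi
          max:
            cpu: "2"
            memory: 2Gi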
So where I'm building with all of this is, and this is Andrew's opinion, quotas are the real hero here, because I can set a per-namespace quota, and then if my application team makes a mistake, or does something like that, they're only affecting their other application components; they probably won't affect the entire rest of the cluster, or a lot of the other services running inside of there. It was something that came up again from multiple people when we asked the question: use quotas, implement quotas from day one. And, I think we showed this in one of the early streams, I'll have to dig up the link and put it in a blog post, you can set a default quota that is automatically applied, using the new project template. Do I have the docs up anywhere here? I do. So if we look for the new project template, and we go to this "configuring project creation" page, I'll post this into Twitch here. This is effectively a template for what happens when we create a new project; it happens to be implemented as an OpenShift Template object, but think of it as a template in the traditional sense. You can add in any number of objects for it to create, including quota objects that get applied to every single namespace when it gets created. There are some secondary issues that occur there: technically, depending on the permissions that the user has in the namespace, they could delete or modify their quota. I think the Community of Practice has an operator for that, where they enforce quotas; you tell the operator what you want the quota to be, and it makes sure that even if it gets changed, it gets reverted to what you want it to be. But yeah, that's one way you can go about mandating quotas inside of your projects.
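Here's a trimmed sketch of that project request template with a quota baked in; oc adm create-bootstrap-project-template gives you the full starting point, and the quota numbers here are arbitrary:

    # a trimmed project request template with a default quota baked in;
    # the quota numbers are arbitrary examples
    apiVersion: template.openshift.io/v1
    kind: Template
    metadata:
      name: project-request
      namespace: openshift-config
    objects:
      - apiVersion: project.openshift.io/v1
        kind: Project
        metadata:
          name: ${PROJECT_NAME}
      - apiVersion: v1
        kind: ResourceQuota
        metadata:
          name: default-quota
          namespace: ${PROJECT_NAME}
        spec:
          hard:
            requests.cpu: "4"
            requests.memory: 8Gi
            limits.memory: 16Gi
    parameters:
      - name: PROJECT_NAME
      - name: PROJECT_DISPLAYNAME
      - name: PROJECT_DESCRIPTION
      - name: PROJECT_ADMIN_USER
      - name: PROJECT_REQUESTING_USER

You then point the cluster at it by setting projectRequestTemplate in the cluster's project configuration, which that "configuring project creation" doc walks through.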
This is the approach that we've taken with pretty much, I don't want to say every single customer, but when you talk about having a customer with multiple users, it plays into the multi-tenancy. So it's: go ahead and create this new project template, and that way you can come up with some sane defaults. Start there, that's your baseline, and then you work your way up. But going back to the limits and requests and limit ranges and all that: it's a fluid discussion between the admin and the end user, and that way you can work together to tune it and get it towards just right. But it's okay to come up with basic defaults and then build up from there. Yeah. And in my opinion it leads into, and I don't think I actually put this one in our notes document, Johnny, so, surprise, I think one of the critical components of this is monitoring. We do have an item here to talk about monitoring, or alerting rather, in just a moment, but having the monitoring in place so that you know when these things happen, so that you can take action, whether that action is "hey, we really do need to increase the resources in the cluster," or "I need to go and have a conversation with this team and figure out what's going on over there, like, why do they keep doing thing X," whatever it happens to be. One of my favorite stories: when I was a storage admin, we had a developer who made a mistake and created an infinite loop. They were trying to read from a file using Perl, and it errored out and went into an infinite loop of errors, and they ended up creating, like, a 10 or 12 terabyte text file that was just the same four words over and over and over and over again. I get a call at like 1 a.m., because throughout the day the file system had been filling up, and the help desk finally saw it on the monitor: this thing is at 88% full, what do we do? Well, does it look like it's going to stop? Well, it's been building since about 7 o'clock, and it was at 60% then. Okay. And we ended up, it was a NetApp, expanding the aggregate to add more capacity, to keep the whole thing from going offline, and then we basically had to use various tools to figure out what the process was and kill it. That's awesome. Perl. Christian: "anytime someone mentions GitOps anywhere, I'm notified." It was nothing but good things, I promise. We love GitOps, Christian. Let's see, a couple of questions here. Divya: "please let me know the difference between a bastion node and a service layer NAT gateway." I'm not sure what "service layer NAT gateway" would be referring to; Johnny, do you have any thoughts on that? The only thing I can think of is the NAT gateway in AWS; I'm not sure if that's what they're talking about. Yeah, if you can clarify what you mean by service layer NAT gateway. A bastion node is usually, literally, a node that exists inside of that environment. So if we're talking AWS, it would be inside of that, gosh, what's the name, the VPC. It's a node inside of that VPC that we can connect to as an administrator or user, and then access other resources inside of that VPC. "What would you recommend the scheduler be set to, pack or wide?" So this is a hard one, right, and it's again one of those that always depends. With virtualization, we always think go wide, right? We want to distribute virtual machines across all of the nodes, so that if any one node fails, it doesn't take down an unusual amount of resources. I think with Kubernetes, by and large, we probably think the same, where I want to evenly distribute my workload across all of my nodes, so that I don't end up in a scenario where the failure of one node results in a lot of extra work or impact. And it also gives you some more flexibility: this pod on node 7, which is only 30% utilized, has a lot more headroom before it needs to burst up in CPU or memory utilization, something like that. So, the scheduler profiles, which are effectively what I think we're talking about here: LowNodeUtilization does its best to spread pods evenly across nodes, while HighNodeUtilization places as many pods as possible onto as few nodes as possible. My opinion is that the latter is really useful when we're in a hyperscaler, because in a hyperscaler I pay for 100% of that instance, its CPUs, its memory, whether I'm using 1% of it or 100% of it. So if I can drive utilization of the individual nodes as high as possible, and prevent additional AWS, Azure, Google, whatever, instances from being provisioned, I am, in theory, saving my company money, and then I can rely on things like cluster autoscaling to add more resources when it's necessary.
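Switching profiles is a one-field change on the cluster Scheduler resource, roughly:

    # the cluster scheduler resource; profile is the only field shown here
    apiVersion: config.openshift.io/v1
    kind: Scheduler
    metadata:
      name: cluster                  # singleton, always named "cluster"
    spec:
      profile: HighNodeUtilization   # or LowNodeUtilization (the default), or NoScoring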
Namespace operator, oh yeah, thank you, Johnny, we'll post that link in just a moment. So Johnny dug up the namespace configuration operator from the Community of Practice, the Red Hat field folks, that I mentioned before; among other things, it will do quota enforcement. There was another question up here: Alan asked whether the ingress controller is in charge of, basically, what's managing the keepalived and HAProxy. That is the ingress controller. I don't know if we have exposed some of the advanced functionality, the functionality that we used to expose, or allow people to modify, in OpenShift 3; I don't know if it's exposed through the operator yet. Yeah, I'm not sure either. All right, let me rephrase that: the ingress controller manages the ingress HAProxy. If you're referring to keepalived, it depends on which keepalived we're talking about. With an on-prem IPI deployment, the API and ingress virtual IP addresses are managed by separate keepalived instances, and their config is actually controlled through machine config. If we look at github.com slash openshift slash machine, well, I'm not going to remember what it is, so we'll search for machine config. If we look for the machine config operator here, we can see specifically what it's going to use: if we go to templates, then common, then on-prem, then files, we have this keepalived config. This is where we can see the keepalived config for those API and ingress VIPs, so they're controlled separately. There are also two other concepts. One is the keepalived operator from the CoP; it's very similar functionality, a very similar thing, right, it uses keepalived to manage what is effectively an ingress IP. So if we come back over here to our docs, ingress IP, so ingress IP with an external IP: effectively, that keepalived operator from the CoP manages this external IP that we're associating with a service, so we can have that ingress IP functionality. So, kind of different functionalities, different ways to control those things, depending on what specifically you're doing inside of there. Let's see, I'm working my way up the questions here. Yep, Kontess is asking, he's having some deployment issues: essentially, the first control plane node is coming up and getting its config, which looks like it's being set correctly, and then the second and third control plane nodes are essentially getting "config not found," with no host config getting set up. I would be curious what deployment type: UPI, IPI, non-integrated? Okay, UPI vSphere. Are they all on the same subnet, all connected to the same network? Is there a firewall or something like that between them, or some other network configuration? So, the way it works behind the scenes is: the bootstrap node comes up, and it starts a proto control plane and, most importantly for the next step, a single-node etcd instance and the machine config server. Then the control plane nodes come up, and they look to that machine config server to get their configuration, and that includes: here are all the pods you need to start, here's how to start the etcd operator. When it all works according to plan, the bootstrap effectively hands over etcd and the control plane functions to the new control plane, and the new control plane takes over. So it sounds like, for whatever reason, there may be a network communication error, maybe a network config issue. If you're using static IPs, check to make sure that things like subnet masks and gateways are set correctly, and make sure there's no IP conflict; I've done that myself in my lab I don't know how many times, "oh, it doesn't ping, it must be available," no, it's not. When I see network communication issues, usually it's things like firewalls that get in the way, or misconfigured routers. But yeah, Johnny's suggestion there of checking the logs on the bootstrap node: you can SSH to the bootstrap, it's ssh core@ the bootstrap IP address, using the SSH private key that maps to the public key you provided in install-config.yaml, that was hard to say. Once you get there, the message of the day that prints when you log in has a journalctl command you can use to see the bootkube logs, and that'll tell you what's going on behind the scenes. Let's see, the chat's jumping around on me. Yeah, so one of them was, and I'm going to butcher this username, iwillkicka67: he said they over-provision their nodes a lot, and they see probe issues when things get near max CPU on nodes; are there any suggestions on that front? They have set their kube-reserved, but he thinks that, because they're over-provisioning, with those settings there's basically nothing really in place to protect them. Yeah, that would be expected, I think. When you say over-provision, are you over-provisioning pods on your Kubernetes nodes, your OpenShift nodes, or OpenShift nodes on your hypervisor nodes? I think it's probably the former, judging by how you describe it. So, one of the things you can optionally do there, and we talked about this, gosh, where is that documented, I think it's under nodes, I'm looking for the documentation on it, is to have it auto-configure the system-reserved memory and CPU. I'll see if we can dig that up; we did talk about it on the stream before, and I'll have to see if I can find where. That might be one option to maybe eliminate, or alleviate, that.
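That auto-sizing knob lives on a KubeletConfig; here's a sketch of what I believe it looks like, targeting the worker pool:

    # a sketch of the auto-sizing knob for system-reserved, aimed at the worker pool
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: dynamic-node-sizing
    spec:
      autoSizingReserved: true       # let OpenShift compute system-reserved per node
      machineConfigPoolSelector:
        matchLabels:
          pools.operator.machineconfiguration.openshift.io/worker: ""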
Let's see, I'm working my way up the questions here. Yep, Contest is asking — so he's having some deployment issues. Essentially the first control plane node is coming up and getting its config, it looks like it's setting up correctly, and then the second and third control plane nodes are essentially "config not found," no hosts getting set up. I would be curious what deployment type — UPI, IPI, non-integrated? Okay, UPI vSphere. Are they all on the same subnet, all connected to the same network? Is there a firewall or something like that between them, or some other network configuration? So the way it works behind the scenes is: the bootstrap node comes up, it starts a temporary control plane and, most importantly for the next step, a single-member etcd instance and the machine config server. Then the control plane nodes come up and they look to that machine config server to get their configuration, and that includes "here's all the pods that you need to start," "here's how to start the etcd operator," and so on. When it all works according to plan, the bootstrap effectively hands over etcd and the control plane functions to the new control plane, and the new control plane takes over. So it sounds like, for whatever reason, there's a network communication error, maybe a network config issue. So, network config: if you're using static IPs, check to make sure that things like subnet masks and gateways are set correctly, and make sure there's no IP conflict. I've done that myself in my lab I don't know how many times — "oh yeah, it doesn't ping, it must be available." No, it's not! For network communication, usually it's things like firewalls that get in the way, or misconfigured routers. But yeah, Johnny's suggestion there of checking the logs on the bootstrap node: you can SSH to the bootstrap — it's ssh core@ the bootstrap IP address, using the SSH private key that maps to the public key you provided in install-config.yaml. That was hard to say. Once you get there, the message of the day that prints when you log in has a journalctl command that you can use to see the bootkube logs, and that'll tell you what's going on behind the scenes.

Let's see, the chat is jumping around on me. Yeah, so one of them was — I'm going to kill this username — iwillkicka67 said they over-provision their nodes a lot, and they see probe issues when things get near max CPU on the nodes. Are there any suggestions on that front? They have set their kube-reserved, but he thinks that because they're over-provisioning, those settings mean there's basically nothing in place to protect them. Yeah, that would be expected, I think. So when you say over-provision, are you over-provisioning pods on your Kubernetes nodes, on your OpenShift nodes, or OpenShift nodes on your hypervisor nodes? I think it's probably the former, judging by how you describe it. So, one of the things that you can optionally take advantage of there — and we talked about this, gosh, where is that documented, I think it's under nodes; I'm looking for the documentation on how to have it auto-configure the system-reserved memory and CPU — I'll see if we can dig that up. We did talk about it on the stream before; I'll have to find where. That might be one option to maybe eliminate, or alleviate, that. But ultimately, if there's a lot of contention, especially CPU contention, there's not a lot you can do in that respect to have it, for example, prioritize liveness and readiness probes or something like that. Instead you'll have to rely on the system to do things like evict pods that are over-consuming resources and let them be rescheduled. But yeah, over-provisioning — or, as we sometimes call it, over-commitment or over-subscription — the concept there is: I have four CPUs on my node, and maybe I'm not setting a request (a request is like a reservation), and my pods are using less than four CPUs of resources, but maybe they have limits that, when I total them together, come to 12 CPUs or 24 CPUs or whatever it happens to be. So there exists this possibility that the pods, the workload, can consume all of the CPU resources on the node. Aside from protecting the kubelet's resources and some of the other things, the workload is usually left to its own devices. And even with a request, it's not a hard reservation at runtime; if some other pod is running and over-consuming resources, a pod without a request would just be a lower priority and therefore be evicted before one that has a request and a limit set. That's a very roundabout way of saying: unfortunately, no. We got there, yeah.
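To picture that overcommit math, here's a sketch of a Deployment whose requests fit comfortably on a four-CPU node but whose limits total far more than the node has. All names and the image are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overcommit-demo
spec:
  replicas: 6
  selector:
    matchLabels:
      app: overcommit-demo
  template:
    metadata:
      labels:
        app: overcommit-demo
    spec:
      containers:
        - name: app
          image: registry.example.com/app:latest   # hypothetical image
          resources:
            requests:
              cpu: 250m        # scheduler reserves only 1.5 CPUs across 6 replicas
              memory: 256Mi
            limits:
              cpu: "2"         # limits total 12 CPUs -- triple a 4-CPU node
              memory: 1Gi
```

If all six replicas land on the same four-CPU node and burst at once, they can starve everything else on it — which is the failure mode behind those probe timeouts.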
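And on the auto-configured system-reserved resources mentioned a moment ago: recent OpenShift releases can size the kubelet's reservations automatically from node capacity. A minimal sketch, assuming a recent release and the standard worker machine config pool label:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: dynamic-node-sizing
spec:
  autoSizingReserved: true   # let OpenShift compute system-reserved CPU/memory per node
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""   # target the worker pool
```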
Lewis says: when clusters become large, with many pods and namespaces, tasks that were previously simple and easy begin to get complicated, such as updates. Yes — so that was one of the other things that came in, which was an opinion; we also asked our field folks for controversial opinions. The opinion is that we're seeing an uptick in the number of folks who are using smaller, more purpose-driven clusters as opposed to very large multi-purpose clusters, and it's for that very reason. It can be complicated from a management perspective: I need dozens of machine configs that all have to be updated, managed, revision-controlled, and all that other stuff, because I've got maybe five or six different node types that each need different configurations applied — a GPU node, a network-optimized node, high-memory nodes, storage nodes with ultra-fast resources — and it can get complicated very quickly. So the best I can offer there is anecdotal, and we heard this from Tushar when he was on the stream: we're seeing customers move towards a larger number of smaller clusters. The good news is, I guess, twofold. One, you aren't penalized for that from an entitlement perspective — control plane nodes and infrastructure nodes have no entitlements associated with them. If you're using AWS or Azure or something like that, you would have to pay for the additional resources used, but there is some good news there. The other side is that ACM is getting better and better and more scalable — I think they're up to 2,000 clusters or something now that they can manage — so ACM can help make manageability with many smaller clusters easier. Johnny, anything to add there? I agree, and I think that now, with the capability of having the smaller flavors of clusters — where you don't necessarily have to have three control plane nodes plus some number of worker nodes, we can have the compact cluster of three nodes, or a single-node OpenShift, that type of thing — with that coming out, and those services just getting better and better, it makes more sense to maybe just do it that way and deploy super small, so that you're only really using the capacity you need.

Yeah, I'm scrolling up through chat here. Gatesh, please do! You can ping us on Twitter, on Reddit, on email — andrew.sullivan at redhat.com — whichever works best for you. "We used to set them manually before the script existed and haven't switched" — yeah, that was for the system-reserved resources. Yeah, the script — I need to dig up where we talked about it on the stream, because there I walked through how it determines the resources it's going to set, which is in line, if I remember correctly, with Google's upstream documentation as well. Yep, thank you, Stephanie, for posting my email address there.

All right, so in the interest of time, let's dig through a couple of these. Let's see — so we talked about limit ranges, limits, requests; we talked about pod disruption budgets briefly, so I'll reiterate and drive home the point. Pod disruption budgets are important because they communicate to the scheduler how many of an application's pods can be affected by anything and everything at any point in time. Whether it's rolling out a new version of a deployment, whether it's cordoning nodes for an update process, or maybe it's a lost node, it uses that pod disruption budget to help prioritize when pods are restored and to help determine when we can or can't do maintenance operations on nodes, and so on. I always encourage folks to use pod disruption budgets, but again, sometimes we have to work with our application peers, because if they set them incorrectly we can end up in a scenario where we can't do maintenance activities. For example: I have three worker nodes, and they create a deployment with three replicas and a pod disruption budget that allows zero disruptions — all three must always be running, all of the time. Well, now I can't really do node maintenance operations, because to do maintenance on my node I have to cordon and drain it, and I'd end up in a scenario where not all of those replicas are available, and that will cause the rollout or the update to stall; it'll pause and wait.
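As a sketch of the healthier version of that example: a PodDisruptionBudget that tolerates one voluntary disruption, so node drains can proceed one pod at a time. The name and label here are hypothetical:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1        # allow one pod down at a time; a value of 0 blocks node drains
  selector:
    matchLabels:
      app: my-app          # must match the deployment's pod labels
```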
So, Stephanie, I see your chat here on the back end. For our audience: if anybody's having issues with hearing us or seeing us, just let us know, but I think everything's okay; it looks fine from what I've been able to see. Yeah, just let us know. Same for me — I was able to see it on YouTube and Twitch.

Let's see, so, some cluster configuration things that folks were suggesting to us. Create and use an admin account other than kubeadmin — this is a good one. There are two aspects to that. One, kubeadmin is useful in some scenarios, especially if, say, your certificates expire and you can't bring up the ingress controller or something like that. The installer generates that kubeadmin kubeconfig, so if I switch back over to my screen here — let's see, I want to share this window — if we look inside of here, this is the directory where I deployed my cluster from. We look in the auth directory, we have this kubeconfig, and I can use that to connect as kubeadmin — see, whoami — because that's what I'm using now. I can connect with that user at any time, but we probably want to rotate the password at a minimum, which we showed a couple of weeks ago; I'll include a link to that post from Andy Block, and you can regenerate this kubeadmin certificate that it uses. You can also create certificate-based logins, just like this one, that work for other users. So I can create an Andrew user or a Johnny user, and just like that kubeconfig, we see this big set of certificate data and all that other stuff in there — and yes, I'm okay showing this, because this cluster gets destroyed immediately after the stream ends. It enables authentication that sort of bypasses the traditional methods; if you've configured LDAP authentication or something like that, this is direct certificate authentication instead, which can be helpful. But we should still protect that account. We should definitely consider not using kubeadmin, because it's well known, and instead create a "johnny-admin" and give them the same certificate-based login. Just make sure that before you blow away kubeadmin, you actually have another cluster admin, because you do not want to be — I've done that to myself before, not with Kubernetes, with vCenter. Yeah, actually, I think when you and I worked together, you might have been the one who saved me. Very possible! "Hey, I accidentally removed myself as admin, can you add me back in?" Oh yeah, very likely that happened.

Moving down our list here: setting up time synchronization. Yeah, that one is very important for a number of different reasons. One, if it's off too far during cluster setup, it'll cause the nodes to fail to configure; it usually shows up as a certificate error saying the certificate isn't valid because it's too old or too new. Those are always fun: "this certificate is from the future." Yes. So time synchronization is important both at cluster setup and going forward — old-school basic admin principles, things like logging. If the time is off on your nodes, your aggregated logs are going to have weird, out-of-sync timestamps, and that makes troubleshooting and problem resolution really hard.
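On that time synchronization point, node NTP settings are typically laid down as a MachineConfig. Here's a Butane sketch (Butane compiles to a MachineConfig) along those lines, with a hypothetical internal NTP server:

```yaml
variant: openshift
version: 4.10.0
metadata:
  name: 99-worker-chrony
  labels:
    machineconfiguration.openshift.io/role: worker   # apply to the worker pool
storage:
  files:
    - path: /etc/chrony.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          server ntp.example.com iburst    # hypothetical internal NTP server
          driftfile /var/lib/chrony/drift
          makestep 1.0 3
          rtcsync
```

Run it through butane and apply the result with oc; a matching 99-master-chrony keeps the control plane in sync too.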
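And for the "admin other than kubeadmin" recommendation, the more common route than certificate logins is an identity provider plus an explicit cluster-admin binding. A minimal sketch using an htpasswd provider — the provider name, secret name, and username are hypothetical, and the secret is assumed to already exist in openshift-config:

```yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
    - name: local-admins          # hypothetical provider name
      mappingMethod: claim
      type: HTPasswd
      htpasswd:
        fileData:
          name: htpass-secret     # secret containing the htpasswd file, in openshift-config
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: johnny-admin-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin             # full cluster admin, same power as kubeadmin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: johnny-admin            # hypothetical username from the htpasswd file
```

Only after logging in successfully as that user should kubeadmin be removed.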
Walid: "need to check — blah blah blah — not setting limits, just setting a resource request but no limits, is apparently the cause of many side issues." Yep, yeah, that doesn't surprise me. Yeah, Hillary! For anybody who doesn't know, Hillary is the new co-host of the GitOps Guide to the Galaxy, Christian's stream. That's a good one — I'll look at that right after our stream ends here. Walid, I didn't know about that; I've got it linked up over here so I can check it out. Yeah, same.

Moving down the list: object pruning. Let me copy my already-existing link to the documentation. Object pruning comes in two flavors, if you will. One is pruning the registry, and that's normally what we think of when we hear the words "pruning" and "containers" together: pruning the registry, removing no-longer-used image layers and stuff like that, which is definitely important. But when we're talking about pruning in OpenShift and on the nodes, what we're talking about is also cleaning up unused objects and unused things on the host. Just like with the registry, we end up with unused layers over time: our nodes accumulate container images that are no longer used, new versions have been deployed, layers are now obsolete, orphaned emptyDir directories that are no longer in use, lots of other cruft that builds up. You don't have to configure that to happen at periodic intervals; it basically happens automatically when the node reaches a certain threshold — I think it's 85% — and it triggers a pruning, garbage-collection type of action. And I think this is what I'm looking for here... no, that's deployments. Anyway, it will occur at the host level once it reaches a certain point. But there are other things, and we can see here — deployment resources, builds, cron jobs — all of these other things which just create clutter among our objects. So, if I do oc get cronjobs -A and oc get jobs -A, I can end up in a scenario where maybe I created a job, it spun up a thousand pods, and that job is just sitting here, not doing anything. Or oc get pods -A: you can see I've got all of these completed pods in here. They will periodically get cleaned up automatically, but if I've got a high rate of churn — again, a cron job or a job that creates a whole bunch of pods — maybe I want to configure something to go in and clean this stuff up. It's just etcd hygiene, that type of stuff, inside the cluster. And as of 4.8, Johnny, etcd will automatically do compaction as well. So, that whole thing we were talking about before with etcd backups: if it gets really big, and I have a failure, and now I have to replicate a bunch of data, and a lot of that data is just old junk we don't need anymore, it has a ripple effect. So, just hygiene-type stuff. Yep, exactly. And the other thing to consider about jobs is that, without cleaning them up, you can't rerun the same job; you have to go and destroy it first. And if you're not aware of that, you could end up debugging why this thing isn't working. Yeah, oh yeah, that's a good reminder.

Let's see, I've only got, I think — one more, two more. In the interest of time, since we're approaching 12:30, we'll cover these last two. For anybody in the audience: any questions, anything you want us to address, go ahead and send those over now. We'll do our best to answer all of them before we end the stream. Anything we don't get to, please feel free to send us a message — whether it's social media (Twitter: I'm @practicalAndrew, Johnny is @jrocktx1) or email, andrew.sullivan at redhat.com or Johnny's redhat.com address. We'll cover all of that again in a few minutes at the end of the stream, but yeah, go ahead and submit those while we cover these last few items.
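Back on the pruning topic for a second: on the registry side, OpenShift exposes a cluster-scoped ImagePruner resource that the image registry operator runs on a schedule. A minimal sketch, with values picked purely for illustration:

```yaml
apiVersion: imageregistry.operator.openshift.io/v1
kind: ImagePruner
metadata:
  name: cluster              # singleton; the operator watches this one instance
spec:
  schedule: "0 0 * * *"      # prune nightly at midnight
  suspend: false
  keepTagRevisions: 3        # keep the three most recent revisions per tag
```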
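And for the job and pod churn just described, Kubernetes can clean up after itself if you ask: history limits on the CronJob plus a TTL on finished jobs. A sketch with hypothetical names and image, assuming a recent cluster where CronJob is batch/v1:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report                  # hypothetical cron job
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 3         # keep only the last 3 completed jobs
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600     # delete finished jobs (and their pods) after an hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/report:latest   # hypothetical image
```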
All right, the penultimate item on our list: Alertmanager. This is one that I love, because, honestly, I run a lab, right? My clusters very rarely run for more than a few hours — even more rarely more than a couple of days — and what's in there is never anything important; the most important thing might be a configuration I set up to do some screenshots or a recording. So, setting up alerting — and, importantly, making sure you're getting alerts that are important and not noise — is something that's super, super valuable. Getting too many alerts is almost worse than not getting any alerts; alert fatigue is a real thing. One thing that I had never thought of, which one of our field folks sent over, was the concept of utilizing the Watchdog alert for something useful. If you don't know, Alertmanager has this Watchdog alert; it's always firing, and you can use it to test things like alert routing — "hey, I want to make sure that my email config is right, or my PagerDuty config is right," or whatever it happens to be. Point your receiver at the Watchdog alert, and since it's always firing, it'll fire an alert. So what they suggested was using something — and I had never heard of this before — called Dead Man's Snitch, which is a strange name. It's a simple cron-job-monitoring service: so long as it's receiving that alert, it doesn't do anything; it's when it no longer receives the alert that it sends off a message saying something has happened. It's one of those "maybe the cluster fell off the network" things — in the absence of anything else, something's happening over there, maybe you should go take a look. It's too quiet; the kids are up to something. Yeah. Ourhope9, don't worry about the Watchdog; we're aware of it.

Walid: is there a PV reclaim policy somewhere between Delete and Retain — that is, if the PVC is deleted, the PV is released and...? Yeah. So they used to call that — gosh, what was the name of that reclaim policy? There used to be one, and I thought it was deprecated, and it's been deprecated since, like, 1.4... this is not the page I wanted... what am I looking for, Johnny? Not "change the reclaim policy"... yes, it was like Delete, Retain, and then, essentially, those are the two now, right? Yeah, so there was one where it would reuse — maybe that's what it was called, reuse — the PV, and I think it only worked with file-based, NFS-based PVs, where it would spin up a pod to basically mount the PV, do an rm -rf, and then put it back into the pool of available PVs. But it was deprecated for a very long time, and it might have been removed... Recycle! Thank you. Yeah, it looks like it's still there. I find it hilarious that most of the first links — I guess that's what I get; DuckDuckGo, which I think is an abstraction for Bing, sometimes comes up with strange things. So if we do a quick search here: yeah, Recycle is there. Oh, that's good to know. Yeah, the Recycle reclaim policy is deprecated, and again, it has been since as far back as I can remember — I don't think I really started paying attention until 1.6, when CSI first became beta, and it was deprecated back then. And you can see here's the PV recycler pod definition: it's just going to go in and do an rm -rf, with a scrub to make sure it's securely deleted. So, pretty straightforward. Maybe that'll be a future use case there, Walid. Yeah, back with shared NFS mounts — yeah, you can still do that with the NFS provisioner, I think the unofficial NFS provisioner; it basically mounts one NFS export and then creates folders inside of there for each PV. Yep.
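For reference on that reclaim policy discussion, this is where the setting lives on a PV — Recycle may still parse on older clusters but is deprecated, so Retain or Delete are the practical choices. Server and path here are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-01
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain   # Delete | Retain | (deprecated) Recycle
  nfs:
    server: nfs.example.com               # hypothetical NFS server
    path: /exports/pv01
```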
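And to tie the Watchdog idea together, here's a sketch of the relevant slice of an Alertmanager configuration that routes the always-firing Watchdog to a Dead Man's Snitch-style webhook. The snitch URL and token are hypothetical, and the exact receiver mechanics depend on the service:

```yaml
# fragment of alertmanager.yaml (on OpenShift this lives in the
# alertmanager-main secret in openshift-monitoring)
route:
  receiver: default
  routes:
    - match:
        alertname: Watchdog
      receiver: watchdog-snitch
      repeat_interval: 5m                 # keep re-notifying so the snitch stays fed
receivers:
  - name: default
  - name: watchdog-snitch
    webhook_configs:
      - url: https://nosnch.in/abc123     # hypothetical Dead Man's Snitch check-in URL
```

If the check-ins stop arriving, the external service raises the alarm — detection by silence.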
All right, last one. The last one that we had was antivirus. I thought this was an interesting one, because it was also very controversial. Sometimes we have security teams, right, who are — I'm not going to say they're behind the times, but they, or their policies, are very set in their ways. And the policy is: we have an antivirus agent on every single node, on every single operating system, on everything in the data center. That doesn't really work well in something like OpenShift, where CoreOS is an appliance — and not just any appliance, it's an appliance where it's actively difficult to go in and do something like an RPM install. So it was interesting to me that the conversation ranged between "just bite the bullet and do it" — work with the antivirus vendor, and hopefully they have something like a DaemonSet so you can easily deploy the agent to the cluster without having to worry about some crazy machine config or customizing the image or something like that (and keep an eye on the What's Next tomorrow; there's some interesting news about being able to customize CoreOS, I think, that they'll be talking about). So yeah, it's one of those: do we bite the bullet and just do it, or is there some other, better way? One of the folks actually had a really interesting point, and that was: if we think about it, what's the goal of an antivirus? To detect anomalies — to see when something strange is happening that we don't expect. And there are other tools, other things we can use to get very similar results, things like ACS. Now that I think about it, I wonder if that wasn't one of the ACS people who replied. Yeah. So, what's a Kubernetes-native way of finding and detecting that anomalous behavior? We used to call it heuristics; now I think the fancy term is "machine-learning-assisted" or something like that. But looking for things like "this pod always communicates with this service — why, all of a sudden, is it now going outside of the cluster and talking to something?" — using that type of signal to identify those workloads. Another one I found interesting: talking to the folks who run the OpenShift Sandbox and the OpenShift learning site — if you're not familiar, you can go to learn.openshift.com, it redirects over to the developers site, but if you come down here and go to the interactive lessons, like this OpenShift 4.9 playground, it'll drop me into an actual running OpenShift environment after a few minutes while it provisions — they've done massive amounts of work in those services to do things like find crypto miners that work their way into those resources. So again, it's anomaly detection. Back to my original point on antivirus: I understand that sometimes we have policies, and the policy says we must do this, and then the best option is to work with our antivirus vendor to see if they have a Kubernetes-native way of doing it, a DaemonSet or something like that. But it may be more effective — and certainly, the whole defense-in-depth thing — to use an additional tool to see if you can identify those anomalies happening at the application layer, whether that's something like ACS or one of a whole bevy of other partners (all of their names are escaping me because I didn't write them down like I meant to) that offer similar capabilities, depending on what it is you're interested in doing. So: "according to our security team, ACS is not very clear on digital forensics and incident response." Interesting. I don't know anything about that, but I know somebody who does. Sysdig — yep, Sysdig is a good one; Datadog, I think, is another. Mike Foster is the ACS expert I typically interact with, and they have a livestream as well, so maybe we can pounce on them the next time they have a stream and ask them some questions. Falco's another one you can use — it's part of the CNCF too; you can set up rules and stuff like that, it's almost like AIDE, I think. Yep.
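For the "bite the bullet" path, this is roughly what a vendor-supplied agent looks like as a DaemonSet. Everything here — namespace, image, mounts — is a hypothetical sketch of the pattern, not any particular product:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: av-agent
  namespace: av-system                # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: av-agent
  template:
    metadata:
      labels:
        app: av-agent
    spec:
      tolerations:
        - operator: Exists            # schedule onto control plane and infra nodes too
      containers:
        - name: agent
          image: registry.example.com/av-agent:1.0   # hypothetical vendor image
          securityContext:
            privileged: true          # host scanners typically need broad host access
          volumeMounts:
            - name: host
              mountPath: /host
              readOnly: true          # scan the host filesystem without modifying it
      volumes:
        - name: host
          hostPath:
            path: /
```

On OpenShift the agent's service account would also need a privileged SCC — which is exactly the kind of footprint that makes the anomaly-detection route attractive by comparison.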
Let's see — Carlos: in a public cloud, do you recommend any strategies for security when I expose my applications? I don't have any specific strategies or recommendations. I'll say that that is 85% because I don't feel knowledgeable enough on the subject to make smart recommendations, so anything I suggest is going to be secondhand at best — and I think it's probably a topic where we can bring back Kirsten, or invite Mike Foster on again, to have that conversation. The other 15% is that I'm hesitant because most of my experience is on-prem, so I'm not familiar with the full set of tools available in a public cloud to help protect against that. Johnny, I don't know if you have any thoughts. Yeah, I mean, it really depends on where you're at. I think AWS has a firewall capability, but Azure definitely has one that works pretty well that you can use. I think the easy things are going to be certificates — make sure you've got a legit SSL cert on your app. And then, if you're using AWS, you could do — not VPC peering, but there's another term for it that I can't remember — where you essentially run through another VPC, so that you have a filter in front of you that's controlling what's allowed through. That's probably the best path.

Walid, thank you for the chat: "the screen share isn't there most of the time." I just switched over and I'm looking at the Twitch stream, and I see the same thing. So, Stephanie, your concerns on the back end were valid, because I'm not seeing it either. We'll figure out what's going on there, Walid, and see if we can get it fixed; unfortunately, it'll be that way in the recording. But let's see, I thought I saw one other message in here. Stephen Reeves: love the NFS subdir provisioner for the home lab. Yeah, I used to use it all the time too; it was my primary provisioner for a lot of things, until I started experimenting with OpenShift Virtualization — and then having a CSI provisioner capable of doing clones became very useful, because now I can clone the disk for a virtual machine very quickly.

All right, well, I don't see any other chat, and I think we've worked through our list. There are one or two, I'm going to say, minor expansions of things we've talked about here; I'll get those into the blog post. I've got this big bulleted list in our notes document, and we'll include all of that in the blog post so we can get as much of this into your hands as possible, including links wherever appropriate — I know we linked a bunch of stuff today, and it's scattered throughout the stream, so we'll consolidate everything in the blog post and get it out there. Just a quick reminder: tomorrow is What's Next, the roadmap presentation for OpenShift, at 10 a.m. Eastern time, right here on the OpenShift YouTube channel and the Twitch channel. I don't think it goes out across the Red Hat YouTube channel, but keep an eye out; we'll be publishing it on social media and all that as well. And with that, I hope everybody has a great week. Thank you, everybody, for tuning in today, and thank you for all the great chat and great questions — this has been really phenomenal. Johnny and I both get excited when we can answer questions like that; it's one of our favorite things about doing this. It's the perk, yeah, for sure. So thank you, as always, to Stephanie behind the scenes, and thank you, Johnny — I will leave you with the last word. Great show today! Thank you to everybody out there asking all the questions; a lot of interesting topics, and I learned a lot, as usual. I'll be here tomorrow to help answer some questions, and we'll see you next week. You're awesome!