Good morning, good afternoon, good evening, and welcome to another Ask an OpenShift Admin here on Red Hat livestreaming. I am Chris Short, host with the most of Red Hat livestreaming, and I'm joined by my friend and colleague Andrew Sullivan. Andrew, how are you doing today, sir? I am well, how are you? I was looking at the weather this morning, and it looks hot up there in Michigan, so welcome back to North Carolina. It is very Florida out right now, and the heat advisories are in effect for like the third time this summer, so this is very different for us. Well, they said this morning that we had the hottest July ever recorded, and it appears that August is certainly on the same track. I know that has a lot of implications for a lot of people; there have been a number of storms. Speaking of which, to our peers in the Boston and New England area, I hope you're all safe after the tropical storm went through there. I know there are a bunch of fires happening on the West Coast and other places as well; even that smoke is making it here, by the way. Yeah, wow. And I'm US-based, so we hear mostly about the US, but worldwide I hope everybody's staying safe out there, because things are crazy. Yeah, I have been trying to get a family out of Afghanistan for over a week now and it doesn't look good, so I will just leave it at that. But enough about world events; we're going to talk some compliance and security today. We are. So first, hello, everyone. Welcome to the Ask an OpenShift Admin hour. This livestream is one of our office hours series of streams. What that means is that we are here for you, much like a teacher or professor with office hours when you were in school.
We want to answer questions, we want to interact with you all. Whatever is happening in your world, feel free to reach out and ask us or let us know, and we'll do our best to address those questions here on the stream. If we can't for whatever reason, we will follow up and answer them. Every week following this stream, usually on Friday but sometimes a little later, we have a blog post that recaps everything we talked about here. So if you happen to miss something, or if we can't answer a question, at worst we'll put it in the blog post, or maybe even follow up in the next stream. With all of that being said, don't hesitate to ask your questions on whatever platform you're watching on. The magic Restream software consolidates all the chats into one so we can see everything that's going on, and you can converse with people on other networks here as well. So feel free to say hello in chat and ask any questions you may have, including on Discord; I know you've got a link for that somewhere. As Chris mentioned, today's topic of the stream, which, again, don't let that limit you or limit your questions, is compliance and security. I like to think I'm aware of security, but it's a big topic, and even you and I put together and multiplied times ten would be just barely scratching the surface. So we are very fortunate to have three subject matter experts with us. Welcoming back Mark Russell, who was here for our Red Hat Enterprise Linux CoreOS episode; Mark is a product manager in the OpenShift business unit, or I guess now the hybrid platforms business unit. And two first-time folks on the stream: we have Doran Caspin, who is also a product manager, and Juan, but we call him Oz. And Juan, I actually don't know where you sit in the organization. I don't know either. You're on the engineering side, right?
Yeah, okay, cool. So, lots of expertise, lots of brainpower to help answer questions about whatever happens to come up, and we've got a huge list of topics that we want to get through today, so I'm going to waste no time and get into this week's top-of-mind topic list. Let's see how many there are for this week; we'll churn through those real quick. First thing I'm going to do is find, you know, I say this like every week: I wish Zoom would make these little windows bigger so I could tell which one is which more easily. The first one I want to talk about is that the time to update cluster operators has been reduced. Did you grab the link for that? I got it. Okay. So I am on the release notes here. If we scroll all the way up, this is simply the OCP 4.8 release notes, and way up near the top, there it is: "improved upgrade duration." Basically, engineering did some testing and validation and made some changes so that, as you can see here, the time to upgrade a cluster, and specifically the cluster operators, has been dramatically reduced. The magic behind the scenes is actually pretty straightforward. I'm going to pick on this one, the Multus daemon set. Effectively, the change was to set maxUnavailable to a percentage, pretty much across the board.
They're set to 10% or something along those lines. The result is that more hosts are able to update at the same time, which means that as you scale up the number of hosts in the cluster, the time to apply the update stays relatively flat. Do keep in mind, though, that this only applies to the cluster operators. If something, for example, triggers a reboot, you would still be at the mercy of a pod disruption budget, or maybe some other operator or deployment that is preventing the update from applying to more than the available hosts. The important thing is that in many configurations, maybe even most configurations, updates are going to happen a lot faster, which is definitely good news. And did you grab the example link that I had there? I don't know if I actually posted that one into our share, so I'll post that in the chat. Thank you. So that was the first one, relatively straightforward: just be aware that if you were expecting to click the update button, go get lunch, come back, and maybe then get a cup of coffee, now you might just get that cup of coffee. Or maybe just lunch instead of both. Anyway, the next one I wanted to talk about is a question I've been asked. It's come up twice in the last week, but it's come up periodically over the last year or so, and that is: can we use the integrated load balancer that's deployed with on-prem IPI with other deployment mechanisms? Can I use that with UPI? Can I use that with non-integrated deployments, and so on?
So, officially, no. It is not a tested, supported configuration, but it is technically possible. I'm going to come back to OpenShift here on GitHub, and if I search for machine config and we look at the machine-config-operator repo and go to templates, I think it's in common, the on-prem files. Somewhere inside of here is basically the configuration that's used for that, and it's triggered at deployment time simply by having those virtual IP addresses defined. So if, in your install-config.yaml, you just define those IP addresses, it'll automatically deploy and configure keepalived, etc. So yes, it's technically possible, but just remember it is definitely not tested and supported in that respect, so maybe good for a lab or a test environment, something like that. Hey Andrew, do you mind if I go back and bring up something else that you were right near in the 4.8 release notes? I think it's an interesting thing to mention. It's right beneath that, where it says the MCO waits for all machine config pools to update before reporting the update is complete. Previously, we reported that the update was complete when the control plane was complete, so you could still have worker pools that had not finished updating. Some customers may have thought that the update was complete even though some worker pools weren't, and they would then move to another version of OpenShift, say 4.6 to 4.7, or 4.7 to 4.8, and now you have a bigger problem. This way, it will not show you the upgrade button until all pools are completed. The only other thing I want to mention about that is that this only blocks what we call a y-stream upgrade.
So it will block you from going from 4.7 to 4.8 until everything is complete, but it won't block a z-stream, a patch release, if you need to get that out and patch the control plane. Anyway, I just thought that was worth sharing. Yeah, thank you. I remember hearing about that, but I did not know it had actually gone into the release notes. It's a pretty logical assumption: it says it's done, you think it's done. But it turns out we really need to bring people's attention to the fact that not all the worker pools may be complete. Okay, well, thank you. Now, let's see, I'm going to tackle probably the most complex one next. Over the last few days, a large number of folks internally have brought up that either they personally, or their customers, are seeing a particular issue, and it's not just on 4.6; it seems to be happening across multiple other cluster versions as well. Essentially, what they're seeing is a problem being raised inside the console saying that the system memory exceeds what's reserved. Basically it's saying: hey, something on the host is using more memory or more CPU than what has been allocated to it. If you browse through this BZ, there's a large number of comments in here. Well, I say a large number; it only goes down through six. But anyway, it's being addressed; we're aware of it. Now, the reason why I'm bringing this up is because someone in the Kubernetes Slack had asked about this, and what they had specifically asked was: what does the auto-sizing reserved setting
do? So, a couple of weeks ago we talked about this on the stream: you can have the system automatically and dynamically allocate resources to the system-reserved value, kind of exactly what this BZ is addressing. Now, it's disabled by default, and in the documentation here, what we're saying is: go in and create a machine config that sets autoSizingReserved to true. So I dug into this to find out what that actually means, what actually happens when we set this value to true, and what values it happens to use. I'll answer that in reverse order. The first question is: what are the values that it actually uses? It turns out we use a calculation that comes from Google; you can see this here is the same methodology that we apply, and we know that because we have it in the code. I'm in the machine-config-operator again, in the common base files, in the kubelet auto-sizing YAML, and we have this dynamic memory sizing where it goes through and takes, sorry, it starts at 25% of the first 4 GiB, then 20% of the next 4 GiB up to 8 GiB, 10% of the next, and so on, following exactly what is laid out in Google's own documentation. The end result is that if I set autoSizingReserved to true in a machine config, that triggers a systemd unit, which we can see here, to execute this dynamic system-reserved calculation, which spits out a set of values that are then added to the kubelet's own sizing. So, just like we showed last time, if you were to do oc get node and then oc debug, let's see what happens if I can connect into one of these. Got to wait for the pod to be pulled. How good is Azure's internet access?
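For reference, the current documentation describes enabling this via a KubeletConfig custom resource rather than a raw MachineConfig; the following is a sketch along the lines of the 4.8 docs (the metadata name is arbitrary):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: dynamic-node            # arbitrary name for this example
spec:
  autoSizingReserved: true      # let each node compute system-reserved values at boot
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""   # target the worker pool
```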
There we go. So if I do a ps -ef and grep for kubelet, I can see the kubelet's whole configuration, including where it stores its config file. More or less, where I'm going with all of this is: I set this in the machine config, as soon as I find the right one here, and it triggers this script to run when the node is booted, which then sets the system-reserved values for CPU and memory inside the host's kubelet config. So, more or less, the larger the node, the more resources will be reserved. And if I can dig around over here briefly, I'm digging around on my other screen to see if I can find this. There, that's probably the easiest way for me to show these values. If you were to math out all of those values, this is essentially what you would get. For example, with memory: if my node has 8 GiB of memory assigned to it and I turn on that auto-reservation setting, it will reserve 1.8 GiB. If I have 64 GiB of memory on my node, then that auto-reservation would reserve roughly five and a half gigabytes for system-level things. I copied and pasted this a moment ago from the Kubernetes Slack; I will link that thread, as soon as I can find it, into the stream chat so anybody else who wants to look at it can. That's the quick and dirty version of what's going on there.
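The tiered percentages Andrew reads out of the auto-sizing script (which follow Google's node-reservation guidance) can be sketched as a small calculator. This is an illustrative reimplementation, not the script the MCO actually ships, and it ignores any special casing for very small nodes:

```python
def reserved_memory_gib(node_gib: float) -> float:
    """Approximate the tiered system-reserved memory calculation."""
    tiers = [  # (tier size in GiB, fraction reserved)
        (4, 0.25),    # 25% of the first 4 GiB
        (4, 0.20),    # 20% of the next 4 GiB (up to 8 GiB)
        (8, 0.10),    # 10% of the next 8 GiB (up to 16 GiB)
        (112, 0.06),  # 6% of the next 112 GiB (up to 128 GiB)
    ]
    reserved, remaining = 0.0, node_gib
    for size, frac in tiers:
        take = min(remaining, size)
        reserved += take * frac
        remaining -= take
        if remaining <= 0:
            return reserved
    return reserved + remaining * 0.02  # 2% of anything above 128 GiB

print(round(reserved_memory_gib(8), 2))   # 1.8
print(round(reserved_memory_gib(64), 2))  # 5.48
```

Running it reproduces the numbers from the stream: an 8 GiB node reserves 1.8 GiB, and a 64 GiB node roughly five and a half.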
So If you're curious if you're seeing that particular bug, please make sure that support is aware about it So that way we can associate any customer information that way they can prioritize it appropriately to get everything fixed Short term you can also turn on that auto sizing just be aware that it will consume some additional amount of memory and cpu for those system level resources So and just fyi we talked about this before the default So if you don't have anything set is one gigabyte of memory and I think a hundred millis cpus point point one cores So as you saw rates if I enable that auto tuning, it's going to by default bump that up with an eight gigabyte system to 1.8 gigabytes So you would lose some usable capacity there Okay And the last thing that I've got here. Oh, this is a quick and easy one So last one that I've got uh, just real quick. Some folks have asked. Hey, how do I know What will you know if I change something in machine config? What will trigger a node to reboot? We can actually see that in the code So if we look here, we have this function in the machine again machine config operator And we have this calculate post config change action as well as the one above it below it above it I have the wrong link. I'll have to fix this link before we share it Yeah That's okay. It's actually it's actually line 424 So essentially what if you read through this files post can change post config change action equals none So basically says hey if I'm changing the kubit ca dot crt or kubit config dot json Don't do anything right? They get updated right and the system automatically knows that that's something changed Right if I change etsy containers registries.conf then I need to reload cryo or prio However you pronounce it You know so on and so forth. 
So you can kind of walk through it. It's not documentation, not as friendly as documentation, but you can walk through the code here and see: when I change something, what is going to be the resulting action in the cluster? Maybe that's beneficial to you if you're curious. I'm looking into it; I think some of this is documented, but I have seen some confusion out there, because people sometimes expect a machine config to cause a reboot, and when it lands on one of these other actions, they can actually be surprised that it doesn't. Maybe it's a pleasant surprise, but nonetheless, not what they were expecting. It used to be, before 4.6, a much simpler mental model: you changed something, it rebooted, and it rolled the cluster. The one that doesn't seem to be in this code, and that I'm fairly confident about, is pushing SSH keys to nodes; that doesn't cause a reboot now either. And we're looking at ways to either expand this list or perhaps allow administrators to expand it on their own, but it's a difficult question and one that we have to be very careful with. We don't want to put users into a position where they're going to shoot themselves in the foot too easily, but at the same time, they do own their own cluster, so we want to empower admins, and finding that balance can be tricky. There you go, there's the SSH one; it seems there's specific handling for SSH keys. I haven't actually dug into that.
That's interesting. I think it's also because, whether the key is added or removed, any change should not cause one. And the pull secret, for those familiar with the pull secret on your OpenShift cluster, that is another one that doesn't cause a reboot nowadays. That's a good one, because whenever I go in and add or change my pull secret so that I can access the internal builds, it used to have to reboot all the nodes. That's a handy one. The one I think is most useful for the outside world is that certificate swap: the CA certificate that signs the API-to-kubelet communication. That has a one-year lifespan, and prior to 4.7.4, when about 80 percent of that one year had elapsed, around day 292, the cluster would just roll out of nowhere. It added the new certificate early so that you're not in danger, so there's a little overlap, but then at day 365 it would come back and reap the old certificate, and that would be another reboot. The thing that was most problematic about it, and that customers had the most issue with, was that they didn't know it was coming. It's one thing if you actually set a machine config and press the button and then the cluster rolls; it's another thing where people are asking, why are the servers rebooting? So we fixed that. Yeah, and I seem to recall that bare metal deployments also contributed there, because we forget how long it takes to reboot hardware sometimes. Some of the early adopters, and people who are running very, very purely cloud-native,
they don't sweat rolling the nodes, or even rolling the clusters; they can deal with that. But when you have bare-metal clusters, like you said, it could be a 12-minute POST for these servers nowadays, so that reboot can be very costly. Yeah, you've got to check all that RAM, and when you've got multiple terabytes of RAM... So we are looking, like I said, for different ways to knock it down. If folks on the stream want to send in suggestions for particular actions that they feel do not require a reboot, that they are hitting regularly, and that would ease their pain if changed, I'd be all ears. Yeah, and you are absolutely welcome to reach out to me any time: andrew.sullivan@redhat.com, or if you're on the chats, you can reach me on social media, on Twitter at practical, just like my username in the chat there. "640K of RAM should be enough for anybody." You're right, Christian, absolutely correct; it couldn't be more wrong. All right, so I know we've got a little over 30, 35 minutes now. Do we have a hard stop today, Chris? We do. Okay. So, security is, and I'm going to put it mildly, a big topic. Right, it is a huge topic. Just looking at Red Hat and the kinds of things that relate to OpenShift, containers, and cloud native, the topic can span anywhere from the secure software supply chain and developer practices and principles all the way down to low-level operating system things. We're administrators here, so that's the side we're going to favor when we talk about these things. But my point is, you could fill multiple libraries with the books that have been written on this topic, so it's unfeasible for us to do more than scratch the surface. I really want to highlight a couple of things.
One is the security ebook, and I'll post the link to that. This is another step in that direction of how do I get started and learn more about this broad ecosystem. I'll also highlight that Red Hat Summit every year has multiple sessions on the security topic, so you can always go there and find more information, and Red Hat Summit the last two years has had free registration; you can just go to the website. The other one, and this one I haven't actually looked at yet because they announced it like 15 minutes before we started the stream, is that we just released "An Open Approach to Vulnerability Management." It describes how Red Hat works and how Red Hat addresses vulnerabilities across the product lines. While I haven't read it, I can only assume it's going to be a good and interesting read, so I'm looking forward to it this afternoon after the stream. Oh wow, you have time to read during the day? That's lucky. It's in between chats and emails. Yeah, that's the advantage of multiple monitors. Good point; I should take advantage of that at some point. It's a 23-page doc, though; it should be a light read for everybody. Yeah, and I also want to highlight that OpenShift and CoreOS
Um, when we look at the bigger red hat portfolio, you know, of course, you have things that are directly related open shift, uh, acs acm quay Ubi read all of those play a role as well Not just, you know, at rel, of course because rel is the foundation for both coro s as well as ubi But more broadly things like satellite and ansible and using those tools to help manage and implement your security posture but today And the reason why we have these three fine folks joining us is to talk about a couple of specific features or specific things inside of open shift and I want to start with something that we have Talked about we've kind of flirted with it a little bit before and that is the compliance operator So I think and doran, if you don't mind, I think i'm going to start with you and I want to ask kind of two questions here so one what is compliance and What does the compliance operator do for that? Okay, so compliance is the way for Our for us for administrators and and people that managing the IT infrastructure To comply for a specific framework Okay, it's the way that I know that I can run workloads that align with my if I am You mean the industry of of banking? I want to know that my my cluster are PCI DSS, but I can transit Critical numbers if I'm in the in the industry of medical. I want to know that my clusters give compliance It said that the the industry is setting up a standard for compliance or what what controls I want to apply to my Infrastructure and based on this compliance. 
I know that my infrastructure my infrastructure is ready to run this workload so I want to separate between the technical compliance and the physical compliance or more process compliance technical compliance are things that we can accomplish by Setting things in the system like booking IP addresses or booking ports or Not allow specific users to access the resource But there are tons of resources tons of controls that are physical resources or user resources like We're allowed to log into the system. We're allowed to go to this to the data center all kind of things that are we cannot control or we cannot We cannot control by uh by technical controls So compliance support is focused on on technical controls We are enforcing and we we scan for controls. We're looking for for Things are that are part of the standards. We Very currently we have the cis benchmark standard. We have the uh, we are working on fed ramp and And PCI DSS and we have additional compliance a compliance framework what we already developed and We are allowed to run scans and scan the the cluster for this compliance and see if the cluster is compliant We also Allow to do we also do remediation. So which we found things in the cluster but are not up to the standard We the compliance operator Can can update the cluster run scripts to fix things that we can we can fix of course some of the Some of the fixes can be destructive. So we need to be very careful about it and we working on our team and Was in in leader is working very hard to make the compliance And we add more and more controls and and more compliance libraries to to our to the compliance operator So what's there anything you want to write anything? I mean, I think Uh, you summarized it quite well I mean the thing that we got to think about is that compliance standards are there for a reason, right? Normally you want to protect certain data PCI DSS. You've got to make sure that our credit card information is secure Right. 
HIPAA: medical records. Or NERC CIP: that the important assets of a power plant are secure. With all of that in mind, you can imagine that there's a big, big list of requirements of "you should do this," and they can be quite vague. So a big part of our job is trying to translate that into: hey, if I want to secure this, or if I want to isolate our credit card data and the workloads that hold it, we should be using taints and tolerations in OpenShift and use a network policy, right? So the translation from what would look like a vague or arbitrary requirement into a specific control or a specific setting in OpenShift is basically what the Compliance Operator is doing and what my team does. And I think you're headed towards OpenSCAP as that definition of what each of those compliance standards looks like. Exactly. Okay, and then is that translated into OpenShift and Kubernetes actions by the Compliance Operator, or is there something else there? And forgive me, I'm unfamiliar with OpenSCAP aside from it being kind of a repository, but I understand that maybe I need to have HIPAA compliance, so I'm going to deploy the Compliance Operator and apply the HIPAA standard. How do I know what that translates to from a requirements-to-actions standpoint? Sure, that's an excellent question, and don't worry about not being acquainted with OpenSCAP; there are like five people in the world who know that stuff. Come on. So, one thing that I want to put forward beforehand is that we didn't want to reinvent the wheel, right?
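The network-policy control Oz mentions for isolating cardholder workloads can be as simple as a default-deny ingress policy; here is a generic Kubernetes sketch (the namespace name is hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: cardholder-data   # hypothetical namespace holding the PCI workloads
spec:
  podSelector: {}              # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress                  # no ingress rules listed, so all ingress traffic is denied
```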
I know a lot of people have asked us questions like, why OpenSCAP? The main thing is that it's already an accepted standard; well, SCAP is a standard, OpenSCAP is just an implementation of it, and we basically try to make the community flourish by adding yet another product that it checks. So in ComplianceAsCode, which actually compiles to SCAP, they check stuff for RHEL, SUSE, and nowadays OpenShift and OpenStack; they have some Apple checks, even. There's a lot of stuff in there, and basically what they do is try to come up with a way to translate security requirements into stuff that's automatable. So we're going to have data such as: we're checking for this specific setting in the cluster, and we're checking it because it might be a way that an attacker gets to you. Right, the description and the rationale. We also have information about how to manually check for something, which is going to be used by an auditor, or maybe you want to check it yourself, like "oc get pods and give me the ones that are privileged," something like that. So we have all this information, and we also have automated checks; there is a standard called OVAL, which is part of SCAP, that allows you to do this kind of thing, and we're going too deep into details. We also have references. A profile is going to be a collection of rules, and those rules carry all of the information I described before. So you're going to say, "I want to be compliant with the moderate profile," and that's going to run a lot of rules: are you enabling these audit rules on your host? Are you enabling etcd encryption? Do all of your namespaces have appropriate network policies? And it's going to give you a report, and that report will contain the reference from NIST that you're complying with, so eventually you can bring that to your auditor, and they're going to be checking it.
Oh, so you need to meet SI-7? All right, cool, here it is in the report, and so on. You're cross-referencing what comes from NIST and what comes from the Compliance Operator, and it's not only NIST: we have references for PCI DSS, we have references for NERC CIP, and ANSSI is coming. I don't know if I answered the question, but we don't check for a specific control from a standard; we check for a rule, and that rule happens to reference a control from a standard. Got it. Yeah, I'm most familiar with the CIS benchmarks, their set of rules, and the automation used to check and also apply them. So in my head I map this over to: this is just a standardized way of defining those rules, associating them into what I'm going to call an inventory of rules that make up a compliance baseline, and then how to check and ultimately remediate against those. Is that accurate? Right, exactly. And one thing, back to the remediations that you asked about: normally in OpenSCAP there is a thing called a fix, and a fix can take many forms; you have several systems to apply a fix. SCAP actually supports Ansible, so you could generate playbooks out of it. SCAP supports bash scripts, and there was another one that I forgot; Anaconda, I think, so you can form your Anaconda definition there. In our case, we introduced something that we call just a Kubernetes fix, so we only deal with Kubernetes objects. We try to play nice with the cluster; in OpenShift 4 everything is a CRD, everything is a resource, and we make sure to play nice with that. What's going to happen is that if we detect a failure, we get the fix for that failure, that turns into a Kubernetes object, and the Compliance Operator will either report it so you can apply it yourself, or it can apply it itself, depending on what you want; we give you that flexibility.
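To make the profile-to-scan flow concrete: running a profile with the Compliance Operator generally means binding a profile to scan settings. Here is a sketch following the operator's documented ScanSettingBinding API (treat the binding name as an example; `ocp4-cis` and the `default` ScanSetting are the names the operator's docs use):

```yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: cis-scan                          # example name for the binding
  namespace: openshift-compliance
profiles:
  - apiGroup: compliance.openshift.io/v1alpha1
    kind: Profile
    name: ocp4-cis                        # the CIS benchmark profile for OpenShift
settingsRef:
  apiGroup: compliance.openshift.io/v1alpha1
  kind: ScanSetting
  name: default                           # scan schedule/storage settings shipped by the operator
```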
Yeah, so we're not going to be changing stuff under you; we're going to be using MachineConfigs, we're going to be using patches towards, you know, the API cluster object, and so on. And just to be clear, all of this, the Compliance Operator, is one component that works in conjunction with ACS, Advanced Cluster Security, that works in conjunction with ACM. I see Christian in the chat, right, that ultimately can work with things like GitOps. I'd be curious to know your thoughts around, you know, is there a preferred way? Do I deploy the Compliance Operator, tell it to, you know, check for and then remediate against this standard, and is it worthwhile to just let it control that configuration? Or should I take that applied configuration and then manage it using something like GitOps? We try to be flexible and just give folks the power to do whatever they want. Honestly, I've been experimenting with GitOps quite a bit recently. And so the Compliance Operator and the SCAP data stream, which is the collection of all the rules and everything, is a big repository of fixes, or best practices, for security standards. So there is no reason why you can't use it with Argo CD, or with Hive and stuff like that. So we do have work ongoing about bringing that together, so you would be able to do one command, like "oc compliance fetch-fixes," for Argo CD, and it'll just download it.
You can put it in Git and then handle your multiple clusters with it, right. So it will be possible, and you don't exactly need the operator to do that, but you are going to need the operator to keep generating scans, and the operator is actually able to leverage all of that information and tell you, hey, you're compliant; or if you go out of compliance, it'll just give you a fix for that. Yeah, you just said what I was getting ready to say, which is, you would still want to use the operator for, if nothing else, auditing, and having that set of logs. So I'm going to change directions a little bit, and I'm going to move towards the File Integrity Operator. And this is one that is really interesting to me, and Mark and I have had conversations about this before, because in the early marketing materials that we had for CoreOS, we talked a lot about how it's an immutable operating system. And, you know, so having this operator, whose role is basically to check for and tell you if something changes: how does that mesh with, how does that resolve against, this immutable operating system? So I don't mean to put you on the spot, Mark, but I'm gonna put you on the spot. Sure, it's one of my favorite topics.
So we do sometimes call it controlled immutability. And if you check the actual documentation, the definition of what we have is that, you know, we distribute and install a read-only /usr, so all the OS binaries. In that sense, everybody at whatever rev of the cluster you're at, you are running a specific version of RHEL CoreOS that's been tested with that version; you're running the same version of that /usr as everybody else running the same version of that cluster. /etc and /var, for different reasons, are read-write. /var obviously needs to have some state, needs to have a place to cache containers. /etc is there as read-write for configuration purposes, though the idea is that you should only be changing it through MachineConfigs, and not manually. And so where the confusion comes in, and I think where you're getting at with the question, is there is sometimes an assumption that the MCO, the Machine Config Operator, will tell you any time anything has changed, and that that's what it does. And it does not; it implements the config, and it looks after the files that it controls. So if it has pushed a file, and that could be ones that you choose to push, and that could be ones that are built into MachineConfigs in the system. If you'd like to look, I mean, we could bring this up, but it's gonna be a little ugly: every pool has a rendered config, and what a rendered config is, is basically taking all the snippets of configuration together and putting them into a single configuration, and you can see the files that it manages. Um, please don't ever delete these. If you do delete them, please call Red Hat support immediately after, right. Yeah. So it pushes these, and, you know, it's base64-encoded, most of the contents, in some cases. So if these files are changed, the MCO will notice them the next time you go to push out an update. It does not
currently, today (we're looking at changing this), it does not check on a regular basis. It's when you push out a new configuration, or when you push out an update, that it checks to see: hey, before I make this change, does it look the way I'm expecting it to look? Yeah, and if not, maybe we should stop here and tell the admin, you know, dear human being, you might want to look at this. Yeah, and just to reinforce that, I've seen two instances of that happen just recently. One was a customer who had modified the SSH keys, right; they had debugged in and added a different SSH key, and of course, when the MCO went to remediate, it said, I don't know what this is, and came to a halt. We are looking, on that particular note, to make that check more regular, so that in the case somebody has done, you know, a test modification, you forget about it, and then it's going to rear its ugly head later when you're going to update during your maintenance window. So we don't want that; we want you to know as soon as possible that there's something unexpected in the on-disk state. And this is where I think the question will segue over: if you do want to look at, you know, you want to know anything that changes under /etc, uh, you know, and I want to watch these files specifically, that's where the File Integrity Operator comes in. Yeah, and so that's really the big difference, right: the Machine Config Operator and the MachineConfigs only care about the files they've been told to care about. Um, so, you know, yeah, we've got this massive file here, you know, with lots of things; it could be a little bit tricky to figure that out, but yes. Um, I think the takeaway, though, is that it's not looking for any change under /etc if the MCO is not managing that file.
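For reference, a file pushed through a MachineConfig, and therefore one the MCO does watch, looks roughly like the sketch below. The file path and base64 contents here are hypothetical examples, not something the platform ships:

```yaml
# Hypothetical example: the MCO renders this into the pool's config
# and will notice if the resulting on-disk file has drifted the next
# time it reconciles the node.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-chrony
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/chrony.conf
          mode: 0644
          overwrite: true
          contents:
            # base64 of "server ntp.example.com iburst", as mentioned above
            source: data:text/plain;charset=utf-8;base64,c2VydmVyIG50cC5leGFtcGxlLmNvbSBpYnVyc3QK
```

The rendered config for the pool is the concatenation of snippets like this one, which is why the rendered objects look so large and base64-heavy when you inspect them.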
It's also not going to care that you have created, you know, /etc/mycoolconf.conf. Yeah, it's just not going to notice. So whereas the File Integrity Operator, and now I'm getting out over my skis a little bit: the File Integrity Operator, essentially, when it's enabled, goes and creates a database of, I'm going to say hashes, I'm guessing hashes, of what those files look like, and then it periodically rechecks those, and I don't know what that interval is, and basically checks to see if anything has changed at all. And if it has: you know, hey, this changed, did you know this changed? That type of thing. I have a diagram of how the file integrity checking process works. So one question I do have for you, Mark, while Doran is bringing that up, is: the way that CoreOS does updates, right, is rpm-ostree. So we know that /var and /etc are mutable, right; they can be changed. When we do those rpm-ostree updates, when we switch to that new tree, will it replace, will it reset, any of those files under those two file systems? I know it does not. Okay. Yeah, we distribute what we call the machine-os-content, and that is /usr. Okay, so essentially an update, through rpm-ostree, will only affect /usr. All the rest of it is more or less left to its own devices from an update perspective. You know, I'm trying to think if, you know, within... yes, it should be. I mean, in theory there could be a RHEL package that updates a configuration file midstream, in theory. I don't think that happens a lot, though, especially within a RHEL major, right?
Like within a RHEL 8, which is where we take our content for RHCOS. But anyway, so yes, I mean, it is a theoretical change there, but no, in reality it's just going to be /usr, and /etc is going to stay maintained the way you had it, based on the MCO and the rendered config for that pool. Got it. So yeah, I'm learning new things all the time. See, this is the problem with Andrew no longer being, like, a full-time, like, "my paid job is to be an administrator": most of the clusters I create might live a week. Yeah. And of course, they're all lab clusters, they're all test clusters, so I generally don't go through the process of deploying and testing and seeing all of these things in action. Um, and Mark and Doran, I know, you know, and I briefly mentioned this on one of the other streams: you all, the PM team, have a cluster that you use and that you maintain, like, long-term running, so that way, you know, you all mostly get that hands-on experience with the product, with all of the features, with everything else, which I think is really cool. I was going to say, and what's better or worse, it has every option installed together; this is the top-trim version of OpenShift. So, yeah, it's, uh, never a dull day, right, when it's your... is it two-week rotations for who's responsible for the cluster? Yeah. So, Mark, if you don't mind, I want to talk a little bit more about some CoreOS things. Um, so one, as you mentioned, it's built on top of RHEL. We've mentioned that a bunch of times, because we inherit things like drivers, hardware compatibility, um, a lot of the other security features from RHEL itself, and that includes things like SELinux. Um, and I know SELinux was, uh, we recently brought it up because of the NSA hardening guide. So they had published that hardening guide, and I think we've got a link somewhere.
Yeah, I had a link somewhere. Um, and I kind of, I think most of us looked at it and said, oh, well, most of those are already applied if you're using CoreOS and OpenShift, right, either through SELinux or through other mechanisms that we have. So I thought that was really interesting. I don't know if you have any other comments around SELinux being used with Kubernetes and OpenShift or not. I just feel like, uh, just to back up, you know, how well it works, um, you know, how we build off of RHEL and how important that is. So, you know, I'm far from the deepest SELinux expert, even at the PM level, but what I do know is that all the core services that we get in RHEL CoreOS come over from RHEL, and we have years of experience in isolating those services from each other, isolating them from user workloads, um, and so we protect them from rogue activities on the system. We protect them such that if one is compromised, that compromise doesn't lead to further compromises on the system. Um, in the case of OpenShift, really, I'd say that the layer on top of that is that we also now have years of experience managing, you know, and extending, the isolation between containers, beyond what container isolation right out of the Linux kernel gives you. And, you know, so we have policies that separate all the platform containers, SDN, etc., um, that, um, you know, again protect customers and their core systems from flaws in one component or another. Um, I think there's something else I sort of wanted to throw in there. Um, it'll come back to me. I have that problem all the time. But I think this was it.
Um, we've never had, uh, to my knowledge, and, you know, feel free, anyone, to jump in, but we've never had, like, a runC container escape that wasn't already at least mitigated, if not prevented, out of the box by our policies. Um, that could be the read-only /usr, which can help in some instances, but typically it's SELinux that stops these CVEs from being very exploitable on the platform. Sorry for the sales pitch, but it's true. Well, no, you know, when we were doing the notes and prepping for this, uh, I brought one up, and I'm like, I remember there was one that we just talked about. And I found it, uh, which was, there's an RHSB and an appropriate CVE, which I just posted in the chat. It was the one that we talked about here on the stream, actually: long path name and mount point flaws in the kernel. And, like, if you scroll down in this thing, and here, I'll, uh, let me copy it over to the right browser, and then I'll share my screen real quick here. So if you scroll down in this particular security bulletin and look at the mitigation, the mitigation is basically: you're using RHEL, congrats, you've mitigated it. So, um, I thought that was kind of a great example of, you know, the underlying platform does matter, right? Which Linux you use does matter. Um, and yeah, to your point, I know it's a bit of a sales pitch, but, um, it's important. I just want to throw in, though: "long path," like, it exceeds one gigabyte. "Long path" doesn't even describe it if your path is a gigabyte. I mean, that's really impressive. Um, let's see, where did I get lost out here? So, okay. Um, the other thing I wanted to ask you about is FIPS. So I think FIPS compliance was added to CoreOS back in, like, the 4.3 days. Oh, yeah, 4.2 or 4.4. Yeah. So it was pretty early on, um, and so my understanding of FIPS is, it's a U.S.
Government thing around encryption, and specifically there are a bunch of different levels that I've never quite understood myself. Um, and that all plays a role in, like, FIPS 140-2, you know, or Level 1, Level 2, Level 3, so on and so forth. So can you elaborate, or help educate me, on FIPS mode in OpenShift and what that means? Uh, sure. I mean, I know it more from the RHEL CoreOS point of view, so if anybody else wants to add on, I will not be hurt. Um, what we have is, you have the FIPS-validated modules, and the modules in process. Um, so, you know, we only ship one or the other, depending on where it is in the cycle. What that means is that, yeah, you're using FIPS-validated kernel crypto modules, OpenSSL. And when you are using a UBI container... so not only does that mean that the services on the nodes will be running in FIPS mode, meaning they're only going to be using those ciphers, the set of ciphers from those modules; it also means that if you're running UBI containers on top of OpenShift, they will be automatically using the correct ciphers as well, I think. I don't know if, uh, Oz, you want to add anything to that, but it seems like I wasn't too far off. No, that was about right, and that's the main challenge usually with FIPS in general: if you run a funky container with some funky base, uh, it will not inherit FIPS, because it has no way of knowing that. Whereas if you use a RHEL-based container, like UBI or anything like that, it actually detects that you have FIPS enforcing, it configures the crypto policy in the container itself, and makes sure that you have everything right and ready to go, and you don't need to worry about it at the app or container level. What I do want to point out, and don't quiz me on the difference, I don't know, but I think we are already getting to the point where it's 140-3, not 140-2.
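For context on how FIPS mode gets turned on in the first place, it is a day-one install decision in OpenShift: a single flag in the install-config enables it for the nodes. A minimal sketch, where the domain and cluster name are placeholder values:

```yaml
# install-config.yaml (abbreviated, illustrative values)
apiVersion: v1
baseDomain: example.com
metadata:
  name: mycluster
# Enables FIPS mode on the RHEL CoreOS nodes at install time;
# it cannot be toggled on after the cluster is installed.
fips: true
```

With this set, the nodes boot with the FIPS-enabled crypto modules described above, which is also what lets UBI-based containers detect and inherit the FIPS crypto policy.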
And I think maybe this is not super well known outside of the bubble, but you can't submit beta versions. So when people are waiting for full FIPS validation, it takes a while, because we are at the mercy of NIST. And not only that, we can't submit until it's GA; we can't give them a beta early so that we can come roaring out of the gates with, you know, "RHEL 8.x is FIPS validated." We have to release it and then submit it. A strange question for you: has that process been slowed down at all by, like, coronavirus and all of that? It was, and I think there was some split attention. I don't know if anybody else can add to that. There was some split attention: uh, coronavirus, but also the development of 140-3 was happening simultaneously. And also, we are not submitting the full platform; we're submitting only the crypto modules, because, you know, FIPS compliance for the whole platform is a little bit problematic. And it's basically for customers that want to use OpenShift in the federal space, or at other federal compliance levels; it's the basic building blocks for federal. So I'm going to ask a quick question, and I'm not sure who will be able to answer it, so I'm not going to deliberately throw anyone under the bus. I know a lot of times, as administrators, one of the frustrating things that we first encounter when first using OpenShift is: no root user in containers. Yeah. So I know it's related to a number of different things: SCCs, SELinux, and all that other stuff. So I'm curious if anybody has kind of a terse, 30-second version of the perspective on that. Because I think a lot of people, you know, might think, oh, SELinux, right, it should protect me; why does it matter what user ID
I'm using, if SELinux is in place, and all these other things? All right. Okay, well, as expected, so, no, I'm not gonna force anybody to put anything out there. So I'll poke some people and see if we can get an answer, and I'll include it in the blog post. You're talking to infrastructure people sometimes, and sometimes it's like, wait, whoa, whoa: apps? I know, right? I have that issue. Inside the container... So, well, my dog is growling, which means that it's probably time to go. Just to answer real quick, because I was hoping that somebody else would answer: in case there's a container escape, your user ID is going to prevent you from accessing anything from the host. SELinux is going to help, but there are things that your container is still going to have access to even if you're using SELinux, right? So having a non-root user is going to prevent you, or prevent a container, from accessing anything else. So it's just a defense-in-depth approach to security, where, hey, in case you bypass SELinux, you still have to get past the next layer, which is your standard Unix user ID. Yeah, it makes sense, even if it is a bit frustrating as a first-time OpenShift user, when you can't just go, you know, use this Docker image that comes from somewhere, anywhere, or even my own ones, right? I have to go in and set that UID correctly in order to get it to work. So it's one of those things: it is a valid thing, even if it can be a little bit frustrating, maybe even a little confusing, at first. So it is almost the top of the hour, which means that we do have a hard stop today, so we need to close out the stream. So first and foremost, thank you very much to our guests today. So Mark, Oz, and Doran, really, really appreciate you coming on, really appreciate the time that you gave us today. I can't thank you enough. Oz especially, I know you came in very last minute, so thank you so much for that. To our audience, thank you very much for your attention today.
We really appreciate the interactivity. I've seen the chat scrolling by; y'all have been great. So I will take all of this information, I will put links to specific spots in the video, and keep an eye on the blog post. That's cloud.redhat.com/blog now. So I will have that blog post published, hopefully Friday; if not, maybe early next week. And with all of that being said, I hope everybody has a great day, a great week, and I'm going to steal Chris's line, which is: stay safe out there. Awesome. Thank you. Take care.