Hello, everybody, and thank you for joining. My name is Juan Antonio, and we're going to get started. Hopefully you're here because you're interested in compliance and OpenShift automation; if not, you're still welcome to stay, but I understand if you got here by mistake. That's not an issue — it happens. We have a pretty big agenda for the limited time we have, but we'll get through it, and hopefully we'll answer these questions. If not, there's still a Q&A time slot at the end.

First, a quick introduction: who are these weird people talking to you? My name is Juan Antonio Osorio Robles, but that's pretty long, so you can call me Ozz. I'm a Mexican living in Finland, and nowadays I mostly work on OpenShift security and compliance. So that's who I am.

I'm Jakub Hrozek. I work on OpenShift security and compliance with Ozz, and before I started on this project, I worked for a long time on SSSD and FreeIPA.

Right, so now we'll go very quickly over why we're even doing what we're doing. Jakub?

All right. The work we're doing is making sure that OpenShift is able to pass compliance tests — that it is compliant — and we're doing this because unless a piece of software is compliant, it can't be used in certain industries or environments: think banks, militaries, governments. Our biggest user — or customer, whichever way you want to take it — is the North American public sector, and that has implications for which standards we're trying to be compliant with first.

While we're doing this software work and automating things, in theory you could just download a standard, go to your cluster, and make it compliant yourself. But that's already quite hard work when you're doing it for one machine — one VM, one server. If you bring it up to cluster scale, it becomes even harder. And with OpenShift, you're not just trying to make the Kubernetes layer compliant — the cluster level — but underneath that you have the nodes the cluster runs on, and underneath those there are further layers that we don't necessarily cover with the Compliance Operator: you might need to make sure the hardware you're running on is compliant, or that the people operating the cluster pass background checks, and so on. That's outside our jurisdiction, but the point is that making the whole system compliant is very hard, and it gets much harder at cluster scale.

What makes it even harder is that Kubernetes as an upstream project, and OpenShift as a product, are very fast-moving, and the standards you need to comply with are quite big. So if you do all this work manually, the next version ships just as you finish, and you might as well start over. It's prudent, then, for the compliance work to be done in an automated fashion — if possible, with the click of a button.

The last thing is that while the standards themselves are often freely accessible — you can download them and process them yourself — they are not easy to read. They're written in a sort of lawyer English, and you need to go from that to concrete configuration directives. It's not easy to process a standard and figure out: how do I get from this sentence in the standard to some configuration setting on my cluster?
And this is all handled for you by the Compliance Operator and the ComplianceAsCode project. Next slide, please.

The team we're part of is called Infrastructure Security and Compliance — that's me, Ozz, and another developer, Matt Rogers, who may be listening but isn't presenting. What we do, on one level, is go through these standards — these compliance controls — and try to codify them as compliance content that can be evaluated and automated by the Compliance Operator. The Compliance Operator is the other big part of what we do: it's an OpenShift operator that runs in your cluster, takes the compliance content, evaluates it, and based on that content tells you whether you're compliant — and how compliant — and helps you mitigate any issues you might have.

As I said earlier, a big part of what we do is for the North American public sector, so the standard we're starting with is FedRAMP Moderate; it's also known by the related NIST control catalog, 800-53. That's not all we're doing, but we're starting there because the other standards — even outside North America, like the Australian Essential Eight and whatever else might be used around the world — are largely based on those. So it's not as if we do the North American public sector work and then start over on something completely different; by starting with the federal standards, we implicitly cover much of the other work.

Besides the Compliance Operator, which we'll talk about in detail in this presentation, we're also working on another operator for file integrity of the cluster nodes, called the File Integrity Operator. It's not mentioned much in this talk — or at all — but it's a cool project too; check it out if you're interested. Next slide, please.

Okay — yeah, sorry about that, right. So now we'll talk a little bit about what we actually did, which is the whole reason for this talk. As folks might know, there's this cool thing in OpenShift — well, in Kubernetes in general — called operators. To some extent we did develop a lot of the machinery for the operator to run scans in the cluster in a more automated and friendly fashion. However, we can't take credit for everything, so I'll mention that there are three big pieces to all of this work.

The first, of course, is the operator. The operator is a controller that listens for specific objects, helps you get into compliance, tracks your compliance, and gives you guidance on how to close any gaps.

Then there's OpenSCAP, which is the tool we use to actually perform the checks. It's a nicely certified tool for following policies, encoding them in a language called SCAP, and evaluating them. It's a project with a long history, so we can't take credit for that one either, but we work closely with those folks — a great team as well.

And finally, of course: what is a compliance tool without content? You want to run checks, and you want explanations of what you're checking — that's what the content provides. There's a project called ComplianceAsCode where you can write content for different platforms, and that's what we use; a taste of what a rule definition looks like follows below.
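To make "content" a bit more concrete, here is a simplified sketch of how a single rule is described in the ComplianceAsCode project — a rule.yml file. The keys follow the project's conventions, but the wording and the identifier here are placeholders, not copied from a real rule:

```yaml
# rule.yml -- simplified sketch of a ComplianceAsCode rule definition
documentation_complete: true

title: 'Restrict Access to Kernel Pointer Addresses'

description: 'Set the kernel.kptr_restrict sysctl so that kernel addresses are hidden from unprivileged users.'

rationale: 'Exposed kernel pointers make kernel exploits easier to develop; restricting them raises the bar for attackers.'

severity: medium

identifiers:
    cce@rhcos4: CCE-00000-0   # placeholder identifier
```

The project's build system compiles rule files like this, together with profile definitions, into the SCAP datastream that the operator ultimately consumes.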
So, getting back to the operator: we expose a lot of custom resources to help you automate your compliance story. It looks very overwhelming — and it kind of is — but don't worry, there's only a subset you need to care about, and I'll go through it in a little bit, so you don't need to worry about that either.

The first thing you want to do when you're trying to reach compliance is figure out: hey, what am I going to comply with, and how am I going to do it? That's the first step, and it's really the only thing that you as a user need to pick and choose for your cluster.

The first object, of course, is a Profile. A profile is the definition of a benchmark, so to speak, and it contains a list of rules. For this use case we're choosing the Essential Eight benchmark; the profile carries the appropriate descriptions, some metadata about what it applies to, and ultimately the specific set of rules you're going to check — they're defined right there.

After that you have a Rule. The rule defines what specifically you're going to check, which is the most granular thing. It could be checking file permissions; it could be checking for a kernel argument; many things are possible. Rules are included in profiles, as I mentioned. In this example the rule also includes a description to help you audit what you're doing, plus more metadata such as the severity and even a warning — and you'll see all of this in the final report. And of course a big thing is: how do I fix this? Here the fix is a MachineConfig object, which is an OpenShift-specific object that lets you configure the operating system. The fix is included as part of the rule: if the scan finds this issue, this is how it gets fixed.

One thing to note is that we try to be friendly and ship predefined profiles you can take into use out of the box — we're currently testing those to make sure they're solid. But in reality, no size fits all, and more likely than not you're going to have to tailor your profile, meaning change it: add rules, remove rules. For that we have an abstraction called TailoredProfiles, which lets you do just that. In this example, maybe you want to play a joke on somebody and set SELinux to permissive — please don't actually do that, but it's part of the example, and it's exactly what this rule does.

Finally, as I mentioned before, you also want to configure the operational settings of your scanning: how much storage are you going to allocate, how often are you going to scan your hosts — in this case, every day at one in the morning — and which node roles in your cluster you're going to scan. All of this gets defined in a ScanSetting. We provide a default, but more likely than not that won't be enough for your case: you'll probably have different roles or want a different result-rotation strategy, so you'll define your own — roughly like the sketch below.
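A minimal ScanSetting might look like this. The field names follow the Compliance Operator's CRD, but treat this as a sketch — the object name is hypothetical, and you should check the documentation for your version:

```yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSetting
metadata:
  name: my-scansetting            # hypothetical name
  namespace: openshift-compliance
rawResultStorage:
  size: 1Gi                       # PVC size for the raw ARF results
  rotation: 5                     # keep the five most recent raw result sets
roles:                            # node roles (MachineConfigPools) to scan
  - worker
  - master
schedule: "0 1 * * *"             # cron syntax: every day at 01:00
```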
With the settings in place, the final thing you do is specify your intent, as with a lot of Kubernetes objects. You say, for example: I want to comply with the Essential Eight profile. And that's it — you bind the settings to the profile, as a way of saying "please scan my hosts with these specific profiles," and the operator will just do it.

The next thing you want to do is keep track: you want to see how the scans are doing. We have an object called a ComplianceSuite that lets you do that. It keeps track of the phase the scans are at — are they done, are they running, are they aggregating results? It also emits events, so you could write a small service that listens for an event when a result arrives and acts on it: maybe send a mail to your SRE, or just copy the results somewhere else — offload them, for example. That's the intent.

And finally, of course, we have objects for the results. One of them is a remediation, which you can apply, and the operator will help you do so. Next, you have the check results themselves, with metadata about what happened. And finally there are the raw results, which come in a format some auditors are used to seeing; we store those in a persistent volume as well. I'm not going to go through examples of those here, but you'll see a bit of that in the demo.

Next is the how. Jakub?

Right, so the "how" section tells us how these scans are actually performed at the technical level. It sort of goes up from the bottom — from the low-level scans to the abstractions Ozz mentioned a while ago.

Before you have a scan, you need content. The content is developed in the ComplianceAsCode project and compiled; the result of the compilation is an XML file called a datastream. The datastream is put into a container image, and the container image is pushed to a registry. Now, while you can reference all the low-level OpenSCAP objects directly in the compliance scan API objects, that's very user-unfriendly, so there's a ProfileBundle object that we use to encapsulate the OpenSCAP datastream — the compliance content. What comes out of a ProfileBundle are all those rules, profiles, and variables that Ozz was showing earlier. On startup, the Compliance Operator creates two default ProfileBundle objects: one for OCP, meaning the Kubernetes cluster level, and one for the nodes — and by default we support RHCOS, so there's an OCP profile bundle and an RHCOS profile bundle. These exist for usability: instead of referencing the long OpenSCAP identifiers that nobody can be expected to remember or type, you just reference the rules and profiles that are part of these bundles by name. Next slide, please.

Okay, then the lowest object on the abstraction level is a ComplianceScan, which represents a single scan. A single scan covers either the Kubernetes API objects or a set of nodes — typically you'd scan a MachineConfigPool, because all the nodes in a pool are configured the same. The scan represents an orchestrated OpenSCAP scan: there are scanner pods that actually run OpenSCAP, and they're fed their instructions through the content, typically by referencing the profiles and rules and so on. Then OpenSCAP itself runs and produces the results.
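Before we get to those results: the binding Ozz described — the single object a user creates to kick all of this machinery off — would look roughly like this. The names are illustrative; rhcos4-e8 is one of the profile objects the default bundles provide:

```yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: e8-binding                 # hypothetical name
  namespace: openshift-compliance
profiles:
  - apiGroup: compliance.openshift.io/v1alpha1
    kind: Profile
    name: rhcos4-e8                # Essential Eight profile for the nodes
settingsRef:
  apiGroup: compliance.openshift.io/v1alpha1
  kind: ScanSetting
  name: my-scansetting             # the ScanSetting sketched earlier
```

Once this is applied, the operator generates the suite and the scans — the machinery being described here.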
There are sort of two kinds of results produced, and we'll talk about them in more detail later. Briefly, for now: there are compact results that just say whether something passed or failed, and those can be turned into results as Kubernetes objects. And then there's a much bigger result — an actual result file called an ARF file — that auditors are normally used to. That one is too big to be stored in etcd as a Kubernetes object, so it's offloaded to a persistent volume; how and why exactly, we'll cover later in more detail.

And because, especially when you're scanning nodes, you might scan many nodes but want the result for the whole set represented as a single result — you don't want to see 1,000 results for 1,000 nodes — there's a pod called the aggregator. It looks at each of the compact results stored in ConfigMaps and aggregates them into a single result, so for each scan of a set of nodes in a MachineConfigPool you end up seeing one result.

Okay, as we've mentioned a couple of times, there are two kinds of scans: you can scan either the nodes in the cluster or the Kubernetes API objects. This distinction is represented as two scan types: one is the node scan and the other is the platform scan.

The node scan looks at the cluster nodes themselves — typically that's RHCOS. It works like this: the pod that performs the scan is a privileged pod. It mounts the host file system at a known location inside the pod, and OpenSCAP then runs more or less as if it were scanning a chroot where the node's file system is mounted. In this case you have one pod with OpenSCAP — one scanner pod — per node in the cluster.

The platform scan scans the Kubernetes API objects. It runs just one singleton instance of the scanner pod, and that one is not privileged — it has no business mounting node file systems, so it doesn't have to be. Before actually doing the scan, the pod pre-fetches the Kubernetes API objects it will be looking at — ConfigMaps, Secrets, and so on — dumps them into a known location, and then runs the OpenSCAP scanner against that location. Next slide, please.

So that's what the scans do: they more or less wrap OpenSCAP and perform the actual scanning. One level up above the scans are the suites I mentioned before. First, they group scans together: if you're scanning a typical default installation of OpenShift, you'd have one scan for the master nodes, another scan for the worker nodes, and a third scan for the Kubernetes API objects, and they'd probably all be listed in one ComplianceSuite. So there's the list of scans.

The suite also exposes the aggregated status of all its scans — aggregated meaning that if you have three scans and one is already done while the other two are running, the suite's status shows running, and it doesn't switch to done until all of them are done. And the suite exposes some sugar around actually running the scans. Two things are worth mentioning: one is the schedule — with the suite you can say the scans should run periodically, via a Kubernetes cron job. The other is that in the suite you can say you want to just trust us, and have all your remediations applied automatically, closing the compliance gaps for you. That can be useful, but it should probably be used only once you've verified that all the settings are actually okay in your environment.
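You normally let the binding generate the suite for you, but to make those two knobs concrete, here's a rough hand-written sketch; the scan spec fields and the content image are illustrative:

```yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ComplianceSuite
metadata:
  name: example-suite             # hypothetical name
  namespace: openshift-compliance
spec:
  autoApplyRemediations: false    # the "just trust us" switch -- vet your settings first
  schedule: "0 1 * * *"           # re-run the scans via a cron job, daily at 01:00
  scans:
    - name: workers-scan
      scanType: Node
      profile: xccdf_org.ssgproject.content_profile_e8   # raw OpenSCAP identifier
      content: ssg-rhcos4-ds.xml
      contentImage: quay.io/complianceascode/ocp4:latest  # illustrative content image
      nodeSelector:
        node-role.kubernetes.io/worker: ""
```

The status of this object is where the aggregated phase shows up — it stays at running until every scan in the list is done.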
Going up from the scans and the suite, there's the ScanSettingBinding. That's the abstraction that lets you generate the suite and the scans without having to type out all the OpenSCAP details. So instead of typing the identifiers OpenSCAP expects — which look like xccdf_something_rhcos and so on — you can just say: I want to scan this group of machines using the RHCOS Essential Eight profile. The ScanSettingBinding generates the ComplianceSuite for you, the ComplianceSuite then generates the scans, and so on down. And as I said earlier, some ProfileBundles are created by default, so in the easiest case you just reference the objects that are already there for you — you don't have to set anything up. Next slide, please.

Okay, a little more — the lights keep going off here, so bear with the camera for a bit. As we said before, there are two kinds of results. There are the compact results, which are small enough to be put into etcd. And there are the raw results, stored in a format called ARF — the Asset Reporting Format. These results are very often required by auditors, or at least auditors are used to them, and there exist third-party tools that visualize and correlate such results. The ARF result, while useful, is huge — think tens or even hundreds of megabytes of XML. That's far too big to put into etcd in a ConfigMap or any other native Kubernetes object; it needs to be stored somewhere else. What we did is: once the ARF result is created, the scanner pods upload it to a persistent volume. As a user, what you'd do once the scan is done is spawn a pod that mounts the persistent volume and copy the results somewhere else, for viewing in whatever third-party tool you might have.

All right, with all that said and done, now we can try it out. We're not going to do a live demo, even though this is a live session — this time we actually have a recording for that. Let me try to copy it here... oh, I could not copy it there — yeah, there you go, you can see it. Oh, I messed it up; let me redo this. Right, hopefully you folks can see the whole thing. Asciinema has been a little difficult, but okay, we can see some things.

So the first thing you already see is that some profiles come with the installation of the operator. What we're going to do is check out one of those profiles — this one is the Essential Eight one. There are some extra fields that come from kubectl, but that's just what you get.

"Is it possible to make it larger?" Okay, let me give that a try — wait up, I think I can make it larger, but I'd need to switch to the other screen. Yeah, I'm going to switch to the other screen and hopefully that helps... there you go, let's give this a try. What about this — is this any better? "The font could be bigger." Yeah — what about this, though?
Right, maybe we can continue with this one. So — the stuff you already saw: some profiles that are already there. Not much difference... all right, well, that's a bummer. In this case I'm just checking out the profile; as you can see, there's a whole list of rules enabled in it.

The next thing to do is check out one of the rules, just out of curiosity. In this case we're looking at the rule for the kernel's kptr_restrict setting, and it gives you some metadata, as I mentioned — the rationale, for example — and there's a fix included over there.

And yeah, let's take it into use. The first thing we want is to put some settings in place. In this case we're just going to allocate one gigabyte for the storage — it's a very small cluster; on bigger clusters you'd of course need a lot more, and there is guidance for that — and we're going to run this every day at one in the morning, for workers and masters.

Finally, we just apply: we want to comply with the e8 profile, with my settings, and we let the operator do its thing — which is exactly what it's doing. It's running the profile, and until we get a result it stays in the running state. Then it's finally aggregating the results, which means it goes through all of the nodes in the cluster and checks the results: do I have inconsistencies, or is everything A-OK? And after aggregating, we should see a result fairly soon.

In this case, of course, this was an out-of-the-box cluster, so it was not compliant, unfortunately. And we can see exactly what failed: there are a lot of audit rules that were not set — they're not default, so those can be easily fixed — and some SSHD parameters were missing as well. Since there are a lot of them, we can also just filter the failed results with a label, which makes life a little bit easier. And you can view an individual result and it'll tell you, more or less: is this a bad thing? How bad is it — medium or high severity? — plus a little more metadata about it.

Here we're also checking the remediations; none of them have been applied yet. In this case the remediation is a MachineConfig — as I mentioned, the MachineConfig is an OpenShift-specific object that lets you configure the operating system. We try to apply the remediation, and as you can see, the MachineConfig object got created. That's about it as far as remediations go — there was the creation, and you can see the MachineConfig object in there.

And finally we want the raw results. Those live in persistent volume claims in the cluster, so one thing we can do is create another pod to fetch them — or actually I have another utility that makes that a little bit easier; a sketch of such a fetcher pod follows below. But yeah, this is just the basic flow of what you'd do day to day when running the operator. I'm sorry it went a little fast — it was a little difficult given that I couldn't maximize the window. So that's it over there.
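Since there's no built-in viewer for the raw ARF files, a simple approach is a throwaway pod that mounts the scan's PVC so you can copy the files out. This is a hypothetical sketch — the PVC is typically named after the scan, but check `oc get pvc` in the operator's namespace:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: raw-results-extract        # hypothetical helper pod
  namespace: openshift-compliance
spec:
  containers:
    - name: pv-extract
      image: registry.access.redhat.com/ubi8/ubi   # any image with a shell works
      command: ["sleep", "3600"]                   # keep the pod alive while copying
      volumeMounts:
        - name: raw-results
          mountPath: /raw-results
  volumes:
    - name: raw-results
      persistentVolumeClaim:
        claimName: workers-scan     # PVC created for the scan
```

With the pod running, something like `oc cp openshift-compliance/raw-results-extract:/raw-results ./results` pulls the ARF bundles down for whatever auditing tool you use.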
Now we're going to talk about some challenges in building the operator itself. One of them, as mentioned, was the result server — the server that actually receives all the raw results. The issue was that initially we weren't even aware of the etcd size limit until we got near it, so we had to write the result server to begin with. Then we ran into the issue that not all of the default volume types support ReadWriteMany: a lot of them — for example, the default in Amazon — are ReadWriteOnce, which only allows the volume to be mounted on one node at a time. That got a little tricky, so you have to be careful there. And that's why we needed a single result server, as opposed to having all the scanning nodes just write to a shared volume — that would have been way easier, but it wasn't possible at the time.

Another thing is that now that we push the results to a single result server, we have to do it securely, because these are potentially security findings. So we have an ephemeral PKI that's created just for each scan: the result server only accepts results from pods belonging to that scan's nodes. That was another thing we did. Jakub will explain the next challenge.

Right. So, as I alluded to earlier, when you're scanning the nodes you might potentially be scanning hundreds or thousands of them, but you want the result represented as a single object. That sort of assumes that all the nodes would have identical results — which should normally be the case, since you'd scan the nodes in a single MachineConfigPool and they should be identical. But they might not be. So what if one of the nodes is different? Next slide, please.

If one of the nodes is different, it might be because an admin just did an oc debug or SSH session to the node, ran vi, and changed a file. Or it might be worse — it might be a break-in attempt. Either way, we want to direct the attention of the admin, or the compliance officer running the scan, to this issue and make them aware. So in this case, regardless of anything else, we always flag the result as inconsistent. And because there can be many nodes, and we want to make it easy to find where the inconsistency is, we also try to find the most common state among the nodes and flag only the outliers. This is all visible as Kubernetes labels — I forget the exact names, but they're in the docs — so once you have the full set of results, you can filter by label and see: okay, this node is different from the others, this result is inconsistent, and so on.

Finding the most common state isn't always possible — what happens if you have just two nodes, and one has one state and the other has the other? But when it is possible, we try to find the common state, and where we can, we make it so that you can converge from the inconsistent state by applying a remediation. That's also not always possible — in case one of the nodes skipped the check completely, we have no idea what its state is — but whenever possible, we generate the remediation, and once you apply it, you get back to a consistent state.

Okay, the other issue we dealt with was content updates. The remediations, as you maybe saw in the demo, are stored as Kubernetes objects — for node updates, they're MachineConfig objects. And while you can patch an object — calling oc patch or kubectl patch — applying a remediation still replaces the whole thing, the whole MachineConfig or the whole ConfigMap; with a MachineConfig, that means the whole file is replaced. But what if the remediation needs to be updated? Maybe we messed up the remediation and actually need to set different file contents, or the package whose configuration we set has been updated or rebased and its defaults have changed. In that case, we need to push out a new remediation. And because it's probably not a good idea to just apply it automatically without letting the admin know, what we do in this case is flag the previously applied remediation object as outdated. In the object — if you dump it with oc get -o yaml — you see both the current payload, the one applied to the cluster now, which represents, say, the file contents on disk, and the new version of the remediation. This gives you the opportunity to review and revisit the object and apply the new settings at your own pace, once the contents have been vetted, tested, what have you.
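Roughly, a remediation with an update pending looks like the sketch below — a simplified reading of the CRD, with the name illustrative and the MachineConfig payloads elided:

```yaml
apiVersion: compliance.openshift.io/v1alpha1
kind: ComplianceRemediation
metadata:
  name: workers-scan-sysctl-kptr-restrict   # illustrative name
  namespace: openshift-compliance
spec:
  apply: true                 # set by the admin (or by autoApplyRemediations)
  current:
    object:                   # the payload applied to the cluster right now
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      # ...file contents as currently rendered...
  outdated:
    object: {}                # populated when newer content ships a different fix
status:
  applicationState: Outdated  # review, then adopt the new payload at your own pace
```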
Okay — when is this going to be released? The operator will be part of OpenShift 4.6. As far as getting it goes, it will be released on OperatorHub, like many other operators. But the important thing to note is that that's just the operator: it doesn't mean we'll release all the content at the same time, so that people can immediately make their clusters compliant with all the different security standards. The content will come a bit later, and the content releases are designed — reworked, really — to be asynchronous, not dependent on the OpenShift or operator schedule; they're completely decoupled. And in 4.6, as far as the nodes go, only RHCOS will be supported — other node operating systems are a difficult issue right now.

So now we're going to go through some frequently asked questions we've gotten over time while developing this operator. The first one has been: why are you using OpenSCAP and ComplianceAsCode — isn't that old technology, and so on? Well, in reality, it's a standard, and it's something people already have tooling and automation built on top of. More to the point, there are auditors who already have setups that let them easily browse and consume results and checklists in this format. So it just makes life easier, both for the people in the field and for us, because the ecosystem is already there.

Another reason is that there's already a community around it. It's not actually used just for OpenShift: there's content for Ubuntu, there's content for macOS, and there are other projects — OpenStack, for example — that keep security content there. We want to enable and enhance those communities, so we decided: hey, let's tag along with this project and take it into use — and hopefully, at some point, provide more value back as well.

Another question we've gotten quite often: why didn't you use OPA, the Open Policy Agent, which is a CNCF project? The fact is that even though OPA and the Compliance Operator both evaluate policy, they're very different projects. Open Policy Agent is a policy engine: it lets you evaluate policy in an abstract manner, and in Kubernetes you'd most likely use it for your admission controllers — it lets you make authorization decisions. The Compliance Operator evaluates compliance policy, which is very different: what we want to answer is, do I comply with this framework? Give me a report on it and help me get there. So in reality, they're quite different projects. And we do see a world where the Compliance Operator could check that certain rules OPA expects are present in the cluster, and thus help you comply using those rules. So we view them as projects that can coexist, converge, and give each other value.
They're not mutually exclusive; we just do different things.

We've also gotten a lot of questions about whether you can use this on earlier versions — like 4.3 — and unfortunately, the answer is no. Partly it's testing, given the bandwidth we have, but mostly it's that we're using somewhat newer APIs that didn't exist in 4.3. So that's unfortunately not going to be possible.

What about other node operating systems? Even though the design we have is fairly generic, and you could conceivably run content for other systems, the fact is we've mostly tested on RHCOS, so that's what we support.

And finally: does the Compliance Operator make us compliant? The answer, again, is no. There's a lot of stuff you can automate with the Compliance Operator, and as Jakub mentioned, some content is still being created — but even when that content is ready, the fact is there are things you're going to have to do yourself. For example, we can't force you to use a specific identity provider, or to enable two-factor authentication in that identity provider. That's up to you. That's just one example, but there are many more. So there's more to compliance than just an operator.

And what's next? There's still a lot of work to do — mostly content, though, so we're going to be very busy writing content. We're looking into the CIS benchmark, we're looking at more content for FedRAMP, and there are more profiles out there, so we'll be doing a lot of that. Another thing is that we're working a lot with the Advanced Cluster Management team to enable the Compliance Operator there. We're already able to create policies with RHACM that trigger policies from the Compliance Operator and feed results back to RHACM, so you can say "I want all of my managed clusters to be Essential Eight compliant" and get results for it. That's possible today, but it's not very granular: it tells you pass or fail, but not exactly which checks failed. That's something we're working on, so it'll come in the future. And finally, we'd also like to be deployed by default by RHACM, which would be great, but we're not there yet.

And that's it. If you have any questions, we'd be happy to answer them. Thanks a lot for joining our talk — I'm sorry the demo got a little messy, but we really appreciate you folks being interested in this. Thank you so much. We'll stick around for questions, if there are any.

We do have a question right now, from David Duncan: do we have any specific compliance targets in mind for the first release? Right — we're going to be looking into CIS and FedRAMP. FedRAMP is a huge, huge compliance benchmark, so that'll take a while, but CIS is not as big, so that's what we'll look at first. Unfortunately, that's not going to be available in the very first release. But as we mentioned in the presentation, you can get content updates out of band: any time there's newer content, you can just fetch it with the operator. It'll be fairly seamless to that extent.

Let me keep an eye on the time — I think we don't have much left. There's a link to a breakout room where we can continue the discussion if more Q&A questions come up, or you can continue discussing in this live chat, whichever you prefer.
Thanks a ton for your time, folks. This was really interesting stuff.