Good afternoon everyone. Welcome back from the lunch break and welcome to this talk, which is about misconfigurations in Helm charts and how far we are from automated detection and mitigation. And today with me there is Agathe. Hello everyone. My name is Agathe Blaise. I'm a research engineer working at Thales, and I'm mostly working on the virtualization of network systems and the security of such systems. And I'm Francesco, and I'm doing a PhD at the Free University of Amsterdam. A very brief advertisement before we dive into the technical content: this work was part of a collaboration in a European project. AssureMOSS just finished, and there is a new project; I put the link there. So we're always looking for collaborations, especially with companies. If you're interested in what we are doing, please check it out. So now, diving into the content of this talk, which is misconfigurations in the cloud: why should we care in the first place? Well, because recent reports found that such misconfigurations can be the cause of several security incidents. And because the containers that actually run your applications usually run for a very limited time, the runtime window for detection and mitigation is short. So at the beginning of this work we said: because we have this very limited time window at runtime, let's see how much we can improve before reaching runtime, at static time. The other reasons were that there are a lot of tools that can analyze configuration files, Helm charts as well; however, there are inconsistencies between the outputs of such tools, and there is no indication about whether a configuration might break the functionality of your application or not.
And because misconfigurations can cause bad things, so to say, there are frameworks available that give a list of best practices and security recommendations that you should follow in your configurations, for example the CIS Benchmarks or the NSA and CISA Kubernetes hardening guidance. They suggest, for example, that you should not run your container as root, that you should minimize the Linux capabilities you assign, and so on. Looking at an example: on the left of the slide, you can see a snippet of a YAML file that can be used to deploy a Deployment resource on a cluster. As you can see, there are some misconfigurations that tools are able to detect. For example, there are no resource constraints for this Deployment, which is also running as root; it's not using a read-only file system; and it can escalate, meaning it can gain more privileges than the ones it started with. Using tools, you can actually detect these issues and eventually fix them. Some of them are quite easy to fix: it's a matter of flipping from false to true or the other way around. Others require you to actually write a new piece of configuration, for example to add limits and requests for the memory and CPU of your application. So you can define, for example, memory equal to 64 megabytes, which seems reasonable. However, whenever you are asked to add a new piece into the configuration, you can also ask a lot of what-if questions. For example, what if we define memory equal to "Alice", or what if we define memory equal to zero? Will this new configuration still be accepted by a tool, and will we still be able to deploy it on the cluster? So this was our starting point for this work. We then selected several open-source tools available out there, seven of them; most of them are part of the CNCF. And as you can see, they all have a different number of policies available by default.
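The slide itself is not reproduced in this transcript, but a minimal sketch of a Deployment with the same four issues, followed by the fixes just described, could look like this. All names and values here are illustrative placeholders, not the actual slide content:

```yaml
# Illustrative only: a container spec with the four misconfigurations above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical name
spec:
  template:
    spec:
      containers:
        - name: app
          image: example:1.0   # placeholder image
          # missing: resources.requests / resources.limits
          securityContext:
            runAsNonRoot: false              # runs as root
            readOnlyRootFilesystem: false    # writable root file system
            allowPrivilegeEscalation: true   # can gain extra privileges
---
# ...and the same container after the fixes: booleans flipped, resources added.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: example:1.0
          resources:
            requests: { memory: "64Mi", cpu: "250m" }  # what if this were "Alice" or 0?
            limits:   { memory: "64Mi", cpu: "250m" }
          securityContext:
            runAsNonRoot: true
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
```

The `resources` block is exactly the kind of "new piece of configuration" where the what-if questions arise: nothing in the schema alone says whether a tool will reject nonsense values there.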
The numbers are only the default policies that can be applied on configuration files or Helm charts. To evaluate these seven tools on Helm charts, we built an automatic pipeline. The pipeline has six steps, as you can see. It's freely available, so you can scan the QR code or follow the link. I will show a brief demo of the pipeline, so you can check it out and also try it, and I will explain the steps briefly. There is a step zero: mapping the policies of the tools. I have a slide later for this. Regarding the pipeline: the first step takes as input a configuration file. It can be a Helm chart, but also a YAML file that you have written manually. It runs a tool to find what misconfigurations there are in this file. In the second step, we automatically fix all these misconfigurations. The third step is just a debug step, to make sure that the fixing was actually done correctly: the output of the first tool should now be zero, no remaining misconfigurations. In step four, we generate a functionality profile. Agathe will explain this later in more detail, but basically we generate a list of the functionalities needed by your application. In step five, we add back those functionalities that could have been removed during the fixing, so it's an update of the configuration file. And finally, we run another tool to find the remaining misconfigurations. So I will show briefly what the pipeline looks like in practice. I hope you can read it. Yes. We implemented it as GitHub Actions. This is our GitHub repository, and as you can see, there is the list of folders; every folder is a Helm chart. However, we could also take as input a YAML file which is not generated from a chart. Then we define the pipeline as actions, so we can move to the Actions tab and run a new workflow that will allow us to scan the file with two tools and see what misconfigurations there are. And to do this, we can specify the file.
In this case, we use the MySQL chart and Checkov as the first tool. Then we can specify another tool, for example KICS, and we run the workflow. We reload the page. Yes. And now it is here, running. You can also configure this to be executed automatically, so that the workflow runs whenever there is an update in your configuration, to avoid this manual interaction. Without waiting for it to complete, we can look at a previous run of the same workflow with the same output. Here you can see the same steps that were in the slides. Zooming in: there is step one, running Checkov in this case; fixing the chart; running Checkov again as a debug step; adding the functionalities needed; and finally running KICS. We can further inspect the output. For example, for the MySQL chart, Checkov found 21 misconfigurations; among them, the default namespace is being used, which is a very well-known bad practice. So you get an output like this, which of course you can also save locally and inspect further. There is step two again: we fix all these misconfigurations, in this case all 21. Then we run Checkov again, just as a control, and we can check that. So now Checkov ran, and here you can see there are no more misconfigurations. However, by removing all the misconfigurations, we could also have removed some permissions that are actually needed by MySQL. For example, because it's a database, you can imagine that a read-only file system will not work in this case. That is what we do at step five: we compute the permissions needed and add them back into the configuration. Then at step six we run, in this case, KICS, so we can also inspect the output of this other tool. KICS actually finds additional misconfigurations, for example missing storage requests and also the read-only file system, which, however, we cannot remove because it's actually needed by the container.
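The demo can be summarized with a workflow sketch. This is not the project's actual workflow file; the trigger names, input names, and steps below are assumptions made for illustration, with only step one shown concretely:

```yaml
# Hypothetical sketch of the demoed workflow; inputs and steps are illustrative,
# not the real pipeline's definitions.
name: scan-chart
on:
  workflow_dispatch:          # manual run from the Actions tab, as in the demo
    inputs:
      chart:
        description: "Helm chart folder or YAML file to scan"
        required: true
      first_tool:
        description: "Tool for step 1 (e.g. checkov)"
        required: true
      second_tool:
        description: "Tool for step 6 (e.g. kics)"
        required: true
  push:                       # or run automatically on every configuration update
    paths:
      - "charts/**"
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Step 1: find misconfigurations in the selected chart.
      - run: checkov -d "${{ github.event.inputs.chart }}"
      # Steps 2-6 (fix, debug scan, functionality profile, re-add, final scan)
      # would follow as further steps here.
```

The `push` trigger is what makes the pipeline run automatically on every configuration update, avoiding the manual interaction shown in the demo.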
Switching back to the presentation. Okay. Step zero, which I mentioned at the beginning, happened before we built the pipeline. It was needed because all these tools have different policy types and definitions, so we needed a standardization of all the policies from the different tools. Also because there are some policies in common: some policies are equivalent in their description, but they actually check for different configuration keys in practice. And there are also different output formats, so if you just want to count how many misconfigurations you have, you already have to adapt to the output of each tool. The table shows you some examples of what policies are in common between all the tools, then policies in common between two of them, and so on. And now we leave the floor to Agathe to explain a bit more about the functionality. Yeah. So Francesco just introduced the different policies that are run by checker tools like Checkov and Datree. Each of these tools outputs hundreds of recommendations to fix the Helm charts and make the configuration more secure; for example, that you should run this container as non-root. However, we noticed that every tool has a different output format, and therefore this output must be manually parsed by administrators. Sometimes the output can also be quite long to parse, with several thousands of lines. Another issue is that removing a permission may break functionality. For example, if we consider the Falco monitoring tool, it actually needs access to the host network and needs to be privileged, even if that is not secure. So it will raise many alerts from checker tools like Checkov, but these permissions are actually needed, because it has to analyze the network traffic and the system calls from the containers. So we had to design a tool that will automatically identify the minimal set of permissions that is needed by the chart to function.
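One way to picture the step-zero standardization is as a small catalog that keys every tool's rule to one canonical policy. This is a hedged sketch of what an entry might look like; the schema and all the tool rule IDs below are placeholders, not the tools' real identifiers:

```yaml
# Hypothetical schema for the step-zero policy mapping.
# All rule IDs are placeholders, not real Checkov/KICS/Datree identifiers.
- canonical_id: run-as-non-root
  description: Container should not run as the root user
  config_keys:                       # keys actually checked, which can differ per tool
    - securityContext.runAsNonRoot
    - securityContext.runAsUser
  tools:
    checkov: CKV_PLACEHOLDER         # placeholder ID
    kics: query-id-placeholder       # placeholder ID
    datree: rule-name-placeholder    # placeholder ID
```

Listing the `config_keys` per tool is what surfaces the inconsistency mentioned above: two rules with equivalent descriptions can still check different keys in practice.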
That is, the permissions that we cannot remove, because otherwise the application will not function anymore. So I will now show you the tool that we developed. It's named the functionality oracle, and it finds the minimal set of functionalities that are needed by the application to function, following the principle of least privilege. As a first step, I choose one Helm chart; it could also be a Kubernetes manifest, a JSON or YAML file that you designed for your application. I run it in its default configuration, with the permissions granted by default as defined in the Helm chart. This gives us what we call the ground-truth container, because it contains the given permission. In parallel, I consider a list of permissions: the permissions that are output and recommended by checker tools. For example, you can see there is one recommendation to set the user as non-root, another to remove a given Linux capability, and another to disallow privilege escalation. The second step is to update and run the pod configuration without one given permission; for example, I set the user to run as non-root. So this is step two. During step three, I recover some indicators from the ground-truth container and the test container. First of all, I collect and clean the stream of logs from the ground-truth container, with the given permission, and from the test container, without the given permission. For the test container, I also recover the status of the pods, the containers, and the probes. This is step three. As a final step, I test the functionality of the pod without the given permission. This is step four. I have three functionality test cases. First there is TC1: TC1 checks that the container can start without any errors. Then, if TC1 is okay, I check TC2.
During TC2, I look at the liveness and readiness probes of the pod and check that everything is okay. Finally, during TC3, I semantically compare the logs from the ground-truth container and from the test container and verify that the application behaves in the same way. If these three test cases, TC1, TC2, and TC3, are okay, I deduce that the permission was not needed in the end. If one of these test cases fails, I flag the permission as needed: I cannot remove it as recommended by checker tools. I repeat this process for each permission and input in the chart, and finally I output what we call the functionality profile, which you can see on the right. It's a JSON document where, for each permission, I say either that this permission can be removed and removing it will make the configuration more secure, or that I need it for the application to function. Now some examples of the functionality test cases, drawn from the dataset of 60 Helm charts that we analyzed; I will give several examples for each of the test cases. The first one, remember, is to check whether the container can start. In a Kubernetes manifest or a Helm chart, most of the time we define a startup command along with the container configuration, and we check that this startup command runs successfully and without errors. One example in our dataset was the PLM chart: the startup command consists of sending a curl request to its web server, and this command runs successfully only if the web service is up. So it's one way to verify the functionality. Then, if this step is okay, I go to TC2, and this time I check that the container resource is in the running state. Specifically, we look at the pod or the deployment container and check that it is in the running state, and especially we look at the liveness and readiness probes. They are very useful because they enable us to test basic functionality of the application.
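A probe of the kind TC2 relies on can be sketched as follows. This is an illustrative fragment, not copied from any real chart, and the health endpoint and port are hypothetical:

```yaml
# Sketch of a TC2-style readiness probe (illustrative; endpoint and port are
# hypothetical). The pod is only marked Ready if the application answers, so a
# removed permission that breaks the service also fails this probe.
readinessProbe:
  httpGet:
    path: /healthz           # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 5     # give the application time to start
  periodSeconds: 10
  failureThreshold: 3        # three consecutive failures mark the pod not ready
```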
And in most cases, they are already defined in the Helm charts that we used. For example, the MySQL Helm chart has a readiness probe that uses the ping functionality of the mysqladmin executable. With this command, I automatically check that I can access the MySQL service and that it is accessible with the root password. Finally, if these two test cases are okay, I compare the logs. First, I clean the logs, meaning that I remove any numerical value and any context-related information. Then I semantically compare the logs to see if the application behaves the same way as it does with the permission. In addition, I also check that the logs do not contain any keyword from a preset blacklist, like "permission denied", "operation not permitted", or "error". For example, we had a case with the Datadog Helm chart. For this one, the container could start correctly, and the liveness and readiness probes were actually okay. But when looking at the logs of the application, we can see what happens when we modify the UID value, which prevents the container from accessing host files. When I modify this UID value, I can see a "permission denied" error, because we cannot change the owner of the auth token file. So this is one example. Now the output of the tool: it's called the functionality profile. Here is an example for the Falco Helm chart. You can see the name of the chart, and the container name was falco; it was the main container. And we have a list of functionalities; here we have only one example, func21, but we have hundreds of these checks. In this example, we check that the container should not be privileged, as recommended by most checkers. But we can see the value is set to false, because our tool, the functionality oracle, flags the permission as required; otherwise, the container won't start. We can see that TC1 actually failed.
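The Falco entry just described can be reconstructed roughly as follows. The field names here are illustrative, not the tool's exact schema, but the content follows what the talk describes:

```yaml
# Hedged reconstruction of the functionality profile entry described above;
# field names are illustrative, not the functionality oracle's exact schema.
chart: falco
containers:
  - name: falco                # the main container
    functionalities:
      - id: func21
        check: "container should not be privileged"   # recommended by most checkers
        applied: false         # the hardening could NOT be applied...
        required: true         # ...because the oracle flags the permission as needed
        failed_test: TC1       # without it, the container does not even start
```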
I will now show some findings on the dataset of 60 Helm charts that we used. Here are the names of the charts that required the highest number of functionalities. In particular, we find Falco with eight permissions, Longhorn with seven permissions, and finally RabbitMQ and Promtail with five permissions needed. For example, for Falco, we cannot change the UID value, otherwise it will fail, and it's the same for all of the other charts. The same happens for changing the GID value. We can see also that the Longhorn pods need to be privileged, and that the Falco and Longhorn Helm charts need to allow privilege escalation. Also, for each of these charts, you can see the number of Linux capabilities that are needed. By default, containers are granted 14 capabilities, and we checked with the same process which Linux capabilities were actually needed for the chart to function and which of the 14 we can actually remove. On average, for one chart, 2.21 permissions were actually required per container. This is a good result, because it means we can actually deny many permissions to make the configuration more secure. In terms of Linux capabilities, for non-privileged containers we see that on average only 1.16 capabilities are needed out of the 14 granted by default. Privileged containers are granted 41 capabilities by default, and on average only four of them were needed. Another result: we asked ourselves which functionalities are needed most of the time. You can see the results on the left. The first one is changing the UID to a high value: to prevent containers from accessing host files, most checker tools ask us to use a high UID value, but this actually makes the configuration fail for 26 containers; the application will not work anymore. Then we have the same for using the root user, for making the file system read-only, and for unmounting the service account token.
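In practice, these numbers translate into deny-by-default configurations: drop everything and add back only what the functionality oracle flagged as required. This fragment is a sketch, not output from the tool, and assumes a container whose only needed capability is NET_BIND_SERVICE:

```yaml
# Sketch of applying the findings to one container (illustrative values):
# deny everything by default, then re-add only what the oracle flagged as needed.
spec:
  automountServiceAccountToken: false  # pod-level; keep true only if the app needs the token
  containers:
    - name: app
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]                # remove all default-granted capabilities
          add: ["NET_BIND_SERVICE"]    # assumed: the only capability this app needs
```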
We also saw that there is a limited set of permissions that breaks functionality often, so this can streamline the process: first focus on this limited set. And we did the same on the right for the Linux capabilities that are most likely needed for the application to function; the top one was the NET_BIND_SERVICE Linux capability, which is required by seven containers. I will now leave the floor to Francesco for the rest. Yes. For the misconfigurations instead, these are the 10 most common misconfigurations that we found, again on the dataset of charts from Artifact Hub. Of course, if you run it on your own configuration files, the results can be different, but we found that dangerous cluster roles, missing memory limits, and using the default namespace are the three most common misconfigurations. In terms of tools, we also found that these seven tools perform significantly differently; on our dataset, Datree was the one that found the most misconfigurations on average. KubeLinter actually has no policy that breaks any functionality, and finally KICS is the one that found the most remaining misconfigurations, which means it is the tool with the most unique policies: it was the one finding a lot of misconfigurations that other tools could not detect. So, before we conclude, a list of recommendations from us to you when you deal with configuration files. We really recommend scanning your configurations, but first, you should create a framework to standardize the policies of these tools, because there are several inconsistencies, and if you don't do that, you can have false negative results that are not shown in the output.
So first, standardize the policies of these tools, and also have a clear definition of what is a misconfiguration in your environment, what is a mitigation for it, and what is instead a functionality, because what is a misconfiguration in my environment can actually be a functionality in your system. We also recommend using more than one tool, and defining custom policies, exactly for the reason that every environment is different. And finally, as Agathe was also showing before, there is a very limited set of functionalities that you need on average, so we recommend you start from those and then move on, instead of looking at the whole set. So, last slide. Answering the question in the title of this talk: how far are we from automated detection and mitigation with configuration files? I think we are still pretty far, unfortunately, because we found that such tools still require quite a significant amount of manual work, both to understand the output and, mostly, to actually remove the misconfigurations. We also found inconsistencies, in the sense of configurations that satisfy one tool but do not satisfy another. And finally, some false positive and false negative results from the tools. So there is a long way ahead, but I'm optimistic. I think standardization is really what we need as a first step, and as a community we can start from there and have more secure configurations. That was it. Thank you for listening and joining.