Hey everyone, welcome to my talk about securing Kubernetes applications by crafting custom seccomp profiles. I'm Sascha and it's a pleasure to be here today. What do we want to see in this talk? The first thing I would like to tell you about is a brief history of seccomp in Kubernetes, so we can reflect on the development progress and see what the current state of seccomp in Kubernetes is. After that, we will craft a custom seccomp profile by hand. For this we will use a real-world example, and we will use two methods: tracing the audit logs and recording seccomp profiles by using eBPF. After that, we will speak about how we can automate away those manual efforts, so how it could be possible to integrate the recording into a CI/CD system, for example, and how we can get rid of all the manual steps in between. And after that, we will speak about the bright future of a per-default more secure Kubernetes, so how we can make Kubernetes more secure by default by using seccomp profiles, for example. Seccomp is a syscall interception feature of the Linux kernel. It works like this: an application wants to execute a syscall, and we can decide which action to take for that syscall. There is a list of actions available; we can, for example, say that we only want to log the syscall, that we want to error out, or that we want to allow the syscall. This can boost application security by limiting the set of allowed syscalls. We can maintain a list of allowed syscalls, we can also maintain a list of blocked syscalls, and we can fine-grainedly define what the error code should be in case a disallowed syscall is called. Seccomp support was added to Kubernetes a long time ago, so we also have a default security profile.
This default profile is defined by the container runtimes, but Kubernetes requires that such a profile exists in every container runtime like containerd, CRI-O and Docker. Seccomp went GA, which means generally available, in Kubernetes 1.19, so we can consider this feature stable since quite some releases, and it is supported by most Linux versions. There are some constraints, or some environments, where seccomp may not be supported, so we also have to take into consideration that we may have, for example, an architecture or Linux distribution which doesn't support seccomp at all. In general, the seccomp fields are usable via a native field under the security context, or via a deprecated annotation. This can be done on the pod level, where we provide profiles to pods which are then inherited by the containers, or we can specify those profiles on a container level. The overall goal is to remove the annotation support in Kubernetes 1.25. I also have to mention that all workloads run unconfined by default, which means that seccomp is disabled for them. There is a special feature introduced in one of the recent Kubernetes versions, called SeccompDefault, which allows us to change that behavior, for example by applying the default profile to all workloads on a specific node. I have to mention that this feature is alpha for now, so it's not enabled per default. One drawback is that the default profiles may differ between container runtimes, so if we have clusters with mixed container runtimes, the overall behavior may differ between one node and the other. Custom profiles can be defined as JSON files, and there are two main issues with that: first of all, they have to be distributed to all nodes to be available for the whole cluster, and then the container runtimes have to apply them from disk.
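As a quick sketch, the native field mentioned above looks like this in a pod manifest; the pod, container, and profile names here are placeholders, and a Localhost path is resolved relative to the kubelet's seccomp directory:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example                # placeholder name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault     # use the container runtime's default profile for the pod
  containers:
    - name: app
      image: example-image     # placeholder image
      securityContext:
        seccompProfile:
          type: Localhost                        # load a custom profile from the node's disk
          localhostProfile: profiles/my.json     # relative to /var/lib/kubelet/seccomp
```

The container-level field overrides the pod-level one, which is how a single container can get a custom profile while its siblings keep the runtime default.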
So we don't have an automated way in Kubernetes to distribute those profiles to each node and then load them from disk. How can we craft custom seccomp profiles by hand? Our overall goal is to understand how seccomp profiles work and which possibilities they allow, and then how our application behaves and which syscalls it executes during runtime. We have to collect the list of syscalls which are required to be allowed. And we have to take the whole cluster setup into account to not create too restrictive profiles: different architectures may require different syscalls, and different architectures also allow different syscalls. The workload configuration has an influence on the executed syscalls as well, which means that configuring a workload differently, for example by setting some options in a configuration file, may lead to executing different code paths, and this will automatically result in different syscalls, which has a direct influence on the seccomp profile. The example project we choose today is kube-rbac-proxy, which is basically just an HTTP proxy, but it can perform RBAC authorization against the Kubernetes API. This allows us to restrict requests to the API even in network-isolated environments, and it was initially developed to protect Prometheus metrics endpoints, so it's possible to add an additional layer of security to usually plainly exposed Prometheus metrics. And it is a single-container deployment, which simplifies the syscall tracing. Having an application which consists of multiple containers or even multiple deployments, like a microservice architecture, would require us to create seccomp profiles for every single deployment and every single container, and this would probably be too much for this demo. I link the project here, so you can check it out if you want to learn more about kube-rbac-proxy.
So how do we actually record those syscalls? The first method we can use is tracing the audit logs. This requires auditd or syslog to be installed and configured on the system, which may not be the case across all distributions. There are distributions, like plain Ubuntu, which only have syslog, and there are other distributions which ship auditd well configured per default. Nevertheless, we have to consider that there is a rate limit, not only for auditd, but also for syslog. We can set the two sysctls here, the printk rate limit and the rate limit burst, to disable the rate limiting; otherwise, we will probably miss some syscalls while logging extensive applications. We mainly rely on one single Linux kernel function to record our syscalls, and this is audit_seccomp. You can see this function here, directly from the latest version of the Linux kernel. The only thing it does is create a log string containing a bunch of information, like the syscall itself and the architecture, and print that to the kernel buffer. But this comes with a bunch of limitations. For example, syscalls will only get logged if they were requested for logging, which means per-default allowed and also blocked syscalls won't be logged at all. Generally, the logging has a high performance impact. I tested it with a bunch of applications, and it really slows down the overall application, because it blocks every time it has to log a line. There is a special seccomp action available for logging; I already mentioned that there are multiple actions available. And creating a seccomp profile, which is a bunch of JSON, for profiling our application is as simple as one line: we just have to specify a default action, and this action is SCMP_ACT_LOG. We also have to double check that the keyword "log" is part of /proc/sys/kernel/seccomp/actions_avail.
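The logging profile mentioned above really is a one-field JSON document:

```json
{
  "defaultAction": "SCMP_ACT_LOG"
}
```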
Otherwise, we would not log anything at all. Now we have to put this profile into the default location where the kubelet looks for seccomp profiles, which is /var/lib/kubelet/seccomp, and we name our file log.json. For the example of kube-rbac-proxy, we change the seccomp security context of the deployment to specify the seccomp profile type Localhost and then point to the relative path, which is log.json. Now we can finally run our demo application and trace the /var/log/audit/audit.log on my system. We have to be aware that those files rotate at a configured size; this depends on the configuration of auditd, and auditd itself can also be configured to have a rate limit. We can double check that by running sudo auditctl -s, which prints us the configuration. The question would then be: how can we link the workload which is currently running to the actual output of the audit logs? We have a process name, for example, but process names are not unique on a Unix system, so we have to use the unique process identifier, the PID. With crictl we can look for our container on the local machine, like crictl ps, and then look for kube-rbac-proxy; this gives us the container ID. By using the container ID, we can run crictl inspect and then grep for the info.pid, and this gives us the process ID of the workload of the container. When we have that, we can obtain a list of syscalls. For example, we run sudo cat /var/log/audit/audit.log and then grep for the type SECCOMP, because auditd has multiple record types available but we are only interested in the type SECCOMP, and then we additionally grep for the PID. And we get an automatic resolution of the syscall name, which is written under the uppercase SYSCALL key here.
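The pipeline just described can be sketched in shell. Since we can't assume a live cluster here, this runs against one inlined sample record instead of the real /var/log/audit/audit.log, and the exact field layout of the enriched SECCOMP record is an assumption:

```shell
# One sample enriched auditd SECCOMP record (field layout assumed).
cat > /tmp/audit.sample <<'EOF'
type=SECCOMP msg=audit(1650000000.123:42): auid=1000 pid=4242 comm="kube-rbac-proxy" sig=0 arch=c000003e syscall=41 compat=0 AUDIT_ARCH=x86_64 SYSCALL=socket
EOF

# On a real node this would come from: crictl inspect <container-id> | grep info.pid
PID=4242

# Keep SECCOMP records for our PID and extract the resolved syscall name.
grep 'type=SECCOMP' /tmp/audit.sample |
  grep "pid=$PID" |
  grep -o 'SYSCALL=[a-z0-9_]*' |
  cut -d= -f2 |
  sort -u
```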
And if we sort those uniquely and also remove the newlines so I can print it out, then we will see all syscalls required for running the kube-rbac-proxy binary. For example, we can see we need bind, we need clone, we also need socket, because it will actually create a socket, and a bunch of other syscalls like listen. That's interesting, but we have only started the application so far, which means this list of syscalls reflects only the startup of the application, not its actual usage. So it's really important to gather the syscalls for all available code paths. What we do now is we run the example client, which will connect to kube-rbac-proxy and try to gather the metrics. Then we can see, if we collect the syscalls again, that a bunch of syscalls have been added to our list, for example connect and getpeername. This means that actually using the application will trigger the additional code paths, and this is crucial for the overall approach. What we can do now with this list of syscalls is create an allow list of permitted syscalls for this deployment. We use a default action of erroring out. We can also specify the errno return code if we want to, which is EPERM in this case, but we could also choose ENOSYS, for example. Then we specify the list of syscalls which are allowed. Disallowing everything and then having something like an allow list is always the more secure approach, rather than going the other way around. Then we add those syscall names to our list, and after that, we can use this profile as our new seccomp profile. This is something I would like to demo now. First of all, let's double check that my system is actually able to log anything by using auditd. And yes, the available seccomp actions are kill_process, kill_thread, trap, errno, and things like that.
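An allow-list profile of the shape described above has roughly this form. The syscall list is shortened to the few names we saw in the trace, and the field names follow the OCI runtime-spec seccomp format, where a defaultErrnoRet of 1 corresponds to EPERM:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "action": "SCMP_ACT_ALLOW",
      "names": ["bind", "clone", "connect", "getpeername", "listen", "socket"]
    }
  ]
}
```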
And, most importantly, the log action is part of it. What we can do now is create our profile, which is called log.json. This profile has to be copied into the /var/lib/kubelet/seccomp directory; otherwise it wouldn't work at all. If we now double check our deployment, we have to ensure that the security context of the container for the kube-rbac-proxy specifies a Localhost profile which points to log.json. This is already the case, so we can apply it and wait for it to be up and running, which it is now. Now it should already log some syscalls into our /var/log/audit/audit.log. If we retrieve the container ID by using crictl ps and export it like this, then we can also export the process ID by using crictl inspect like this, and the process ID should now be available. What we now have to do is grep our audit logs; we do it like this because we have multiple audit logs. Then we grep for the type SECCOMP, additionally grep for the process ID, and we can also double check like this that those entries are unique. And here we are: this is the list of syscalls we have right now. We also have to actually use the client. We have a client example available for kube-rbac-proxy, which creates a job, and this job runs to completion like this, so we can see that it actually works and that the authorization against kube-rbac-proxy has worked. If we now run the collection again, then we see that we have a bunch more syscalls, and those syscalls can be the base for our actual seccomp profile. So let's modify this log.json profile; for the example we have in our demonstration, the profile now contains 25 syscalls.
If we copy it into our profile.json, for example, and then modify the deployment to point to this profile.json, then we can reapply the deployment and verify that it actually works. We can see that it is now up and running by using our recorded profile. Let's add some thoughts on this overall approach. Creating profiles via the logs can be slow, especially when we consider using it in a CI/CD-based automation. If we consider having a huge suite of end-to-end tests which already takes hours to run, then we will probably slow it down even further by using the audit logging. All nodes have to be reconfigured to not rate-limit those logs. And gathering all application code paths is really the hardest part here: we have to ensure that we test all use cases accordingly. There is another way of doing this, by utilizing eBPF. eBPF is a technology which allows us to run code inside the kernel by loading it dynamically, and it supports a bunch of tracepoints. For example, there is the raw_syscalls:sys_enter tracepoint, which gets executed for every syscall on the whole system, even before the syscall actually runs. This provides us a basic mediation point for the syscall, but we have to be aware that it always runs in the global scope of the whole system, so we have to correlate the process information to the container, in the same way as we have to do it with the audit logging. We can use tools like bpftrace, which already allow us to collect the required data. bpftrace provides its own abstraction language on top of eBPF; for example, we can run bpftrace, select the tracepoint raw_syscalls:sys_enter, and pre-filter for the application name kube-rbac-proxy. What we then print is the PID and the syscall ID.
If we run bpftrace like this in parallel, or even before kube-rbac-proxy has been started, then we can start kube-rbac-proxy later on. When kube-rbac-proxy is torn down and we have run all our tests, we can grep for the PID in the output, which gives us a list of syscall numbers like here. So there is no automatic resolution of the syscall name; we only get the number valid on the local system. The reason is that those syscall numbers can change from system configuration to system configuration, namely with the architecture of the system. Now we have to convert those syscall numbers back into the actual names, which can be done by the ausyscall binary, which is part of auditd. If we dump all syscalls, we see the number correlated to the actual name of the syscall. That then gives us the profile, and we can apply this profile to the workload itself. So it is possible to create seccomp profiles manually by using this approach, without having to write an eBPF application from scratch, but it would also be possible to create our own eBPF application, for example with the Go binding libbpfgo. Then we could correlate the process ID to the container we are using via the cgroup path on the local machine, and collect the data directly at the hook. But this sounds complicated, doesn't it? My thoughts on this overall approach: creating profiles this way would not affect the system performance in the same way as the logging does, because eBPF is really fast. We are in kernel space, and we would still have to overcome the performance penalty of moving the data, like the syscalls and the process IDs, back into user space. So this has a performance impact, but not of the same magnitude as the logging. And all nodes have to be reconfigured to contain either the custom eBPF application or the dependent tools.
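What ausyscall does can be sketched with a small shell function. The number-to-name table here is a tiny hand-written excerpt for x86_64; a real table, as dumped by ausyscall, has hundreds of entries:

```shell
# Map a raw syscall number to its x86_64 name (excerpt only).
resolve() {
  case "$1" in
    41) echo socket ;;
    42) echo connect ;;
    49) echo bind ;;
    50) echo listen ;;
    52) echo getpeername ;;
    *)  echo "unknown_$1" ;;
  esac
}

# Feed in the numbers grepped from the bpftrace output and de-duplicate them.
for n in 41 42 42 49; do resolve "$n"; done | sort -u
```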
So we can't just use plain Kubernetes; we have to install bpftrace, or compile the eBPF application and ensure that it runs on every node, and things like that. And gathering all application code paths is still extremely hard. But there must be a better way of doing this, right? And this is actually the case. The security-profiles-operator is an operator which focuses on security profiles for seccomp, SELinux and AppArmor, and it provides an automation around log-based and eBPF-based profile recording. It automatically traces the logs at the right time and then extracts the data. This means if I run a workload, then I can pre-define that I want to record this workload, and the security-profiles-operator will take care of tracing the logs and gathering the data, for example correlating the process ID to the workload. It can also leverage eBPF to record those profiles, for performance reasons for example, and this also automatically correlates the workload to the underlying process. And it creates seccomp profiles: if a recording is finished, namely the workload has been removed, then it will automatically create a seccomp profile based on that workload. This seccomp profile will be represented as a seccomp profile CRD, so it's easier to handle. And if we have such a CRD, then the operator automatically reconciles those profiles to all nodes. So we can have a workflow where we create a profile and then automatically reuse it within the cluster, without relying on a single-node cluster. The distribution of the profiles is handled automatically, which is really great. How do the eBPF recording and the security-profiles-operator work in detail? First of all, we have to create a custom resource which defines that we want to record a workload. This happens by using a label selector.
If the profile recording CRD exists and we create a workload matching that selector, then the security-profiles-operator will use a webhook to add a recording annotation, and this is the webhook's only job. Adding this annotation indicates to the eBPF recorder, via gRPC, that we now want to start a recording. This automatically loads an eBPF application into the kernel, which doesn't require recompiling the application itself because it uses eBPF CO-RE, which means compile once, run everywhere. The loaded eBPF program also uses the sys_enter tracepoint, and because this fires for the syscalls of every PID on the local machine, it emits an event per PID, which is tracked by the mount namespace, because the mount namespace is usually something which is really stable across a container. These events are consumed by the event processor in the eBPF recorder. It tries to get the container ID for the process ID by using the local cgroup, and then it tries to find the container in the cluster. If it has been found, it looks for the profile recording annotation, and then it starts tracking the profile for the PID and mount namespace. This is the overall loop. If we then stop the workload, the profile recorder will collect the syscalls from the eBPF recorder, and the eBPF recorder will take care of automatically unloading itself. So the eBPF program itself runs only during the lifecycle of a single recording, or of multiple recordings, which are possible as well. I would like to demonstrate that to you. If we look into my current Kubernetes cluster, we can see that the security-profiles-operator is up and running, and that we also have cert-manager deployed, which is a direct dependency if we don't have any other certificate provider available in the cluster.
The first thing we have to do, because it's disabled per default, is to enable the eBPF recorder. To do this, we just have to patch the security-profiles-operator daemon configuration, which is called SPOD, and in its spec we enable the eBPF recorder by setting it to true. If we do this, then we can see that the security-profiles-operator takes care of rolling itself out again with the new configuration, and after a couple of seconds the SPOD instance should be up and running again. If we now look into the logs of the eBPF recorder container, we can see that it does some sort of self-check before it actually starts: it does an eBPF load/unload self test, so it loads the eBPF module, tries to attach the tracepoint, and if this all succeeds, it unloads the module again afterwards. This is great because it already gives us feedback on whether the eBPF module is working as intended. The security-profiles-operator ships a default example for recording seccomp profiles by using eBPF. If we look into this example, we can see that there is a special kind, ProfileRecording, available in the security-profiles-operator. We give it a name, which is test-recording in our case; this will later be part of the name of the recorded profile. And what we want to record are seccomp profiles: the bpf recorder right now can only record seccomp profiles; the logs recorder would also support recording SELinux profiles, by the way. We use a pod selector which matches the label app=alpine, and this is the overall indicator for the recorder: any pod which carries that label will be recorded. So let's apply it, and then we can double check that it's available. Yes, our test recording is available here. What we can do now is run a workload which uses an alpine image and carries the label app=alpine.
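The recording resource described above looks roughly like this; the apiVersion and field names are taken from the operator's v1alpha1 API and may differ in the version you deploy:

```yaml
apiVersion: security-profiles-operator.x-k8s.io/v1alpha1
kind: ProfileRecording
metadata:
  name: test-recording
spec:
  kind: SeccompProfile   # the bpf recorder currently only records seccomp
  recorder: bpf          # the alternative recorder is "logs"
  podSelector:
    matchLabels:
      app: alpine        # any pod carrying this label gets recorded
```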
If the container is up and running, then we can do some tests or execute some syscalls. For example, we can create a test directory test-1 and also test-2. If we exit this container again, then it will be automatically removed, and after a couple of seconds the security-profiles-operator eBPF recorder should return the profile for the tracked PIDs. You can see the sh command and the mount namespace related to it, and you can also see that sub-commands, for example the mkdir I ran within the sh shell, are tracked with the same mount namespace. When the result has been returned, the eBPF recorder cleans up and stops the module, for security purposes. So now the seccomp profile should be available, right? If we look for our seccomp profiles, we can see that the test-recording alpine profile has been installed and is available on all nodes within this cluster. We can also look at the actual installed path: the operator creates an operator directory, below that a directory for the namespace, which is default in this case, and then it creates the JSON file for the seccomp profile. If we look at the profile itself, we can see that it is an allow list, just as we built for the log recording. We have a default action which errors out, we have the local architecture available here, and we have the list of allowed syscalls, which contains only the really necessary syscalls. You can also see that we have mkdir and mkdirat available within this profile. Some general thoughts on this approach: we mostly get rid of all manual collection obstacles. We don't have to pre-configure the local node; we just have to deploy the security-profiles-operator and that's it. But gathering all application code paths is still the hardest part here. The integration into a CI/CD workflow would allow us to update the seccomp profiles along with the application lifecycle.
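The recorded result is represented in the cluster as an object of roughly this shape (a shortened sketch based on the operator's SeccompProfile API, with the syscall list abbreviated):

```yaml
apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  name: test-recording-alpine
  namespace: default
spec:
  defaultAction: SCMP_ACT_ERRNO
  architectures:
    - SCMP_ARCH_X86_64
  syscalls:
    - action: SCMP_ACT_ALLOW
      names:
        - mkdir
        - mkdirat
        # ...the remaining recorded syscalls...
```

Because it is a regular cluster object, the operator can reconcile it onto every node without any manual file distribution.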
We have to consider that seccomp profiles change when the application code changes, and this could be integrated into a whole CI/CD workflow for collecting the profiles and then distributing them when upgrading. Because the security-profiles-operator can reconcile those profiles to each node, it can also take care of updating them. On the other side, the security-profiles-operator could also be used in production to distribute the profiles recorded in CI/CD. Let's speak about the bright future of a per-default more secure Kubernetes. The SeccompDefault feature should graduate to beta in Kubernetes 1.25, but this means it still won't be enabled per default, because since Kubernetes 1.24 beta features are no longer enabled per default. Graduating the feature to stable would gain us a security boost in Kubernetes. The plan is also to make it API-aware, so that we have an actual representation of the seccomp profile, not only for the kubelet, but also for the end user. Handling the upgrade paths is probably the most complex part. We were thinking about enabling the feature only for new workloads, so existing workloads won't be touched at all; but downgrading again would then be even harder than upgrading. You can help us make Kubernetes more secure by default, for example by using custom seccomp profiles, by using the security-profiles-operator, or by just sticking to RuntimeDefault for all your applications. Other than that, it would be really great if you would try out the SeccompDefault feature in Kubernetes and provide some feedback about how it behaves. If you enable SeccompDefault and everything works out of the box, then you can also set RuntimeDefault for your applications explicitly. And that's it. Thank you for listening to this talk. I really appreciate your feedback, and let's have a chat about this.