Good afternoon, everybody. My name is Matthias. I'm working with Ben at Armo, which is a security startup specialized in Kubernetes scanning and hardening, and today we will talk about seccomp. I have a few points to introduce what it is, and then we are really doing this session to get feedback from users: whether you know it, whether you use it, and if not, to understand whether it makes sense to implement some capabilities in Kubescape, or whether nobody cares.

So what is seccomp? Seccomp is a feature of the Linux kernel since version 2.6 or so, so very old, and it lets you block syscalls from user space to the kernel. You can also configure it to just emit warnings, but usually we want to block. Why is it good to use seccomp? Well, you can protect your kernel from malicious apps, because let's face it: the Linux kernel has hundreds of syscalls, a few of them are used by everybody, and then there are all those special management ones that are almost never used, and since they are rarely exercised they have issues and people can abuse them. So it's usually a good idea to restrict them, and it's mostly harmless for most applications.

Who learned about seccomp for the first time right now? Did everyone know about it before? Okay, there are a few hands, so that's good; somewhere we have done something right if people learned about seccomp for the first time today. It's been one of the main isolation mechanisms for containers, so it's a really important topic. And it's not only for containers: usually it's for regular processes, and containers are a special case of processes.

Now, to circle back to Kubernetes: seccomp support has been generally available since 1.19, which means more than three years ago, August 2020, and we haven't seen that much adoption.
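To make "blocking syscalls" a bit more concrete, here is a minimal sketch of a seccomp profile in the JSON format that OCI container runtimes consume. The allow list is purely illustrative, far too small for any real workload:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "exit_group", "futex"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Any syscall not listed under `names` falls through to `defaultAction`: with `SCMP_ACT_ERRNO` the call fails with an error instead of executing, while `SCMP_ACT_KILL` would kill the process and `SCMP_ACT_LOG` would only log, which is the warning-style mode mentioned above.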
So what you usually have available is a few built-in profiles in Kubernetes, mostly the default one (RuntimeDefault), and there is also Unconfined, which is really the absence of a profile. When you create your pods you can apply these profiles to your application, but the default one is very general. What's usually better is to write your own profile, but then you run into the problem of getting this configuration, which is a Linux configuration, onto all your nodes: how do I synchronize and distribute these profiles? Kubernetes doesn't provide any mechanism for this. So maybe that's why people are not using it. That's all I had for an introduction; now I would like to know if any of you is using seccomp in Kubernetes. Okay, you want to talk more? I can come over there if possible.

We're using it a little. I think the mandate now is you have to have RuntimeDefault, but we're not making our own, and I don't even know how to make our own in our current environment.

Okay, are you using plain Kubernetes, or a special distribution like OpenShift or EKS? Okay, great.

Just to continue on that topic: I remember there was a feature called SeccompDefault, and it wasn't enabled by default in Kubernetes. Then there are three different profile types, which are sort of built in, and RuntimeDefault is the default for the SeccompDefault feature; that's been available for the last few releases for folks to use, and I think you're probably using that one. There is some very good Kubernetes documentation, and some blog posts where people have explained how to create your own seccomp profile and when that would be useful versus using RuntimeDefault, so that's something to maybe look at and consider later on.

Any other questions, or maybe reasons why people haven't used seccomp? Okay, you have more.
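For reference, this is roughly how those profiles are applied in a pod spec since Kubernetes 1.19 (a sketch; the names and image are illustrative). The Localhost path is relative to the kubelet's seccomp directory, and the profile file itself must already exist on every node the pod can land on, which is exactly the distribution problem just described:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  securityContext:
    # Built-in: use the container runtime's default seccomp profile
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: nginx   # illustrative image
      securityContext:
        # Custom: a JSON profile that must already be on the node,
        # relative to the kubelet seccomp dir (default /var/lib/kubelet/seccomp)
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/my-app.json
```

A container-level `seccompProfile` overrides the pod-level one, so the pod can default to RuntimeDefault while one container pins a custom profile.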
So we've started to roll it out in a few small places. The thing we're nervous about is: how do we know, when we make this change and apps start failing, or failing more than usual, that the cause is this change? Mostly because there's a lot of stuff going on in non-prod, and a lot of things are just broken by default, so it's hard to tell: is this my fault or is it somebody else's fault? What's the recommended way to check for that?

Good question. I don't think you can start seccomp in a warning-only mode by default in Kubernetes, I'm not sure about this, but that would be one solution. What we had in mind with Kubescape is to add some instrumentation that observes and lists all the syscalls an application makes during normal workload processing, then gather this list, help you define profiles, help you distribute those profiles, and then enable them for the pod. We wanted to make it more automated. But it's a lot of work, and I don't know if people are interested in this. How I see it, it would be nice to have the probes deployed in development: you run all your tests in development, then you generate profiles and distribute those profiles to your testing and production environments, so that you're not mixing them. That's how we wanted to do it.

The assumption here looks like dev is actually equivalent to production, and whatever you learn is probably going to match production. I don't know how true it is, for how many of us, that the development environment matches production, so that can be one of the challenges. I have one more question.
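The automated flow described here, record syscalls during a dev/test run, then emit an allow-list profile, could be sketched like this. Everything is hypothetical: the trace is made up, and this is not how Kubescape actually implements it:

```python
import json

def profile_from_trace(observed_syscalls):
    """Build an allow-list seccomp profile (OCI runtime JSON format)
    from syscall names observed while exercising the workload."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",   # everything unobserved fails with EPERM
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{
            "names": sorted(set(observed_syscalls)),  # dedupe and sort for stable diffs
            "action": "SCMP_ACT_ALLOW",
        }],
    }

# Hypothetical trace gathered during a test run; real traces are much longer.
trace = ["read", "write", "openat", "close", "futex", "read"]
print(json.dumps(profile_from_trace(trace), indent=2))
```

The resulting JSON is what you would then distribute to nodes and reference via a Localhost-type profile; the hard part, as the discussion below makes clear, is trusting that the trace covered the rare code paths.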
Great, and I'm actually going to answer your question, because we did a bunch of digging and research on this, not specifically with Kubernetes workloads but with more general Linux applications. Roughly 99% of the time you will stay within the common syscalls you saw in your testing or your other environment runs; but what that means is your application breaks all the time. If you want to use it for logging or something like that, it's fairly useful, but there are enough corner cases, weird stuff that happens with timing, with some error-handling path getting triggered, and so on, that I think you'll very quickly run into problems, or you'll be extra permissive, and if you're extra permissive then you're not doing yourself a service.

I'll go to him first and then you.

Yeah, it's the same with network policies, right? When you write network policies you base them on a default workload, and once you apply them you don't know if you need to add more, because you don't see when things are blocked; the traffic that doesn't exist, you don't see. It's the same with syscalls, so actually there's no good answer for that.

I want to add on to that, because I totally agree, in the sense that it has to do with the fact that it is a syscall-level policy. Something like a network policy is more easily understood: you roughly know what network calls your apps try to make. But almost every app now is written in a high-level language; very few people are writing C and directly calling Linux syscalls. For example, I pulled up the containerd default runtime profile, and you see a ton of syscalls that are allowed and then everything else is disallowed, and some of these you've never heard of, because, like you said earlier, there are so many Linux
syscalls out there. I think the difficulty of working at the syscall level means profiling is very difficult, and, like I said earlier, going through every case in testing is a difficult problem. I think that's one of the big issues with any syscall-based system, not just seccomp. We have to figure out whether there's a better way to profile things; if you can be reasonably certain it's not going to break things in a weird way, more people will be willing to use it.

Very well made points. I'll come back to the gentleman at the back.

This is really just more of a comment. For our given use case we were looking at seccomp at runtime, and what we found is that whenever it was turned on with policy, we just had unexpected application behavior. What we believe was happening is that underlying error handling that is not part of seccomp was in conflict with the unexpected behavior of not receiving a system call return. So just a recommendation for anyone who is looking at this: start with observability first. Turn it on for logging, see what the environment is doing, and be cognizant of the error handling that occurs, because error handling in general is probably going to be one of the largest sources of conflict, and of reasons why your application has unexpected behavior. Once you have observability, start to introduce policy on a step-by-step basis, whether you group it together by function or by how your application is working.

Any other thoughts? Okay, I see one.

As much as a narrow allow list is nicer for truly least privilege, the other option is just denying known dangerous calls, or limiting things you haven't tested. And a related question there: I'm actually kind of curious how consistent some of
the default policies are, because I think there's some variance. Like, is io_uring allowed in policies or not? It's notoriously also a compatibility problem, and some policies allow it and some don't: I think Docker, for example, disallows it in the default policy, but the Podman set allows it, and I suspect that's used by CRI-O also. So I'm curious: is there consistency there, and are people disabling io_uring out of caution, like Google does now?

Anyone want to share experiences about io_uring? I think it is probably worth taking a step back for folks new to seccomp: what RuntimeDefault is, how Kubernetes works with container runtimes, and how we inherit the seccomp policies from the container runtimes. Anyone want to try and answer that? You can also answer, of course.

Yeah, the answer is: I don't know. I should know, because I'm working on the kubelet most of my free time. Passing the mic, because it's recorded.

So, the two questions there. First: I don't know, are they consistent? There is some risk, like, if you are writing a generic open-source app that you want to run on all runtimes on generic clusters, you want to fit the common defaults; so the question is whether the ecosystem is consistent, because I know there is at least some variance in some of the runtimes. And the separate question was: I'm just curious, if people are using it, what kinds of syscalls are people frequently disabling out of caution?

I can share some history that I know, and folks watching the recording or in the meeting room can correct me. From what I remember, Docker was one of the first container runtimes that became popular, and the seccomp filters that they developed were tested across many containers on Docker Hub, and the idea was: if those containers are not failing, there is a good chance that the default profile is good
enough. Then eventually, as Kubernetes became more and more popular, there were more container runtimes, containerd, CRI-O, and the Container Runtime Interface was also created. Today, the way Kubernetes interacts with all of them is: when you say RuntimeDefault, it defers to the container runtime in use on the worker node. For a quick, very high-level summary of the Kubernetes architecture: there is a component called the kubelet that runs on every worker node, and it's kind of a local agent for the Kubernetes API server and all the schedulers and controllers. The kubelet talks to whatever container runtime is available, as long as they are able to understand each other, and when the kubelet gets a request saying "I want to create a pod," it will talk to the container runtime and say, "okay, I want to create a pod which has this container; I know you can create and start containers, please do it." In theory, and I might have gotten some things wrong, happy for folks to correct me, this is how it works. When the container is actually spun up as a process, that's when the seccomp profile configured in the container runtime applies, and as the container starts as a process it inherits that profile: "okay, I'm only allowed these calls, and everything else I'm not allowed." That's basically how the whole kubelet / container runtime / RuntimeDefault thing works.

To your question about whether it's consistent across runtimes: probably not, and that is another thing to consider as somebody who manages clusters. If I'm going to change my underlying Kubernetes infrastructure, is my container runtime changing? And if it's changing, there is potentially unknown behavior that could happen in applications, and maybe a syscall that you assumed was blocked is not blocked anymore, and then there is a possible exploit that was not possible before and which is now
possible. So, lots to think about, but for folks who are new I just wanted to give some context so we can all start at the same level and go from there. Any other thoughts or questions?

Is it container-based or node-based?

I think you answered it right. If I remember right, you pick the name of a policy per pod or per container in the pod spec, but the actual JSON of the policy is typically installed on disk, because it's part of the runtime configuration. So each node needs to have the configuration installed on it, which is one of the things mentioned earlier: you need to distribute this policy to all your nodes, as part of updates or as part of node installation. At runtime, when you schedule or deploy your pods, you need to pick from the list you've already installed, which is tricky.

Can I create a profile which controls exactly which syscalls my container can access?

There's a default policy by default, which for most runtimes looks close to what Docker started using a long time ago, even though you're likely not using Docker. It allows a pretty broad, supposedly safe list of things and denies some definitely privileged things. But yes, you can write your own narrow, least-privilege policy if you'd like, or add things back that are missing from the Docker policy. Notably, what I ran into personally: I believe I was trying to use user namespaces inside a container, and the default policies I got were disallowing it, and I had to add it back.

Am I taking too much time? If this is the case... So, our company uses commercial runtime monitoring; we use Palo Alto Twistlock, and it's expensive. We are also looking at open source. I have investigated Falco, sponsored by Sysdig; it is open source, and we also tried to come up with a seccomp profile. I did a very brief investigation, and I found out that, because we are a cloud provider, and we have to pay a customer if
there is an outage, it is impossible for us to train on the container and generate the profile. So I came up with an idea, and I hope you can give me an answer: is it possible that the open-source projects can provide some simple profiles for containers? For example, at the very beginning we only want the container not to run in privileged mode, not to run in the host PID namespace, and not to have privilege escalation. If you can provide us a profile which just prevents these options, we can first enforce these kinds of controls, and later we can add more and more.

Anyone want to take that? Okay, cool. I'll just summarize for everyone: there are different ways to control and isolate containers; not running as privileged, not allowing escalation to privileged, not running with the host network or host PID namespace, and so on. Is there a way to control that, and then also use seccomp as an additional control?

Just in response to that comment: that's kind of where we are right now. We have divested the kind of runtime control, like seccomp, that actually does the syscall intercept, and we're investigating the use of Falco just to have observability, so that we can minimally enumerate, for any given application, what it's actually calling. We can use Falco in conjunction with other tools to do remediation after the fact, should a malicious syscall come through. However, that doesn't offer you the preventative control that some commercial customers need in their instances. But in development, that's a good strategy to have: again, use something like Falco to have observability first, so that you enumerate all of your syscalls for an application; second, start with the seccomp default profiles, just to see if it breaks anything, kills containers, or creates abnormal application behavior; and then slowly introduce policy on top of that.

Actually, that's exactly what we had in mind for Kubescape, except
we don't use Falco, we use Inspektor Gadget, but it's the same idea.

Maybe first the question; I was actually just going to mention Inspektor Gadget. We're using it, with the caveat that we haven't turned on the seccomp feature yet; we're using it for network policies right now. The idea is that Inspektor Gadget, which is open source on GitHub, runs as a DaemonSet everywhere, and it will just silently collect traces; for us that's network traces. At any given moment you can just ask it: "hey, generate me a network policy," say from this long ago, or just based on whatever traces you see for these labels, and it will do that for you. It will make you a Kubernetes NetworkPolicy that you can then just take and kubectl apply. And it has a mode for that for seccomp. Part of why I'm here is to learn what would happen if I went and turned that on: what would break, or what might I expect. But yeah, you could maybe use that, or we're going to try to use that: silently trace what any of these containers are doing, and then in the future go in there and say, "okay, that one was doing this, generate me a seccomp profile to take with me," and then I can add or modify or take things away.

I think profiling is a very powerful feature, of course, but the biggest issue is that 99% of the time, ideally, your application is running well; it's running as expected, nothing weird is going on. But that one percent of the time where you have errors, or weird paths being taken, is super important. If you turn on seccomp based on some profile you captured over a month of a well-behaved application, and all of a sudden it gets an error, then it can't do any
error remediation, logging, whatever you want to do there; that would be a very serious issue. So you'd have to be very certain that you've tested thoroughly before you can be confident in that profile, and I think that lack of confidence is one of the big things that slows down adoption of seccomp.

Yes, you have to choose: either be resilient or be secure. One of the things to consider, too, is the ability of your application to auto-scale or to recover from errors. The more resilient your application is, the more enforcement you can have in your seccomp policy. However, if your application isn't resilient, if it's more on the pets side of pets versus cattle, then you should really be having a very light seccomp policy.

I also had one question, from a developer-experience perspective. When I've seen seccomp blocking something for my application, I get a very unhelpful error, something like "operation not permitted," or even worse. As a developer it can be hard to know: is this something I did in my application, is this something in the cluster, is it the runtime that's giving me the problem? Any ideas from people working on this and using it, in terms of how to help developers know "this is not you, go talk to the cluster managers or the Kubernetes admins"? Any ideas or experiences? Okay, he has one. Cheers.

It's a bit of a complex problem, because the point of seccomp is to block those syscalls in the kernel, so passing information back into user space is potentially, or was originally considered, potentially leaking data. There is something from Christian Brauner, one of the Canonical guys who has done a lot of work around this, called the seccomp notifier: a way of getting an out-of-band actual response back from the kernel, for what might otherwise sometimes be a segfault. But yeah, it's kind of intentionally opaque, and it's really not a very nice feature for a user.
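One small practical aid, assuming your nodes log seccomp events (for example a profile using SCMP_ACT_LOG, or a kill action with auditing enabled): the kernel emits `type=1326` (SECCOMP) audit records, which at least tell you which syscall was caught and by which process, so a developer can distinguish "seccomp did this" from an ordinary permission error. A hypothetical log line and a minimal parse:

```python
import re

# A hypothetical kernel audit record of the kind produced when seccomp
# catches a syscall (type=1326 is the SECCOMP audit type); field values
# here are made up for illustration.
line = ('audit: type=1326 audit(1700000000.123:42): auid=1000 uid=1000 '
        'pid=4321 comm="myapp" exe="/usr/bin/myapp" sig=0 arch=c000003e '
        'syscall=302 compat=0 ip=0x7f0000000000 code=0x7ffc0000')

match = re.search(r'syscall=(\d+)', line)
if match:
    # Mapping the number to a name is arch-specific; on the node,
    # tools like `ausyscall` can translate it.
    print("seccomp caught syscall number:", match.group(1))
```

Real lines come from `journalctl -k` or auditd on the node, which is cluster-admin territory; but even surfacing "a type=1326 record exists for your pid" goes a long way toward answering "is it me or the cluster?"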
Sorry, that's not a very productive answer.

I mean, it's fair in some ways; there is a good rationale for both, so that makes sense. Any thoughts, anything else anyone wants to share? Yeah, maybe.

I wanted to say there are also other ways of blocking syscalls. If you take the approach that Google had when they started gVisor, for instance: they intercept syscalls not at the kernel level but between userland and kernel land. I don't know if it's more effective, and it definitely has an impact on performance, but I don't know if anyone is using other solutions to block syscalls. So that was my question.

I have yet to put those in prod, but the other thing I'd compare gVisor to is actually just using a VM-based solution, which is sometimes quite difficult in the cloud, with some providers disabling nested hypervisors. Kata Containers in many ways I would actually compare to gVisor more than anything else, and they're different trade-offs. My understanding is that gVisor has overhead for sure, but it is a lower fixed overhead: it doesn't require tons of memory, everything is just a little bit slower. A VM has a moderate fixed overhead but is actually quite efficient once it's running. And seccomp is mostly free, because it's already in the kernel, but it has limitations.

Actually, since gVisor was mentioned, I'm curious from the monitoring and tracing standpoint. For example, if you're using eBPF for syscall tracing and gVisor intercepts the call, what will you actually see through the entire stack; what will be in the eBPF? Because gVisor essentially fakes the call, and fakes the interception at both levels, in user space and in kernel space, as far as I understand. So I'm just curious if anybody has tried to run eBPF along with gVisor, or any other syscall tracer, and how the entire stack would look. I haven't performed that
specific test, but my understanding is that the emulation of those syscalls is actually done with a really small subset of total syscalls; well, that's the set that they emulate. There's a diagram on their website somewhere. It's probably also worth saying that Google is rolling back gVisor. They built it around Cloud Run, because Cloud Run was originally running on Borg, and they decided the Linux kernel was not a sufficient isolation boundary to expose to users, which is why they wrote gVisor; but then there have been weird edge cases, and I think Cloud Run v3, the latest incarnation, just doesn't use gVisor. On the other hand, there's a triangle graphic on their website somewhere which shows the number of syscalls they support, versus the ones they actually pass down at the real syscall level, and it's not very many. So I think you'd see very confusing traces, because they wouldn't map. Again, perhaps not an immediately helpful answer.

We've got about a couple of minutes more, and I kind of want to end on a non-seccomp note and answer your question. Seccomp is one of the many ways we can isolate containers and pods in our cluster; the three things that you mentioned are already good ones that we can continue to use, so it is always important to keep that in perspective. Yes, definitely use seccomp, it's helpful, it does a lot. But to your question: have you played around with the Pod Security Standards that were made part of the cluster, I think in 1.25 and later? That would be something like what you are looking for. There are three built-in standards as part of the cluster that cover all three things you mentioned, and some more, and they're based on different granularities. There is a seccomp field also, to do the things that
we all discussed, and that is something easy to try on your laptop, just to see if your application works. It starts with privileged, baseline, and restricted; I think baseline is a decent start, good enough depending on your risk posture. So that's where I would say we can end: yes, seccomp is great, we should continue to explore and push the boundaries to make it better, and in some future world, when you give the YAML of a pod, it will automatically generate a seccomp profile for all of us, and that would be wonderful. But we are not there yet. So, with that...
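As a sketch of what that looks like in practice (the namespace name is illustrative), the Pod Security Standards are enabled per namespace with labels:

```yaml
# Enforcing the "baseline" Pod Security Standard on a namespace
# (Pod Security admission is built in since Kubernetes 1.25).
apiVersion: v1
kind: Namespace
metadata:
  name: my-team
  labels:
    pod-security.kubernetes.io/enforce: baseline
    # Warn at the stricter level to preview what "restricted" would block;
    # "restricted" is the level that also requires a seccompProfile of
    # RuntimeDefault or Localhost on every pod.
    pod-security.kubernetes.io/warn: restricted
```

The three levels map to the controls raised earlier: baseline forbids privileged mode, hostPID, and similar sharing, while restricted additionally requires dropping capabilities, disallowing privilege escalation, and setting a seccomp profile.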