Good morning, everyone. My name is Roberto Sassu, and today I will give you a status update for the integrity subsystem. I will first talk about making IMA and EVM regular LSMs, and then about adding IMA support for the machine keyring. I will talk about a solution for improving IMA measurement and appraisal called the Integrity Digest Cache, and I will tell you about the highlights of the new version of ima-evm-utils, 1.5, about an effort to improve the IMA and EVM documentation, and about some kernel commits that were merged recently. So, originally, when IMA and EVM were being upstreamed, there were some technical issues that prevented them from being regular LSMs. One of them is that the security field was already used by SELinux, Smack, or AppArmor, and could not be shared with integrity, and Linus rejected the idea of adding another set of hooks. There is also the problem that IMA and EVM need to store the integrity state in the inode, but the security pointer was already used by the active LSM. Also, although EVM can calculate the HMAC on multiple extended attributes, the API at that time allowed it to know about only one extended attribute. Over the years the situation improved, and the technical limitations that we had before were solved. In particular, thanks to the work of Casey Schaufler, we now have LSM stacking, which allows minor LSMs to be called sequentially, and it is also possible to share the security blob among different LSMs. Also, the integrity LSM is now always enabled and is placed as the last LSM, and this is important because we don't run the risk that IMA and EVM are accidentally disabled, which would mean that the HMAC or IMA extended attributes get out of sync. And EVM needs to be last because it needs to see all the extended attributes provided by the other LSMs in order to calculate the HMAC at file creation time.
The final piece needed to make IMA and EVM regular LSMs is calculating the HMAC on multiple extended attributes, and as I mentioned, that was not possible because inode_init_security, which is the LSM hook that is invoked when a new inode is created, only passes one extended attribute to fill in. The change is to instead pass an array of all the xattrs, so that EVM can iterate over every element of the array and calculate the HMAC. When I started this work, I said to myself: let's do it in the safest way possible. Don't make any behavioral changes, because any change that we make needs to be carefully tested. So I tried to do very mechanical operations. The first one is to align the parameters of the IMA and EVM functions with the definitions of the LSM hooks. The second one is to add an LSM hook in the places where IMA and EVM were called but there was no hook at the moment. And finally, it becomes possible to register the IMA and EVM functions as LSM hook implementations; that's very simple. As a bonus, now that we have the possibility to share the security blob among different LSMs, we also store the pointer to the integrity metadata in the security blob. I already posted the second version of this patch set, and it will soon be merged.

Now I will talk about the machine keyring, work that was done by Eric Snowberg, and this is one of the key points to make secure boot successful, because it allows users to load local and third-party CA certificates and to use them for signature verification. There can be different restrictions on this keyring: it is possible to have no restriction, to load only CA certificates, or to load only the CA certificates that are used for signing keys. And Eric also added support for loading local and third-party CA certificates into the IMA keyring for IMA appraisal.
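The hook change above can be modeled with a short sketch. This is a Python illustration, not the kernel's C code: the key, the xattr names, and the function name are all illustrative, and in the real kernel the EVM key lives in a keyring. It only shows the idea that EVM now iterates over an array of xattrs and folds all of them into one HMAC.

```python
import hmac
import hashlib

# Hypothetical EVM key; in the kernel this is loaded into a keyring,
# never hard-coded.
EVM_KEY = b"example-evm-key"

def evm_calc_hmac(xattrs):
    """Model of EVM computing one HMAC over an array of xattrs.

    `xattrs` is a list of (name, value) pairs, mirroring the reworked
    inode_init_security hook that passes every xattr at once instead
    of a single one.
    """
    mac = hmac.new(EVM_KEY, digestmod=hashlib.sha256)
    for name, value in xattrs:   # EVM iterates over every array element
        mac.update(name.encode())
        mac.update(value)
    return mac.hexdigest()

# With the old single-xattr API only one entry could be covered; with
# the array, all xattrs are covered in one pass, and order matters.
digest = evm_calc_hmac([
    ("security.selinux", b"system_u:object_r:etc_t:s0"),
    ("security.ima", b"\x04\x02"),
])
```

The design point is that the HMAC is computed once over the whole set, so every security xattr an active LSM supplies at inode creation is bound into the same integrity value.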
Finally, Nayna Jain added support for extracting the CA keys and the code signing keys, respectively, from trustedcadb and from moduledb on the PowerVM architecture.

OK, now I will talk about the Integrity Digest Cache, which was previously known as DIGLIM and DIGLIM eBPF, and which allows us to overcome some challenges for measurement and appraisal. It is simply a kernel-based cache of file digests or metadata digests, which are extracted from trusted sources like RPM headers, Debian packages, manifests and so on. These file digests are used as golden values for appraisal and for measurement, and the first requirement for using it with IMA is that the RPM headers and all these manifests need to be either measured or appraised. It works in this way: when IMA needs to measure or appraise a file, there is a new extended attribute which contains the path of the digest list, the RPM header. IMA calculates the digest of the file and then accesses the digest cache to see if the golden value is in the cache. If there is a cache hit: no new measurement, no PCR extend, and appraisal is successful. If there is a cache miss: normal measurement, PCR extend, and appraisal fails. This allows us to solve some of the problems that Matthew talked about today. One is the fact that the measurements are not deterministic, because executions can happen in parallel during the boot, so the PCR value at the end of the boot will be different each time. What I'm doing now is measuring only the digest lists, which contain the approved values, so the sequence of measurements is fully deterministic after each boot. Another advantage is that there is lower overhead for IMA appraisal, because we can verify only one signature for potentially a thousand files, since the RPM header contains the checksums for all the files.
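The hit/miss decision described above can be summarized in a small sketch. This is a conceptual Python model only; the function name, the return structure, and the digest values are illustrative, not the kernel API.

```python
def ima_check_with_digest_cache(file_digest, digest_cache):
    """Model of the Integrity Digest Cache decision.

    `digest_cache` is a set of golden digests extracted from a trusted
    source such as a measured RPM header.
    """
    if file_digest in digest_cache:
        # Cache hit: no new measurement, no PCR extend, appraisal passes.
        return {"measure": False, "pcr_extend": False, "appraisal": "success"}
    # Cache miss: normal measurement and PCR extend, appraisal fails.
    return {"measure": True, "pcr_extend": True, "appraisal": "fail"}

# Golden values would come from an RPM header that was itself
# measured or appraised; these digests are made up for illustration.
golden = {"aa11", "bb22"}
print(ima_check_with_digest_cache("aa11", golden)["appraisal"])  # success
print(ima_check_with_digest_cache("ff99", golden)["appraisal"])  # fail
```

This is why the scheme makes the measurement list deterministic: only the digest lists themselves get measured, so file accesses that hit the cache never extend a PCR.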
And it's an extensible architecture, because we can support RPM headers, Debian packages, and any trusted source that you want. There are a few drawbacks. The first one is that when you do a software update, the measurement list changes, so you need to seal the TPM key again to a different policy. So I was thinking of using different PCRs for the digest lists and for unknown files, so that when there is an unknown file that causes a PCR extend, what we basically want to do is revoke the TPM key, so that it cannot be used, for example, for secure communication. And finally, we lose some information: the measurement list that we obtain is not as accurate as the normal measurement list, because we don't know, for example, which file in the digest list was really accessed, and we also don't know in which temporal sequence.

We released ima-evm-utils 1.5, and the most notable change is support for User Mode Linux, so we can execute the kernel in user space. This is very helpful because we have a pipeline in GitHub Actions, and we test new kernel patches with it. We also had minor improvements, like running specific tests or updating the testing distros whenever a new version is out. We added a new feature to evmctl, the ability to sign an fs-verity digest, because IMA appraisal now supports it, and we can read the TPM 2.0 PCRs through the sysfs interface. We also added new tests for the kernel patches that we released over time. Ken Goldman is also leading an effort to improve the documentation of IMA and EVM for developers, in order to help them write more useful IMA policies, and he is also explaining the content of the IMA measurement list in order to perform a better and more precise remote attestation. There are different patches that have been submitted to the kernel, ranging from the VFS, overlayfs, IMA, the integrity subsystem itself, and fs-verity.
And finally, I have an announcement: CentOS Stream 9 now has the CA certificate to verify file signatures, so it is possible to enable IMA appraisal. And we expect that 9.3 will support the same. That's it. Thank you. Any questions? If not, I have another talk, or I can do it in the order that is in the schedule. I'll do it now.

I will also talk about the Smack LSM: the major developments since 2021, and an update from the Smack maintainer about which projects need more manpower. The first set of changes concerns the Smack transmute feature, which is the ability to set the label of new files depending on the parent directory, and not on the label of the current process. The problem was that Smack didn't support overlayfs well; in particular, the transmute extended attribute was not set correctly on directories, because overlayfs uses a temporary directory and later moves the new file to the final destination. So I fixed this part. I also fixed another issue with EVM: the transmute extended attribute was created after inode instantiation, and the problem is that EVM was not notified about the creation of this extended attribute, so the HMAC basically became invalid. The final part is adding support for Smack transmute on filesystems that don't support extended attributes. Another set of changes comes from supporting the new revision of the mount infrastructure, and overall we had 15 contributors and 27 patches. Casey left the workforce in April, but he is still maintaining Smack as a hobby, and he is also doing a wonderful job on LSM stacking; he wants to make it complete in order to support two different major LSMs side by side. There are some projects which need more manpower, in particular the support for IPv6, which was done before CALIPSO was available.
So we are looking for developers to rework that code. There is also an effort to have a Smack policy for Ubuntu, but it has lost its corporate funding, and we are looking for developers to continue this work. Finally, we are looking to extend the Smack test suite, to add more tests and to convert it to the kernel selftest infrastructure. Thank you. Any questions? Thank you very much.

All right, so we're going to talk about what's gone on in AppArmor recently. There's just a little bit under new mediation that we're going to cover; these are nothing big. We got a new hook last year for user namespace mediation, so we've added support for that, and some new hooks around io_uring, and added support for that too. POSIX message queue mediation: we had something there, but it was pretty coarse, and we have improved it. Not that a lot of things use it, but we actually had a use case that wanted it, so we made that work. And we had some community people come in and add dbus-broker support for our mediation, and we've been working to make the test suites pass with it and make sure it's up to snuff. Most of the work has actually been under-the-hood cleanups, and not just cleanups but improvements around performance, locking, things that you don't really see day-to-day but that can make a big difference. So this is kind of how policy has historically been laid out inside the kernel: we have a whole bunch of different profiles loaded as a policy set, and each of those profiles has some data inside it. There are three state machines: the attachment one, which is used for attaching policy to applications; the file one; and one for everything else. Unfortunately, those last two have been split for historical reasons. One of the big things is that within each state machine, the permission sets are encoded in there, and that really limited us on what we could store as permissions.
We had 64 bits, and every state in the state machine has that 64-bit permission set associated with it. So it's sometimes more permissions than we need, and very limiting. One of the big changes is that we've reworked this, and it has been a very slow staging process. What has now landed in the kernel is that older policy, on load, gets remapped to the new format, so now we can start cleaning up kernel code around this. It's not done yet upstream, but we'll be able to merge the code that uses the file state machine and the alternate one for both types of rules, so that they'll merge, and then we can collapse a whole bunch of special cases in the kernel and get our code cleaned up a lot, which I'm really looking forward to. Not only that, it reduces our memory use in the kernel, and it can actually speed us up. For individual attachments we currently have to walk a kind of list when we have policy loaded together like this; with a shared attachment we just do one walk and we find what we need immediately, and it doesn't matter if we have a thousand profiles, it's just a single walk. So we get some big performance improvements out of this, we get a reduction in memory, and we also get a bigger permission set, so we have room to extend it in the future without a problem, and we get more features out of this. And it can actually take up less memory, because instead of permissions at every state, we just have a small index stored in the state machine pointing into the permission table, so we can collapse it down, and permissions can actually be smaller than they used to be, which is another win, right?
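The memory argument above can be made concrete with a toy back-of-the-envelope model. This is not AppArmor's actual encoding; the index width, table-entry size, and counts below are all assumptions chosen only to show why a small per-state index into a shared table of (now wider) permission sets can beat a 64-bit word at every state.

```python
def per_state_bytes(num_states):
    # Old scheme: 8 bytes (64 bits) of permissions stored at every state.
    return num_states * 8

def indexed_bytes(num_states, num_unique_perms, index_size=2, entry_size=16):
    # New scheme (illustrative sizes): a small index per state, plus one
    # table entry per *unique* permission set. The entries can be wider
    # than 64 bits (here 16 bytes) and memory still shrinks, because the
    # number of distinct permission sets is tiny compared to the number
    # of states.
    return num_states * index_size + num_unique_perms * entry_size

states = 100_000   # assumed DFA state count
unique = 500       # assumed distinct permission sets across those states
old = per_state_bytes(states)
new = indexed_bytes(states, unique)
print(old, new, old > new)
```

The same deduplication is what enables the richer per-state behavior mentioned next: widening the table entry costs almost nothing once it is shared.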
This now allows us to do things like: where before we could only say that we're going to kill tasks if they access things and it's denied, now, because we have this extended permission set, we can add that extra information on a per-state basis, so we can very finely specify: if you're going to write to /etc/shadow, we'll kill you, right? It's just a nice little extension that we pick up with this. We can also store some modifiers: maybe we want to kill you, but we're not going to audit it, so we're going to be quiet about it. I don't know why you would want to do that, but it gives you the control to do it, which is really nice as well. Another area we've been working on a lot is policy development, which is a real pain point. Our complain mode, or learning mode, whatever you want to call it, basically logs everything, so when you turn it on you get a whole flood to the logs unless your policy is really good, right? It's great for small policy changes, but if you're trying to develop for something new, like at the start of a development cycle, it can flood the logs: you can lose messages to audit-log drops or printk rate limits, things like that, so it's very painful. So one thing we've added is a cache, so that we can reduce and de-dupe these messages, because for some of these things, like capability requests, you can get hundreds that are the same, and what's the point of logging all of those? It also allows us to say, instead of directly outputting, hold this for the developer: we can attach a tool to it and just dump them all out and redirect them away from the log completely, so these things aren't going to the logs. If you don't do that, they'll still go to the log, but it gives us a way to redirect them away from the logs and stop cluttering up your system, which makes life much better when you want to go look at your logs. And this, again, goes back to the
permissions extension: again, we get finer control. Instead of just turning complain mode on for the whole system or for a profile in the system, we can now get more specific about it and say: I only want to complain about these things. Complain, in this case, is like audit-and-allow: it's going to say, I don't have permissions for this, but I'm going to audit it, log it, and allow it. So for anything under there, I want to see what it is, I want to get it out. Now we can limit it down and still have other parts of the profile enforce and reject things, which is nice when you're trying to iteratively update a profile and you want to be in a semi-production environment. I would never do it in a real production environment, but it happens, right? We had some fun with buffers. AppArmor needs buffers for some of its mediation, and to do this we pre-allocated the buffers and attached them to CPUs, as per-CPU structures. Doing that makes them fast to grab, and there's no chance of failure: your allocations aren't done at mediation time, so there's no failure there, it's already pre-done when you're set up, and you're done, it's good, right? But when you have a large-CPU-count system, that's a lot of wasted memory, and not just memory, kernel memory, right? And the buffers aren't small, they're 8K right now, so if you're thinking of, say, a 256-CPU system, that's a lot of kernel memory, and we're not using them all the time, it's only for specific mediation-type events, right? And there's another problem with them, right?
Since we're talking per-CPU memory, we are talking per-CPU critical sections and locking. So when we get a buffer, that starts a critical section; we do our work in the critical section, and then we put the buffer back. That can't be preempted, and that means it's bad for real time. So we looked at this, and some patches were submitted to us to deal with it, so that real time could work and so we could also improve our memory usage. The idea is to go to a global memory pool, where we allocate several buffers protected by a spinlock, with some reserve for cases where we fail on locking and we can't sleep, that kind of thing. It's good for real time, it's good for reducing memory, and it has also allowed us to start cleaning up some of our more complicated locking in certain places. That's not all done, but it's good for our locking as well, and it makes the code simpler. It's not fast when you have to take the lock in certain paths; we can live with it, and we'll get to it. But on large-CPU systems, instead of having problems with memory, we now have problems with lock contention, and it turns out to be a huge issue: in certain mediation paths, like doing a git clone or git gc walking a tree, we can see something like a 40-times slowdown on these large systems. It just kills performance, so this is no good either. So we've gone with a hybrid approach now. We have a queue, a small queue, just a list essentially, on a per-CPU basis, and then we still have the reserve. What happens is, when we go to get a buffer, we get it, we put it back, that's fine. If we need to, say, start optimizing something where we know we're going to make multiple buffer requests, we can put it on the queue and get it back from the queue. We check the queue first, and it's a very small window, so it's good for the critical-section window, good for preempt real time; then, if there's nothing there, we go to the global pool, and then even
when we're putting back to the global pool, if we hit contention, we'll add the buffer back to the per-CPU queue instead. So we have this dynamic: it scales with contention, because we just keep reusing the queue until we see the lock contention go away. Now, we're still playing with the scaling a little bit: when do you put a buffer back? Do you put it back right away if you didn't have any contention? How many tries when you allocate a buffer? So we're still playing with how often you want to do that, for performance. There are a few cases, like I said, where the spinlock really hurts, so if you need to, you can deal with those, and systems don't use tons of memory all the time, and it also works with real time. So that's where we're at right now. Another thing: unconfined has been integral to AppArmor since its beginning, the idea being that we're going to treat the system just like it's DAC, and we're just going to get out of your way when you're unconfined. And you don't have to use it; you haven't had to use it for a long time; you can confine everything, you can put profiles on everything, but that's not how people do things. And we've run into issues where it's just gotten to the point where we're going to remove it, and to get ready we're starting to put in restrictions. So right now we're at a point where we're putting in a few restrictions: we've got a profile, and we've made some changes around change_hat. The change_hat stuff is really minor, basically allowing unconfined to use it, where it just wasn't allowed at all before. And what will happen is, when you're unconfined, and it's not just unconfined, when you're unprivileged and unconfined, you can't use user namespaces; you need some kind of policy on you when you're going to use that. Same with change_profile, a little different, we'll get to that. I don't know how SELinux has gone with this, but in our experience with this policy we've broken LXD and Chrome and Firefox.
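Stepping back to the buffer management described a moment ago, the hybrid scheme can be sketched as a toy model. This is illustrative Python, not the kernel C: the class, queue sizes, and the explicit `contended` flag are assumptions standing in for the real per-CPU lists, spinlock, and contention detection.

```python
from collections import deque

class HybridBufferPool:
    """Toy model of the hybrid approach: check a small per-CPU queue
    first, fall back to a global pool, and when freeing under lock
    contention, park the buffer on the per-CPU queue instead."""

    def __init__(self, ncpus, global_buffers, bufsize=8192):
        self.percpu = [deque() for _ in range(ncpus)]
        self.global_pool = deque(bytearray(bufsize) for _ in range(global_buffers))

    def get(self, cpu):
        if self.percpu[cpu]:              # fast path: tiny critical section
            return self.percpu[cpu].popleft()
        if self.global_pool:              # slow path: spinlock-protected pool
            return self.global_pool.popleft()
        return bytearray(8192)            # reserve / last-resort allocation

    def put(self, cpu, buf, contended=False):
        if contended:                     # contention seen: keep it local,
            self.percpu[cpu].append(buf)  # so the next get() skips the lock
        else:
            self.global_pool.append(buf)

pool = HybridBufferPool(ncpus=2, global_buffers=4)
b = pool.get(0)
pool.put(0, b, contended=True)   # parked on CPU 0's queue
b2 = pool.get(0)                 # reused from the per-CPU queue, lock-free
```

The point of the design is that the per-CPU queue grows only while the global lock is actually contended, so memory use stays close to the global-pool scheme when the system is quiet but scales past the contention when it is busy.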
It's been fun. So one of the things we did to help with this transition is we've added an unconfined flag to profiles. That allows us to do an intermediate step where we can specify that we're going to let you get past these restrictions: you get a label on you, and we can use the profile, so that we can use that unique name label in object-communication-type stuff, and we can start developing and extending the policies around these. This is an intermediate step that lets us do a lot less work, essentially; all the unconfined flag really does is let us write policy that behaves like unconfined, quicker. You can write that policy already, but it's work. Now, the change_profile one that I mentioned: what happened previously with unconfined, and this was fine, this was allowed in the model, is that when you're unconfined you can opt into confinement. You say, I'm going to go into this profile, and when you exec, you're still under that profile's restrictions and you transition to the chosen profile. However, if you have system policy, normally that exec would transition to the system profile. Our advice to people is: if you don't want to allow that, confine the person, confine the user, don't be unconfined. People don't get that. So the change here is not that we're going to restrict change_profile under unconfined; what happens now is, when unconfined does a change_profile that wouldn't normally be allowed to change the system policy, we go to our stacking, or bounding, so you get your chosen profile stacked with unconfined. The behavior is the same, except when you do that exec: it transitions to the system policy, and that system policy is applied. So now, if you want to be a developer and do this bypass of policy, you have to be privileged, and if you can do that, you can still use change_profile as a privileged developer user and specify, I want to go into this profile because I'm testing things. But if
you're not privileged, you're not going to be bypassing system policy. And tooling: we've had a fair bit of tooling updates. Our policy compiler is somewhere between 1.3 and 2.5 times faster, and we've been doing a lot of structural rework in it as well, just like in the kernel; of course, we added support for all the kernel changes that were done there. There's a new utility, aa-load, that allows loading cache files. systemd itself already has this built in, it's just calling the library, libapparmor, to do this, systemd does it directly, but if you don't have the policy compiler and everything on your system, this is just a thin shim layer, essentially. We've been working on the libapparmor interfaces to improve them, doing some holding of open FDs: all the interface right now is based on file descriptors, and there used to be a lot of open-and-close, reopen-and-re-close kinds of calls, which is terrible for D-Bus, because our D-Bus mediation has to go through this. So there's been some work to clean that up and hold descriptors open for the lifetime, and stuff like that. Ideally, we want to transition to something like an ioctl, we're not there yet, or a syscall, whatever, so that we can do it all in one system call instead of multiple. We had to pick up support for all the new rules in our introspection; we've been working on improving the notifications and support for all the new flags, and the ability to filter things based on settings and arbitrary filters, because I want to just look at this type of thing, stuff like that. And we've got AppArmor 4.0 coming, that's our user space, and it's got hundreds and hundreds of bug fixes for everything that just wasn't working, so that's quality work there.

So anyway, my name is Paul Moore, and we'll talk about the state of SELinux as it stands today. How are we doing on time? We're good?
Alright, so the last time we did this talk was just over four years ago, so I'm going to try to catch up on four years of SELinux development in the next 10-15 minutes. Needless to say, I'm going to miss a few things. I'm not going to cover general performance improvements, bug fixes and all that, but just assume that there were lots of those; four years of development across kernel, user space, and policy, there's a lot there. One of the big things I wanted to start off with is that SELinux is a mature technology at this point: it's been over 20 years in the upstream Linux world, and that's a long time. It was one of the first LSMs, but we still see it being added to systems; in fact, we've seen a little bit of a renaissance lately in the past few years, and the cloud has helped out a lot with that. I've listed four things here in the cloud, and I'm sure there have been more; your favorite cloud distro, Googling around, was the first thing that I hit. So Azure Linux has it, which is relatively new, it was just announced earlier this year; Amazon Linux has it; Bottlerocket; I'm sure there are others which I'm forgetting, my apologies. The embedded space is another one. We used to talk about the Linux desktop; somebody said a while ago that we basically have that, and it's Android. Nowadays, last I heard it's somewhere over 3 billion, I think 3.3 is what I saw at some point, but at 3-plus billion Android devices, over 99% of those devices are running SELinux in enforcing mode. And that's just the Android systems. Also, if you have a Windows laptop with the Windows Subsystem for Android, you're actually running SELinux on your Windows system: it's not in the Windows kernel, but it's in the Linux kernel in the Android stack, so that's kind of cool. Also, I just learned very recently, in the past few weeks, that Automotive Grade Linux has switched over and supports SELinux now, which I thought was kind of cool, and during a talk yesterday we touched on both of these cases, so it's
another area which is kind of interesting. Like I said, I'm only touching on cloud and embedded; there are other things. There's been some movement with new distributions switching over, and I'm sure there are other things, like Yocto and OpenEmbedded and all that stuff, but these are some of the highlights. This is just some development statistics, and, you know, it's four years, so you can take it for what it is. They're all big, interesting numbers: the kernel has had almost 10,000 lines of change, the user space has, you know, 850,000, the reference policy is knocking on 50,000 lines of change, and you see we've got hundreds of commits. But the thing that I really like to see, the thing that makes me the happiest, is contributors: over the past four years, user space has 51 and reference policy has 40 different contributors. As somebody who's worked in the community for a while, it gives you the warm fuzzies when you see it's not just the same people all the time; we're getting new people in who are using SELinux and who are contributing to SELinux, and I think that's always kind of the mark of a good community project, so please remember that from this slide. And kind of building on that, we added a couple, well, not a couple, three new projects to the upstream SELinux project on GitHub. I think almost all of these projects actually existed before, so they're not new in the past four years, but we've worked with those maintainers and kind of said, hey, we're trying to consolidate everything, would you mind contributing your stuff here, and everybody was really happy to do it. One of the things I'm proudest of is the SELinux Notebook; if you haven't checked it out, please do. We always talk about needing more documentation in open source projects, and the SELinux Notebook, I've never seen anything quite like it: there was one individual, who had retired, and this was a hobby project of his, a
three- or four-hundred-page document about SELinux, and it covers pretty much every aspect of it, and it's amazing, especially when you consider this was a spare-time project of his. He was gracious enough to donate it upstream for public release, and we've done some additional work to, notably, keep it up to date: as we add new features, we update the Notebook. We've also converted it fully over to Markdown, and what's kind of neat about that is you can go to GitHub and download a full PDF, and we have ebook reader formats as well, but you can also go to individual chapters on GitHub and just reference them, and they're rendered correctly in your browser. So if you need to send pointers to somebody over email, somebody asks you, how do I do this, how do I do that, well, you just send them a link. So, everybody likes logos, to make cool stickers and whatnot. Just within the past year we had somebody contribute a new set of high-color, high-resolution logos for SELinux. We've got the original ones I think everybody's used to seeing, you know, the little penguin; depending on how you look at it, it's either a padlock or a jet engine, and I always saw the padlock until somebody said, why is there a jet engine? But anyway, that's there, as well as the new one. And I just want to point out that you don't always have to be a software engineer to contribute to open source projects: if you want to contribute documentation, if you want to contribute artwork, we would love to have you as part of the community, and there's a place for you to contribute; it's all in GitHub. And last but not least, we have a policy linter. SELint has existed for quite some time, and its developer is now at Microsoft with us, on our team, but we've moved the SELint project over into the larger SELinux community group, so that now we have an official, I guess, policy linter which you can use, and if you want to help contribute to that, patches are
always welcome. Now we'll start getting into some of the more detailed stuff. SELinux is all about access control, so we've got a number of access control additions, and these are motivated by new kernel developments outside of SELinux, for example io_uring and perf. All the asterisk-notify ones, dnotify, inotify, fanotify: we have controls for all of these things. We have support for a couple of new network protocols, Multipath TCP and the Management Component Transport Protocol; I don't want to tell you how many times I mixed those up when I was typing up this slide. Anyway, for those of you who know what those protocols are, we have support for them. We've also added support for the move_mount syscall, and also user namespace creation, we have a control for that as well. I almost left this off the slide, but it was kind of an interesting little thing if you followed it upstream: during the four years we added support for the kernel lockdown functionality, and then we removed support for the kernel lockdown functionality. The kernel lockdown functionality still exists as a standalone LSM; we just don't provide support for it in SELinux. We've also done a number of filesystem labeling improvements. We have individual file labeling support via genfscon, a technical detail I can't get into in the few minutes we have here, but binderfs, bpffs, and securityfs have individual file labeling support. We also added persistent file labeling support via xattrs to UBIFS. We support anonymous inodes, which is a kernel concept that probably doesn't make a lot of sense from a user space perspective, but it's what backs io_uring, userfaultfd and memfd_secret; we added support so that when you use these functionalities in the kernel, you can label them with SELinux, you can do type transition rules, all the things that you'd expect to do. And we also added fallback labeling support for specific filesystems, which allows you to use xattr labeling if it's present but then fall back to genfs labeling. It's not particularly
Where this comes into play is with composite filesystems like virtiofs, where the backing store underneath could change: it could be a filesystem like ext4 that supports xattr labeling, or it could be one that doesn't. This allows you to write your policy to use xattrs if the filesystem has them, and fall back to genfs if it doesn't.

User space has seen a number of changes as well — once again, just picking out some highlights. We added a couple of new library APIs, including APIs for querying validatetrans policy rules; if you're a policy guru who knows what that means, wonderful, and if not, don't worry. You've probably completely blown up your SELinux system at some point — I mean it won't even run, systemd fails on boot — and seen the message that you need to relabel your filesystem. restorecon is one of the tools you use to do that; historically it worked one file at a time, and on a filesystem with a lot of files that could take a while, but we've now got parallel relabeling support. There are also a number of new policy analysis tools; if you're doing policy development, these can be very helpful. Too many tools to go into detail here, but they're there — you can look at the documentation.

We've also had some general policy format improvements. We've added what we call the greatest-lower-bound policy construct: for those of you familiar with multi-level security, or MLS — usually information assurance people, government customers — this allows you to take two MLS labels and generate essentially their intersection, which can be very handy for a number of problems MLS is trying to solve. MLS technology is also used for something called MCS, which may be more familiar to you from the separation of VMs, and it's also been applied to container technologies, so this is available there too, although arguably perhaps not as useful, since it's more of an MLS functionality.
We also improved how we store filename transition entries in the binary policy. This isn't really user visible, in the sense that it doesn't add any new capabilities or functionality, but it did manage to cut the size of the Fedora policy; depending on how your policy is constructed, if you've got a custom policy you could expect roughly similar size improvements, and with a reduction in size comes an improvement in load times. It's a nice change.

Beyond the format improvements, we've made a couple of policy capability additions. The first — I always love trying to pronounce "ioctls" — is ioctl_skip_cloexec, which basically enables those two ioctls, FIOCLEX and FIONCLEX (which I'm not going to try to pronounce), without any explicit policy allow rules. We won't go into details, but these are some very innocuous ioctls; policy in general just allowed them anyway, and they're similar to some other operations in the kernel that we were implicitly allowing, so it made sense to give policy developers the ability to allow these ioctls without explicit rules. The default for this policy capability is off, so don't worry: if you don't want to allow these ioctls for whatever reason, your default isn't going to change — you're still going to have to explicitly allow them. The other policy capability does basically exactly what it says on the slide: it allows symlinks on kernel-based pseudo filesystems to inherit their label — their security properties — from the parent directory, which tends to work out pretty well. But once again it defaults to off, so if you want to make use of it, you need to ensure that policy capability is flipped on in your policy.

And then reference policy changes — apologies, it's a little small — a ton of reference policy changes. You saw the development statistics, but some highlights: a lot of systemd improvements, probably the biggest one being the concept of user surrogate domains, which helps enable some of the systemd user support.
Support for container engines has also improved dramatically. We've added Udica support, as well as some initial Kubernetes admin support, and talking to the people who have done that policy, they're definitely very interested in feedback — so if you're using Kubernetes and SELinux, they would love to hear from you and work with you to make it better. We've updated some of the MCS constraints — I talked earlier about MCS — to better reflect how things work in the real world, so there's a reference policy interface to opt in to MCS-enforced separation and sharing. We've added a policy boolean to disable boolean changes: basically, if you want to harden your policy a little so people can't inadvertently toggle something on or off via SELinux policy booleans, you flip this boolean and it locks everything in place. And we renamed a lot of the *_var_run_t types to *_runtime_t, to avoid path-specific naming. We generally try not to encode path names in file labels, because somebody is always going to put that file in a different location, and then you get these awkward questions like "why is this var_run_t when it's stored under /opt/foobar/whatever?" Anyway, that should make things a little friendlier.

We've also got what I just call SELinux-adjacent changes. Some of these were actually changes in SELinux code, but they're really changes that affect how SELinux works with the rest of the world, and how the rest of the world works with SELinux. Probably one of the cooler things is that IMA — which I hope all of you are familiar with at this point — now has the ability to measure the SELinux state and policy, so you can include that as part of your attestation. We've also added perf trace points for SELinux AVC denials, with support for filtering there as well.
This is kind of neat in the sense that you can set a perf trace point on an SELinux AVC denial and see a backtrace, basically from the kernel all the way up through user space. If you're developing an application policy and you're perhaps not extremely familiar with the application, this can give you a nice holistic view of how everything went down, from the application into the kernel, so you can map that access denial to something up in the application. It's pretty handy, and the kernel commit message actually has instructions on how to use it — it's a really well written, well documented feature.

This was kind of a big deal a few years ago and now it's a little passé, but we're covering four years here: SELinux user space has been fully ported over to Python 3, so, yay. We've also added a systemd user service to run restorecon, to help you recover things a little quicker and easier.

And deprecations and removals. The first one I'm particularly excited about: the ability to remove the runtime disable functionality. Some of you will obviously bemoan that, and I'm sorry, but doing this allows us to harden a lot of the LSM infrastructure inside the kernel. The way SELinux did the disable operation, it basically unhooked itself from the kernel internals it used to apply access controls, which was fine; but since that functionality was added some 20 years ago, we've added capabilities to the kernel that allow us to mark all of those hook points as read-only, so that you can't forcibly remove an LSM. Because SELinux needed to preserve this backwards compatibility, we couldn't leverage that — but now that we've removed the runtime disable, we can harden those LSM hooks and make the kernel a little more resistant to attack. You can still disable SELinux on the kernel command line, or just not compile it in if you're not going to use it. Anyway, this took a while — it was us working with user space, working with the Linux distributions.
So this was really a good community effort that took many years, but we finally got it.

checkreqprot — that's something we added ages ago to work around some weirdness with libc on some systems. Nobody has really used it in probably well over a decade at this point, so we went through the deprecation process, and it's gone. It basically affected mmap protections: what was the application actually requesting versus what was actually applied by the kernel. Like I said, for some systems it was needed, which was a bit odd, so we wanted to get rid of it; now we're always applying policy based on the actual protections that are requested and applied in the kernel. We also deprecated and removed 46 policy modules in reference policy, to clean things up.

I can't list everybody's name here, but these are the top 20 contributors across the kernel, the user space, and the policy. I don't know if you're in the room — you can raise your hand. Nobody? Okay. Well, if you're watching this on video later, thank you very much. There are obviously more people than this, but these are the people who have really contributed a lot, and I think everyone here is very appreciative of that.

And lastly, get involved. If you want to participate in SELinux development, we would love to have you, and here are some links to get started. Like I said, we've done a lot of work trying to consolidate everything under this GitHub project, so go check it out — it's got our kernel mirror, it's got all the user space; the canonical kernel repository is under git.kernel.org. And finally, most of what we do is over mailing lists. We have two main mailing lists: the primary SELinux mailing list is where code development happens, and we also have a dedicated reference policy mailing list — so if you're primarily interested in SELinux from a policy development standpoint and don't necessarily care about the code, the reference policy mailing list is for you.
Any quick questions? Okay, great — does anybody have any quick questions? Does anybody have not-quick questions? All right, well, thank you very much. If you do have questions I'll be around all day today; feel free to grab me in the hallway or in the BoFs and we can have a chat. Thank you.

So — this is about arm64, BPF tracing, and how the way you patch a nop didn't work. I'm going to go into a bit of detail and explain what really happened there and how we went about fixing it. The BPF LSM has these default callback hooks — for example, I'm showing the bpf_lsm_file_open hook. It has a nop at the beginning, and what typically happens is that when a program is attached to that hook, you patch that nop to jump to the BPF trampoline, which then invokes the BPF LSM programs that are attached. The green part you see on the slide is JITed code, and on x86 this was all fine: you had the BPF trampoline call instruction — you could go patch that in with ftrace direct calls — then you execute the fentry programs (which you typically won't have on BPF LSM hooks, but for tracing you could attach them before the function is called), then you invoke, one after the other, all the LSM programs that are attached — this is where you enforce your BPF-LSM-based policy — and then you go back and call the original function, which here is a dummy hook, so it doesn't do anything. Then you invoke the fexit programs. This is how a tracing flow through a BPF tracing trampoline would look as well; there are a few flags which control how the trampoline behaves, but in a gist, this is how the thing works. On arm64 we had everything implemented — all the things required to invoke the LSM programs, the trampoline, the JIT, everything was up to date — except patching that nop at the beginning of the function.
The reason for that was that there was no ftrace direct call support on arm64. This also blocked other things, by the way — this is the prime real estate in a function: the fentry nop is how you typically jump from the function to a kprobe, to an ftrace entry point, to a BPF trampoline, or to a live-patched function. Prime real estate, and there was contention on how to go about patching it on arm64. This took a lot of time — I think there were seven or so attempts to get it done — and there were technical issues as well as what we called impedance-mismatch issues; but the impedance got matched and an efficient solution was found. The technical issue was that BL on arm64 has a limited range, so it was not possible to jump from this nop to any arbitrary function you're tracing; you needed to jump to a trampoline at a fixed offset, prepare the call stack correctly, and go from there. There were also worries about reliable stack traces, because you're bringing a new trampoline into the middle; those were addressed with the new trampoline that was added. And yes, the ftrace maintainers were concerned about the maintenance overhead of ftrace direct calls, but that was addressed too, with a lot of data — and now we have working tracing and LSMs on arm64.

Here's an update on what the community is doing with the BPF LSM. One of the things the BPF LSM enables — as we said when we first presented it — is flexible audit and policy enforcement in the same LSM layer. What we see is that people really like to disagree on what containers are — there's a container ID patch set that keeps going around — and here you can take your own definition of a container and implement your policy in a flexible way. How would you do that? You have a container manager, which is basically the authority in a particular container system on what a container looks like.
From the container manager, you can do a task-local-storage set: you can attach as much metadata as you want, accessible from any BPF LSM or tracing program, and then you can use the identity the container manager knows about in your audit logs and in your policy enforcement. So you can set blobs — this is something the KTD project does, and other projects follow this path as well — and you can do selective enforcement, enforcing certain policies only on certain containers. All the disagreement about what a container is becomes, basically, a flexible implementation detail, and people can go their own way. We call this user-driven policy: if you have a use case, you can write your policy specific to your own implementation.

As was mentioned, systemd is a similar situation: people don't agree on a very core kernel feature, but things need to move forward, so they implemented the RestrictFileSystems= primitive in systemd. It's a very simple BPF program: you read the magic number from the superblock and check whether it's in the allowlist — or you can base it on a denylist — and the policy is implemented. In the previous LSM days — we have small LSMs in the kernel that are not major LSMs — this might have been an LSM, or something else in the filesystem space; here it's a few lines of code implemented using BPF. There are other projects too. Alexei said there are few open source implementations, but there are a few coming up — at least a couple — if we address this bit here. And I'm going to talk in detail about why this overhead exists, what its impact is, and the current progress on fixing it in the kernel. As a refresher, here's some assembly code on a slide; the main thing to notice is that it's the disassembly of what the main calling point for an LSM hook looks like.
Before we look at the patches I've sent to the mailing list: you have a linked list, and the linked list contains pointers to the various LSM hooks — there's SELinux, there's AppArmor, there's BPF. You iterate over the linked list, load the address of a callback into a register, and then call the address in that register. That's called an indirect branch, and it uses a different predictor in the CPU, called the indirect branch predictor or dynamic branch predictor — different CPU implementations call it different things. This is susceptible to Spectre v2, the branch target injection attack, and since then there have been mitigations to prevent that from happening. If you folks don't remember what Spectre v2 was: while the CPU stalls here trying to figure out what address it's supposed to jump to, it gets tricked into going somewhere else, loading secrets into the cache — we call those dependent loads — and then the attacker times the cache to leak the value of the secret through a side channel. This is bad stuff, and it has been fixed in the kernel: when you boot with mitigations for Spectre v2 enabled, this retpoline thing tricks the CPU into not speculating — and I'll show you how that's done. That's what the thunk looks like. If you look at what's happening here: the CPU executes a call instruction and speculatively executes beyond set_up_target; it thinks it's in an infinite loop there — pause, lfence, jump back — so the CPU's speculative execution engine starts executing that infinite loop, and when it actually jumps to set_up_target, we yank the register — RAX, or R11 there — onto the stack pointer and return through it. So basically the CPU thinks it's executing a while loop, and it's not going to use the attacker-controlled prediction.
With the retpoline in place, the branch target injection primitive is gone. The impact on performance: any speculative execution can lead to a side channel, but speculation is also really required for the CPU to perform well — if the CPU can't know where to go next, the whole pipeline ends up getting flushed. There's a multi-stage execution pipeline, and the front end of the CPU, which is responsible for pulling instructions from memory, stalls here, realizes "I was about to execute an infinite loop — what happened?", and has to throw away all the instructions it was fetching. That's what you call a branch miss. So the summary: currently in the kernel, LSM callbacks are indirect function calls; indirect function calls are susceptible to branch target injection; and the kernel uses retpolines to prevent branch target injection. And — this is the really important bit — since last year, newer Intel CPUs have implemented enhanced IBRS. eIBRS is a mitigation that partitions the branch target buffer so that user space cannot influence predictions in kernel space — between privilege levels, effectively. But then a new attack appeared that occurs with eIBRS and retpolines together, so where we thought retpolines would be gone on newer CPUs and we wouldn't carry that overhead going forward, it's still there, and it's still expensive.

The solution, if you think about it, is actually pretty simple — the implementation gets a little complex, but from a principle standpoint it's simple. We know what LSMs the kernel has, and we know the order of the LSMs at early boot time. There are only two ways you can change the LSM order: you can change the CONFIG_LSM parameter in Kconfig, or you can pass lsm= with the list of LSMs on the kernel command line. One solution would be to drop the lsm= parameter, but then you would have to recompile the kernel each time you needed to change the LSM order, which is unacceptable.
The other thing you could do is this: at early boot, now that the order has been finalized, you go to these call sites and, instead of having an indirect call, you patch the call sites — some sort of dynamic code generation. Do we need to implement all that dynamic code generation just for the LSMs? No. There are a lot of other areas in the kernel that are sensitive to this performance overhead — think networking, think KVM — so the kernel already has something called static calls, plus this thing in alternatives.c that looks at where the static call sites are, says "I didn't know the address at compile time, but now I know it," goes to those places, and puts call instructions directly into the text section. This is what it looks like afterwards — this is actually a /proc/kcore disassembly of what the function looks like around bprm_committed_creds. There's effectively a five-byte nop at the beginning, and once the kernel boots and knows the address, it changes into a call instruction carrying the address of the LSM's bprm_committed_creds callback — SELinux's, say. There's no speculation via the indirect branch predictor here, so we're not doing anything with retpolines at all. And this is worth doing — I'll show numbers for why it's worth doing.

But why is it tricky? First, everything needs to be laid out at compile time: those slots you saw, the ones that get changed into call instructions, need to be put there at compile time. And we don't want to put a slot there for every LSM — there are 10 to 12 LSMs in the kernel — and keep generating code we don't need. So the question is: what's the maximum number of LSMs you could actually end up compiling into the kernel? You may not be compiling in something like Yama, or if you're using one major LSM you may not be compiling in another.
So you also want some macro magic to figure that out; if you're interested in how that macro magic works, the series is on the Linux Security Module mailing list, and it's fun.

So does it have an impact? I executed a syscall — eventfd create, I think — for about a thousand iterations, did a perf record, and looked at how many branch misses we were getting in the retpoline case versus the static call case. As you can see, branch misses went down by about 200,000 — which basically means the CPU's front end is not stopping and waiting for instructions 200,000 times. The branch loads were reduced as well, and so were the branch load misses. Those are the performance counter statistics, but what does it really mean for benchmarks? There was a lot of improvement in exec throughput and in pipe throughput: anything that is syscall-heavy — anything syscall-heavy with an LSM hook in the path — is going to benefit from this. That's performance improvement that's up for the taking, and not just for the BPF LSM: for the whole LSM framework.

Can you folks guess what this is? Any idea?
The number of SELinux denials? Probably very close — but I think it's what Paul said: there are 3.3 billion devices with SELinux installed, and these are roughly the branch misses happening per second across the world's SELinux devices. We can get rid of those once these patches are applied.

There's one other problem we need to solve, particular to the BPF LSM: as I said, there's a default callback for every LSM hook, and it returns a default value — which means it's making a default policy decision when it shouldn't be. The feedback was that this needs to be split into a separate patch. With it, the BPF LSM callback is toggled only when a program is attached, so that side effect is gone, and there's no extra overhead either, because there's a static key there. A static key is another fancy construct — if you know about jump labels, it's the modern name for jump labels.

This is my last discussion point: BPF is a major LSM, and we need to empower the community to write and contribute more LSM hooks to the kernel. There's a lot of intellectual capital that now has access to the LSM framework, and there's feedback along the lines of "maybe if we had another LSM hook at this particular point in the kernel, it would be good for auditing or for a policy decision." This is very much up for debate, but my request is: let's not require these people to contribute a non-BPF-based implementation. I think we should require them to contribute an implementation that exercises the policy in the mainline kernel — a reference implementation, a self-test, or something — but requiring them to contribute an implementation for an LSM framework they're not familiar with just raises the barrier, to the point where people say "okay, I'll just use tracing or fault injection to do my policy enforcement." That's bad for stability, and it's also bad for the cross-pollination of information between the two communities.
So with this, thank you all for listening. Thanks — any questions? Thank you.