All right, hi. Last time I spoke to this crowd was two years ago in Toronto. So who am I? Richard Guy Briggs. And yeah, James pronounced it right; most people who've read my name on the internet assume that I'm Anglophone and get the middle name wrong. I live in Ottawa, raised in Ottawa, one of the few who's actually still here, a native. Started off doing Commodore PET stuff, assembly language, all machine language stuff really, because I was going right to the sources and hand-coding stuff. Then graduated to some PDP-11 and Fortran stuff. Computer science initially at the University of Ottawa, but then switched over to computer engineering. Got involved with FreeS/WAN; that's really where I widened my horizons and got involved in the Linux community. That was pretty significant in terms of also introducing me to the wider world of security and the IETF and standards. That was quite a fun project; I worked on it for six years. Then I got involved with a local company in Ottawa doing imaging device drivers, again in the kernel. I've been with Red Hat since 2012. Online I'm known as RGB or Sun Racer, having been involved in solar car racing, which is where the Sun Racer handle came from. Weird bike guy. So this Humpty Dumpty bit, that's related to this appendage that I'm carrying with me. Let's see if I can... So that sort of quickly explains what this is all about. That happened about 200 feet in front of my house, on a bridge that crosses the river right in front. The picture on the left was taken about six days before the pictures on the upper right and the bottom right. I'm surprised I didn't actually set off the metal detector in the airport; in fact, it was my bike cleats that set it off. So what's audit? It was introduced by Rik Faith of Red Hat in 2004, and since then it's been added to, enhanced, fixed, repaired, and patched. It's basically secure logging that's embedded in the kernel itself.
So the idea here is that syslog is not a secure logging system, and what's been attempted with audit is to create a system where at least some of the logs can be used in a court of law, to be able to say this attack happened, we know that it happened at this particular time, and to track things down a little more assuredly. It works well with other LSMs. In particular, SELinux is the one it has been most closely tied to, but certainly other LSMs make good use of it. There's a user space daemon, and it logs either to disk or to the network. There are configurable kernel filters, so you can set up whatever filters you need in the kernel itself to catch certain events, so that user space processes aren't able to circumvent any of those rules or detections. There are also messages that can come from various user daemons, which are able to log into the audit system as well. So it only reports behavior; it doesn't actually interfere and get in the way. Steve Grubb, one of our colleagues, has been working on a related intrusion detection system. It would take information that's generated by audit, parse it, and then be able to go and act on it. But those are external to audit itself, and it actually uses audit as a mechanism to trigger it. So the next problem is: what are containers? There are many definitions, and there are many people out there who've been trying to solve this problem of what a container is, so various subsets of namespaces, seccomp, and cgroups have been used to create containers. Unfortunately, there is no consensus in the community as to what baseline of namespaces is required to do this.
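Those configurable kernel filters are loaded from user space as audit rules. As a rough illustration of the standard rule syntax, here are two rules as they might appear in an audit rules file; the watched path and key names are just examples, not anything from the talk:

```
# Watch /etc/passwd for writes and attribute changes, tagged for later searches
-w /etc/passwd -p wa -k passwd-changes

# Log every execve by a real logged-in (non-system) user
-a always,exit -F arch=b64 -S execve -F auid>=1000 -F auid!=unset -k user-exec
```

The `-k` key is what you'd later hand to `ausearch -k` to pull out all matching events.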
If we did have a baseline of "this is the minimum set of namespaces we need to make a container," we could have used that information, said, okay, we can use that as an identifier, and gone from there. Unfortunately, that's not the case. The kernel has no concept of what a container is, so we're looking for some help. It's kind of like, okay, we need to log some event that happened; we want to know what container it happened in, if it happened in any container at all, or whether it was the host itself that somehow has a rogue process. So the next step was: we know that the container manager, the orchestrator, knows this information, so it needs to report it. At the previous talk two years ago, I came up with the question of: do we go with a container ID or a collection of namespace IDs? There had been previous work. All right, back up a second here. What's the problem? As far as audit is concerned, there was, you know, the Highlander phrase: there can be only one. Previously with audit, you'd start up an audit daemon, and if you started up a second one, it would basically disconnect the first one, or ignore it and continue. That's being fixed now: if you try to start up a second one, it'll say no, you can't, there's already one running, unless that one has somehow died, in which case it'll replace it. The problems we were trying to avoid there were orphaning earlier daemons or blocking out newer ones. Audit itself is not able to trace the task that caused a particular event back to a specific container. We had looked at using a combination of namespaces: okay, this collection of namespaces was responsible for this particular event, so we should be able to trace it back and say, I think it must be this container over here, because this combination was registered by an orchestrator. That gets really complicated.
I guess there are some arguments that have been made that figuring all this out is a user space job, but it just got really too complicated, and we didn't have the certainty or the assurance that that really was the case. So we tried to find another way. The idea here is to make security claims about containers, because we're getting more and more people asking for this particular approach: we want to run a container, but we want to be able to make assurances about what those containers are doing and where they came from. So it's part of the whole tracking mechanism that we're trying to set up. We also need it to filter logging itself: there may be certain containers we're not concerned about, or certain events that we don't care about. We want to be able to filter the logging itself to reduce the volume we have to deal with, because that can create a denial of service attack in terms of logging stuff to disk. The other aspect, of course, is doing searches. If we've got the container ID, we can do a search on that particular container ID and pull up all of the events that are related to it, or perhaps not related to it. Further down the road, we're going to be looking at how to route audit messages to different daemons. Right now there is only one audit daemon, but we're still doing some architecting and planning. My colleague Paul Moore, sitting right in front of me here, is the person I'm working with on that, trying to come up with a design or a plan for how to route those messages to different audit daemons and allow more than one daemon to run at the same time. Each could take care of a subset of containers, so we could say, okay, this daemon here is responsible for that pod or whatever, and have a bit more flexibility in terms of managing routing and where that stuff's going.
So the conclusion there was that namespace identifier (NSID) tracking was too complex, and it was incomplete. In terms of history, it goes back a bit more than five years. One of our colleagues, Aristeu Rozanski, had come up with a proc inode ID for each namespace and tried to promote that. There were some issues; the namespace folks said that was insufficient, so we added the device ID to try to nail it down. Then the idea came up to use a serial number within the namespace instead of the proc inode, because there were reservations about the proc inode: we would like to reserve the right to change the meaning of this if things are migrated, or to change it some other time. So a serial number came up, which I prototyped. It was eventually discarded; there were concessions that, okay, this is probably getting too complicated, we need to take a step back. Meanwhile, Al Viro reworked it for namespace file systems and then eventually moved on and abandoned the namespace ID as impractical and insufficient. I have updated the namespace ID patch set; there's still some use for it, but it is insufficient for the core job of tracking container problems. So namespace concerns are still there; it's just that they're not the primary concern of container identification in audit events. The conclusion from before was that there weren't any issues with four of the namespaces. The network namespaces were okay for now, but we're going to need to do some more work on them because of the need to have multiple audit daemons. We were talking last time about having the audit daemons tied to a user namespace so that we could have basically one audit daemon per container, and after some discussion and more wrangling on mailing lists, we determined that that really is not going to be particularly practical. PID namespaces were also okay; we'll need to do proper translation, but I think we understand that problem reasonably well.
So yeah, namespace versus container ID: I think over the last couple of years we've come to the conclusion that we have to abandon the previous approach. The thing that has remained constant is that at an upper layer, beyond the host itself, we'll need higher-level orchestrators, the ones managing multiple hosts, to coordinate, amalgamate those logs, match them up, aggregate them, and deal with all of that. So, the changes since the previous proposal: containers can't be universally identified by a namespace subset, so we have to move on from there. The audit daemon won't be tied to any namespace, because there isn't any subset that can be reliably nailed down as being "this is a container." The network namespace needs a list of possible container IDs responsible for network events. The problem here is that you've got some events that can come in from the network that may not be associated with a task yet, so an event is kind of floating there without a responsible parent, without a chaperone. We basically need to say, okay, this is the list of possible containers that are all sharing this network namespace, so it could be one of these that's responsible for this network event. We looked at other namespaces, and it looks unlikely that we're going to need to do this for any others, but the code is generalized enough that if some other events come up that don't have any task associated with them, we should be able to adapt the code to add that functionality relatively easily. So yeah, the namespace identifiers are still potentially useful for other audit logging, but they're not pivotal at this point, and they're not reliable for what we're trying to do. One other minor thing that's come up is looking at the task struct.
There are now three audit parameters in the task struct itself, and it became evident that it made sense to group them all together behind one pointer so that we could abstract them away. This can solve some kABI issues for distributions, so that they're not going to have to worry about how the task struct changes, and it abstracts the audit stuff away entirely within the audit subsystem itself. Some extra functions have been added to get the audit information as necessary for other subsystems. So there have been three revisions of the design for the audit container identifier and four revisions of the code. There's a fifth one that I'm eager to pull the trigger on, but Paul needs to find some CPU cycles to review patch set four. In terms of access controls, we don't want the container ID to be unsettable: once it's set, it's basically stuck there and can't be changed. We started off with only write access control, limited by CAP_AUDIT_CONTROL. There's been a concern about this particular identifier being abused by other subsystems; we're not trying to be terribly creative about how it could be abused, but we want to be very careful about it, so we've added read access control as well. It's basic at this point: in the proc file system, under each process ID, there'll be a new entry called audit_containerid. You'd do a write into there to set the container's audit container identifier, and then you'd use the same file to read it back, but only if you've got permission. Other limitations: we don't want it to be possible to play games with a child that's been set with a container ID turning around and setting its parent's container ID. At this point it looks like a sufficient access control for that would be to prevent it being set if the task has already spawned a thread or spawned a child.
One thing I don't think there's going to be any argument with down the road is that a child inherits its parent's audit container identifier. Initially we were talking about restricting it so that once it's been set, it can't be set a second time; if it was inherited from the parent, there was a mechanism to still allow it to be set once. Initially that worked by comparing the parent's container identifier with the child's, and if they were identical, allowing it to be set again. That was changed around to actually look at an inheritance flag, and that's since been removed, because some of the concerns about setting the container identifier a second time have been questioned. There's another angle to it, which is to possibly restrict the setting of the audit container identifier to one of the children of the orchestrator itself, so that the orchestrator can't just go outside of its own tree and start setting it on other children, because you might actually have more than one orchestrator running on a system. Seems unlikely, but that's the kind of thing we've considered. We haven't made that decision yet. So the last point is about disabling setting the container identifier twice, and that restriction is being removed; at this point it looks like we will be able to set it a second time. In terms of what the identifier would be: we started off, I don't remember exactly how, I think I might have actually started off at u64 in some of the discussions, but then it went back and forth a number of times. A u128 seemed to make most sense, because a UUID is 128 bits, which is what a lot of container orchestrators are using, but that carries enough overhead that we were a bit concerned about it. Paul really would like to see a u32, but has conceded that a u64 should give us enough bits to play with so that collisions are unlikely.
At the upper end of things, a 36-character string was also considered, but that looks to be far too large. If we're doing logging, and we've got an audit record for every event, that's going to chew up a fair amount of bandwidth. We wanted to make it just large enough to avoid collisions, but not so big that it consumes too much bandwidth. Part of the argument here is that we should be able to enlarge it to a u128 in the future if that's really deemed necessary, and it shouldn't break things. In terms of records, there are two new records being proposed. One is the initial record, emitted when an audit container identifier is first set. It gives some of the background: who's the parent, who's the container orchestrator that's setting this, what's the target PID, and some other information about the circumstances, to nail down and identify who the players are. The other record would be an auxiliary audit container record attached to syscall events, but only if the container identifier is set; if it's not set, the record simply doesn't show up. There's a new field being added for kernel/user-space communication. It's a u64, and I guess this dives into some of the implementation details, but we've only got u32s available, so I sort of welded two of them together to make it work. I'm fairly confident that there won't be a problem; otherwise I would have had to change the interface and pass it in as a string, which seems far messier. And of course we need to add and delete container identifiers from network namespaces. When a task gets a container identifier, we look at its network namespace and add the container identifier to that network namespace, so that if an event happens, it's able to list all of the potential containers that are involved.
If there's a second process in the same network namespace, then its identifier gets added to the list, and if an event happens, the kernel will go through and itemize each of the potential container identifiers involved in that particular event. What remains to be addressed is how we're going to allow multiple audit daemons to run on one machine. At this point we would have to solve some of the network namespace issues; they don't seem too daunting, but there are some concerns we'll have to take care of. Each daemon would have its own rule set and its own queue. So if you've got a separate audit daemon running, it's going to monitor a set of things, and it could overlap with other audit daemons; that's fine, each one will receive those messages as required. The other big one is that auxiliary audit daemons must not be able to affect the host configuration or the host audit daemon. Right now, when audit starts up, there are a number of parameters that are set, and they influence the host itself; those have to be disabled in the auxiliary daemons. This isn't a significant challenge. The next one is how to assign and route audit messages by container identifier, and this requires some architecture work that we haven't looked at particularly seriously yet. It's a matter of setting up a configuration file for each audit daemon that is not going to interfere with or tamper with the host itself. Something that Paul has mentioned recently is the need for LSM hooks to be able to set the container identifier on various tasks. Those are most of the concerns at this point. So, conclusion: the namespace identifier was infeasible for tracking containers itself, and we had to move beyond that to find something that was easier to implement and a lot easier to track. The 64-bit unsigned integer balances kernel efficiency with uniqueness.
Coming back to that: a u64 is a single compare operation in the kernel, whereas a u128 would have been multiple compares. If we're trying to manage lists of audit daemons and routing and that kind of thing, those compares could add up. So we've now got a new record for each of the creation and routing events that happen, and there's now a filter in place for filtering on that container identifier. The net-namespace-isolated events get special treatment. If other things of that kind come along, they could get the same treatment; the only other thing I'm thinking of that could be in that sort of class is some kind of hardware failure, where maybe you've got a disk that gets a bad sector or something like that, and it's going to throw something that's monitored by audit, doesn't belong to any particular task, and needs to report an event. And again, at a higher level, these audit logs would need to be aggregated by an orchestrator above the host itself, and that orchestrator would keep track of all of these different IDs across the various hosts. That's it. Contact information: you can reach me at Red Hat, rgb at redhat, or at home, rgb at tricolour. The linux-audit mailing list is the canonical place to raise questions about this and get involved in development. There's also a Freenode #audit channel if you've got questions, but I'm the only one from Red Hat in there, so I'll probably redirect you to the mailing list. Anyway, there it is. Questions? Casey? Yeah, I know. Anybody else? What about nested containers? What about nested containers? A container that throws off a container that throws off a container, and people do weird things like that?
Yeah, we're expecting that, and the logs should be able to elucidate that story. That's a tracking problem we'll punt to user space; basically the information's there that this particular orchestrator spawned a sub-orchestrator process, and it will have the container IDs of both the parent and the child, and when that one goes to spawn a new one, that information will be in there. Is that sufficiently answered? Good. This is actually a bit of a contentious question, because it had come up before and was the subject of, and influenced, some of the design decisions. What kind of overhead does auditd and its associated management introduce? I don't have those numbers, but it's been around for more than a decade, and there have been improvements over time. So like everything, there's no good, easy answer for that; I can't just say 4%. One of the things we were just talking about is that over the past couple of years, the queuing mechanism inside the kernel for audit event generation has gotten much better. It used to be we were doing a lot of awful things, sending netlink messages up on syscall exit. We've now moved that off to a separate thread inside the kernel, which should improve things quite a bit: a separate thread does all of your netlink messages up to the audit daemon so that they can be collected and written out to disk. But in general, auditd is really just a collector. We don't want to write to disk directly from the kernel; that's bad. So auditd itself doesn't really introduce any meaningful overhead. It's all going to be the audit processing inside the kernel, and what that overhead is going to be for any individual operation depends entirely on what your audit configuration is. If you're generating audit records for every syscall on the system, guess what? The overhead is going to be pretty bad. The good news is that now, that's fine.
You can actually start up the system and it will operate, very slowly, but it will operate, and you can then shut it down. It used to be that that wasn't the case. But anyway, there are filters that you can put into the kernel that take effect at event generation time, so you can mitigate that overhead to whatever you want it to be. Like everything in security, there's a big knob: how much information do you want, and what are you willing to pay for it? You can filter out individual events or classes of events. You can also filter out records: if there are certain records you don't want to see, the record itself won't be generated while the rest of the event still shows up. Currently there's pretty heavy use of printf in those messages, and there's a lot of overhead associated with that. So down the road, we're looking at changing the API for audit to give binary information in a more organized way, so that it causes fewer problems for the user space parsers looking for patterns in the logs. That's ongoing work. Somebody at the back. So do you have buy-in from the other kernel developers for this container ID? The primary one I was concerned with is Eric Biederman, who's the namespace guy; he had some pretty strong opinions about this stuff. I'm trying to think who else was pretty vocal about it. Most of the rest were user space and orchestrator and library folks who mostly had opinions about the identifier itself, about its size. Like I said, it's gone through seven revisions of namespace identifiers and now four revisions of the container identifier, and we've addressed pretty much all of the concerns that any of the kernel developers have had. I guess I would like to get more involvement from the container orchestrator folks, because they're the ones ultimately going to have to deal with this and use it. We're creating this intending it to be used only by audit.
If you look at the mailing list threads, there's a lot of discussion back and forth, and Richard touched upon this a little bit: we didn't even want to allow user space to read the value out, because people are worried that others will take this and run with it and use it for things other than audit. So we're taking a lot of steps to prevent that from happening, and this is one of the reasons we've been able to get buy-in this time around from other kernel developers: we're narrowing the scope sufficiently that other people are less concerned about how it might be abused. So, I hate to rehash one of the flames we had during that mailing list thread, but one of the ideas proposed was to finally introduce ptags, so that instead of having a specific feature for tagging containers for audit, audit would be able to be a consumer of ptags. Then it wouldn't be all this special-casing where we have to worry about what the kernel really sees as a container. Has that been looked at since then, or is it sort of... Discarded, because ptags isn't in the mainline kernel, and ptags is in the LSM. On both counts, because there's a dependence there on something that's not part of core kernel functionality, it's not going to be very useful to us or reliable. Does that answer the question? Any more questions? Over here again. So with regards to the log aggregation at the orchestrator level, will that data be normalized so that it can be ingested by other systems?
There's already been work done to normalize audit logs anyway, and that's been a lot of the preparatory work leading up to this. I don't want to dismiss it as a distraction, but that's been a lot of the distraction that's led up to this point, because we've had a lot of work to do there simply to make all of the events parsable. That's been an interest of the audit user space tool maintainer anyway: making sure we're able to parse all of these events, because those events are eventually going to be parsed by other tools even further up. So cleaning up our act there has helped at other layers anyway. I think that's it. Thanks.