My name is Daniel Feinberg. I'm from Pantheon Systems, where David Strauss is from. We do Drupal and WordPress hosting, and some of our customers don't necessarily have Git skills, so we deal with a wide range of skill levels among the customers who host their Drupal and WordPress websites with us. I'm a senior software engineer at Pantheon, and we work with systemd quite a bit, along with Kubernetes and Docker. We don't have as many servers as Facebook, but we have about 700. My Twitter handle and my blog, which is actually hosted at Pantheon, are up on the slide.

I'm going to talk about monitoring file system changes. Our customers upload their Drupal and WordPress code to our systems, and we present them with the changes they've made so they can approve them and merge them into their dev, test, and live environments. Our dashboard looks a little like this. You can see that my site here is in SFTP mode versus Git mode. Developers who can use Git can just push their changes up and see their diffs on their own. But for our SFTP clients who don't use Git, we need to present them with the changes and let them commit through a workflow that requires no Git knowledge or command-line knowledge. So if I make a change over SFTP up on our server, we get a message that there's a change, and the change shows up there.

We use systemd for our customer containers, we have many of them on each machine, and it's all very nested. There are also directories we have to ignore: we only want to track changes to their code. We don't want to track every time they upload a new picture for their blog; that's unnecessary. So we have deep nesting and many directories, and we want to present customers with the changes. We also get the diffstat from Git, which is how they can see each individual file's changes.

When I arrived at Pantheon there was already a system in place, and the dashboard it was built for said we will present you with your changes within five to ten seconds. So when we rebuilt this system, that became our SLA to the customer, and the 95% target gives us the ability to filter out failures caused by customer mistakes. Nested Git repos can cause problems here, and so can codebases so large that Git can't possibly diff them in a reasonable amount of time. Those scenarios are why we measure success at the 95th percentile.

We did an initial evaluation of solutions. The system in place when we began used inotify: a little Python script sat with inotify and monitored each container. That meant a sidecar container for every single one of our customers, eating up resources on our nodes that weren't going toward customer workloads, so it was very resource intensive. On system startup we'd have a raging horde of all of these starting up at once and causing our systems to crash. We also hit a lot of inotify watch exhaustion. In addition, inotify doesn't support recursive watches on nested file system structures, or watches across multiple mount points; there's a sketch of that limitation below. We then did a proof of concept with fanotify and found right away that it doesn't support certain file system operations that were critical to our use case.
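To make that inotify limitation concrete, here's a minimal Go sketch, with a hypothetical path and nothing Pantheon-specific, of what recursive watching forces on you: inotify has no recursive mode, so you register one watch per directory by walking the tree, and you must re-walk whenever new directories appear.

```go
// A minimal sketch of recursive inotify watching. Deeply nested
// customer codebases multiply watches fast and can exhaust the
// fs.inotify.max_user_watches limit.
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// watchTree adds one watch descriptor for every directory under root;
// there is no single "watch this whole subtree" call.
func watchTree(root string) (int, error) {
	fd, err := unix.InotifyInit1(0)
	if err != nil {
		return -1, err
	}
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || !d.IsDir() {
			return err
		}
		_, werr := unix.InotifyAddWatch(fd, path,
			unix.IN_CREATE|unix.IN_DELETE|unix.IN_MODIFY|unix.IN_MOVE)
		return werr
	})
	return fd, err
}

func main() {
	fd, err := watchTree("/srv/containers/example/code") // hypothetical path
	fmt.Println("inotify fd:", fd, "err:", err)
	if fd >= 0 {
		unix.Close(fd)
	}
}
```

Multiply one such walker by hundreds of customer containers per node and the sidecar resource cost described above follows directly.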
And then there was the thought of using a FUSE file system plugin, but that came with latency and other problems, so it was out as well. That brought us to auditd. We were toying with BPF, and some of us really liked that idea too, but we didn't get to make it happen in this implementation. Audit gave us fine-grained detail, more detail than we actually need, and it was already there: luckily, audit is in the kernel on all of our systems. It allowed us to watch whatever we wanted to watch, and the learning curve for our engineers to build audit rules was less steep than for BPF. We don't have a lot of C engineers; David here is one of two. It also left the door open to using this for other kinds of auditing in the future.

To make sure we're all on the same page, for those who don't know how the audit system works with the kernel: it's a kernel component, and it communicates with user space through a netlink socket. The netlink socket has buffers, and through it we can write rules and read log events for file system operations from kauditd in the kernel.

Slack HQ had an open-source project called go-audit, which didn't quite meet our needs but was good enough for us to start with, and we iterated on that. The reason we chose to do this in Go instead of using the traditional auditd is that we have a lot of Go developers, and it was easier to extend go-audit to get the functionality we wanted than to wire auditd through file system sockets into other pieces. It also gave us the ability to increase our telemetry and visibility. Our old system had zero visibility: every time there was a failure, we didn't know why, and we had no way of knowing whether we were succeeding regularly or not. This system gave us visibility into each of the different pieces.

For those who don't know, audit delivers messages in groups: for each operation there's a message group that has many messages in it. The netlink socket feeds them to our client, and then we marshal them. go-audit gives us JSON output, which auditd does not, at least not without piping it through sockets into other systems that turn it into JSON. We also get a YAML-based configuration, which is right in line with the rest of our systems. And we were able to extend the output plugins: you'll see in a minute that one of our objectives was to get this work off of the servers as fast as possible, so we wrote an output plugin that writes to GCP's Pub/Sub message queue; there's a sketch of that idea below. So telemetry, JSON, YAML, maintainability, extensibility, and visibility were really the things that led us to go-audit and to modifying it.

Just to show what these things look like: the rules are the same as if you were using auditctl to add them, but now we can specify them in YAML files, which is easier for us to manage with our Chef-based infrastructure for our nodes. The final rule here is the interesting one: it's what monitors those file system requests. Here we're specifying the directory; the -F entries are the filters for your rules, the "where," and that directory is where we host all of our customer containers. Then there are the file system operations, and this rule says we only care about write, but that is a very broad filter.
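To give a feel for the output side, here's a minimal Go sketch of a Pub/Sub output writer. The io.Writer shape, project ID, topic name, and synchronous publish are illustrative assumptions, not our actual plugin code:

```go
// Hypothetical sketch of an output plugin that ships audit message
// groups to a Google Cloud Pub/Sub topic.
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

type pubsubWriter struct {
	topic *pubsub.Topic
}

func newPubsubWriter(ctx context.Context, project, topic string) (*pubsubWriter, error) {
	client, err := pubsub.NewClient(ctx, project)
	if err != nil {
		return nil, err
	}
	return &pubsubWriter{topic: client.Topic(topic)}, nil
}

// Write publishes one JSON-marshaled message group per call. A real
// output would batch and publish asynchronously from worker goroutines;
// this blocks on each publish for simplicity.
func (w *pubsubWriter) Write(p []byte) (int, error) {
	msg := &pubsub.Message{Data: append([]byte(nil), p...)}
	res := w.topic.Publish(context.Background(), msg)
	if _, err := res.Get(context.Background()); err != nil {
		return 0, err
	}
	return len(p), nil
}

func main() {
	ctx := context.Background()
	w, err := newPubsubWriter(ctx, "my-project", "code-change-events") // hypothetical names
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write([]byte(`{"sequence": 1}`)); err != nil {
		log.Fatal(err)
	}
}
```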
The write filter actually covers deletes, updates, and creates, including moves, which is exactly what we care about. Then we only keep successes, and we filter out root: in this case we're not monitoring the Pantheon employee, we call them Pantheors, who is running root commands. We're looking at just our customers, who run as a container user, so we wanted to exclude root.

go-audit also gives us the ability to filter with regular expressions. We were able to get a certain distance with the -F filters in the rule itself, but we couldn't get all the way. So we built on top of what go-audit already had, which was filtering by syscall: if a message was a specific kind of system call, you could run a regular expression over it. We added filtering by rule key, the -k you give when creating a rule. Originally we started with all of the filters in the go-audit process, in user space, and found that was too latent, so we took the static ones that didn't require regular expressions and moved them into the -F rule filters. Also, you can only filter on certain fields of the messages with -F in a rule, and that filtering happens at the kauditd level rather than in user space. The filters shown here run in go-audit in user space; there's a small sketch of this stage below. For example, all of our customer users fall within a certain range of EUID numbers, so we wrote a regular expression to filter on that. Another example: we didn't want to monitor Chef, and we didn't want to monitor our customers' picture uploads in their directories. That would be an immense amount of information. And finally we have a default rule at the end that drops anything that wasn't explicitly kept.

There are a couple of others here. Two of them match the customer code directory, the container ID slash code, which is where customers keep their code. We need two rules for that because, depending on the customer's current working directory, you get different values for the name field: it might be a directory structure, a full path, or just the file name, with the CWD as another property on the message. We had to cover all of those bases.

This is a create message from the audit system, and you can see we have the ID of the container; again, we need regular expressions to match this, because you couldn't do it with the filters in the audit rule. And here's a delete message; these are just for you to get a feel. You can see the name field, and what's not shown is that there's a CWD carrying the actual directory, which is why we had to have multiple different rules. This is an update, and you can see there are two messages in it: one is the CWD, as in the earlier example, the other is the parent, and then you have the actual message.

So there were a lot of nuances. While the old system was still running, we did a massive collection of data to catalog all the different kinds of messages we would get and their structures and formats, which will hopefully become a comprehensive test suite for modifying the rules. That's very risky for us: right now we do a slow rollout, but it's still fairly risky to roll out rule changes without breaking things, because the message structures vary so much. We also tuned the system to tolerate dropped messages.
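As a rough illustration of that user-space filter stage, here's a Go sketch; the EUID range, path pattern, and field names are hypothetical stand-ins for the real YAML-configured rules:

```go
// Illustrative user-space filtering: keep only events whose EUID falls
// in the customer range and whose path resolves into a customer code
// directory, dropping everything else by default.
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Hypothetical customer EUID window and code-path pattern.
var (
	codePath = regexp.MustCompile(`^/containers/[0-9a-f-]+/code/`)
	minEUID  = 10000
	maxEUID  = 60000
)

type event struct {
	Key  string // the -k rule key the message was tagged with
	EUID string
	Name string // may be relative or a bare file name
	CWD  string // from the accompanying CWD message in the group
}

func keep(e event) bool {
	euid, err := strconv.Atoi(e.EUID)
	if err != nil || euid < minEUID || euid > maxEUID {
		return false // root and Pantheor activity is dropped here
	}
	path := e.Name
	if len(path) == 0 || path[0] != '/' {
		path = e.CWD + "/" + path // normalize relative names against CWD
	}
	return codePath.MatchString(path)
}

func main() {
	fmt.Println(keep(event{Key: "code-change", EUID: "12345",
		Name: "index.php", CWD: "/containers/abc123/code"}))
}
```

The normalization branch is the code version of needing two rules per directory: the name field is sometimes absolute and sometimes relative to the CWD message in the same group.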
So we run this on 700 servers. We monitor the dropped messages inside the kernel and also in the channels, which are how our goroutines communicate; they're effectively buffers as well. When the buffers are full, whether in the netlink socket or inside go-audit, we drop messages. Luckily our requirements can handle that. A customer very rarely uploads one file; it's usually an rsync of many files, so if we miss a single message out of 20 files that were uploaded or modified, it's not a huge deal. We'll still capture the fact that they've changed files and present them with the change.

We monitor all of those things and much, much more, and all of it is tunable: the buffer sizes, the number of workers reading from the netlink socket, the number of workers shipping to Pub/Sub to get the work off the machine, which we'll talk more about in a minute. For the buffer size, the net.core.rmem_max kernel parameter raises the maximum receive buffer, not the default. So we tell go-audit, "you have way more space in your netlink buffers, go for it, use it," but we don't tell TCP, for example, that it has more room, because that would blow up the memory footprint on the servers themselves. There's a small sketch of claiming that buffer below.

We have a lot of visibility into the system. This might be hard to see, it's a little washed out, but this is our total volume across all of our servers: we're receiving about 100,000 audit messages a second, most of which we don't care about. The second line there, in yellow, is the number of messages we're filtering out right at the go-audit level, in the filter pipeline: about 10,000 of the 100,000. The other two lines are the incoming and outgoing of our output module, which writes to GCP's Pub/Sub over HTTPS, and you can see it's really not much, between 1 and 10. So we're going from 100,000 audit messages a second to shipping about 10 per second to the next stage of the distributed system.

This is our latency chart, overall latency: from receiving a message to success, how long does that take? We don't care how long it takes to filter out the messages we're discarding; we care that the messages we do want get shipped out, and that happens in the hundreds of milliseconds, which is acceptable. The green line is the 99th percentile and the yellow line is the mean. Against a five-to-ten-second window to present to the customer, 250 milliseconds is very low.

And this is our dropped messages, which is actually kauditd, the kernel side, telling us the buffers are filling up. But again, it's about 2.5 messages a second, which we can afford to drop. If we have a different use case for the pipeline in the future, we can always tune rmem_max in the kernel and get this to zero very quickly. Slack actually runs their buffers at four times the size of ours, and we're processing more messages than they are.

On shipping off the machine: the primary goal was centralized processing. Our dream is everything in Kubernetes; it makes our life easy when we have one system to worry about versus multiple deployments and pipelines and all of that.
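Here's a minimal sketch of the buffer-claiming half of that tuning, with an illustrative size rather than our production value: once net.core.rmem_max is raised, the consumer still has to ask for the bigger SO_RCVBUF on its netlink socket.

```go
// Minimal sketch: request a larger receive buffer on an audit netlink
// socket. SO_RCVBUF requests are silently capped by net.core.rmem_max,
// which is why the sysctl raises the maximum without touching the
// default used by, say, TCP sockets.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_AUDIT)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)

	const want = 16 * 1024 * 1024 // illustrative size, not Pantheon's value
	if err := unix.SetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_RCVBUF, want); err != nil {
		log.Fatal(err)
	}
	got, _ := unix.GetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_RCVBUF)
	// The kernel reports back double the granted value for bookkeeping
	// overhead; if it's far below 2*want, rmem_max is the ceiling.
	log.Printf("requested %d, kernel granted %d", want, got)
}
```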
So you can see that we have our servers running our Pantheon audit shipper, and we ship to Pub/Sub in Google. We actually have a proxy service that enforces topic-to-OU mappings based on our certificates; we do mTLS for everything. Those organizational-unit-to-topic mappings are there for security: we consider all of our customer code execution endpoints to be untrusted, so this prevents our app servers, DB servers, and other servers from ever writing to Pub/Sub topics that belong to internal services.

From there the messages go to a topic, and anything can listen to these topics: we could write any number of services that subscribe to any number of business-domain topics, one per rule. The way we specify the topic for a rule is with -k: you can give an audit rule a name with -k, the key, and when you do, all of that rule's audit messages are tagged with the key, and we ship them to a topic with the same name. So all the way through the system, the audit key ties a message back to its rule. Then we have a service that sits in Kubernetes, listens to Pub/Sub, deduplicates, and windows the messages: we aggregate over a five-second window and then update the user for each container; there's a sketch of that consumer below. Each of these pieces now scales independently, which brings balance: we no longer have to worry about whether too many people are using the SFTP service on one server. It allows us to distribute the capture work, centralize the processing, and keep everything separate.

We learned a lot in doing this project. We learned that audit can cause kernel panics. We also learned that journald in systemd likes to do auditing as well, so we had to mask the socket that feeds audit messages from kauditd to journald. We don't want duplicate messages, and we don't want to swamp our logging system with these audit messages; 100,000 messages a second going into our logging system would only make things worse. We saw those kernel panics, but they mostly went away as we tuned the tunables, dealt with the back pressure, and created more workers to keep up with the load.

I was talking earlier about how dangerous it is to change the rules. The other thing we found is that we needed slow rollouts, crawl-walk-run, for rule changes. We do these things called feature flags: since we're using Chef, we set an attribute saying "turn this on" for each server, and we have a script that rolls it out to 5%, then 10%, and so on. That enables us to roll out rule changes slowly and see ahead of time whether we're going to break things.

We also had a lot of learnings like I said earlier: we started with all of our rules in go-audit's user space, and we saw our filter latencies go off the charts; that part of the pipeline wrecked our SLA. So we had to reorder the filters, and doing the static filtering in the kernel module was key to deploying this at scale and still hitting our times. And with the data collection we did, once I build an automated testing framework with all of our sample messages, we should be able to measure the latencies and make sure we're not spending too much time filtering messages.
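For a feel of the Kubernetes side, here's a hedged Go sketch of a consumer that windows Pub/Sub events per container over five seconds; the project name, subscription name, and message schema are illustrative assumptions, not our production service:

```go
// Hypothetical downstream consumer: subscribe to a per-rule topic and
// emit one dashboard update per container per five-second window, no
// matter how many files an rsync touched.
package main

import (
	"context"
	"encoding/json"
	"log"
	"sync"
	"time"

	"cloud.google.com/go/pubsub"
)

type codeEvent struct {
	Container string `json:"container"`
	Path      string `json:"path"`
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical project
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("code-change-events") // hypothetical subscription

	var mu sync.Mutex
	pending := map[string]int{} // container ID -> changes seen this window

	// Flush the window every five seconds with one aggregate update
	// per container.
	go func() {
		for range time.Tick(5 * time.Second) {
			mu.Lock()
			for c, n := range pending {
				log.Printf("container %s: %d changes, refresh diff", c, n)
			}
			pending = map[string]int{}
			mu.Unlock()
		}
	}()

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		var e codeEvent
		if json.Unmarshal(m.Data, &e) == nil {
			mu.Lock()
			pending[e.Container]++
			mu.Unlock()
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Because the consumer is a plain Kubernetes deployment, it can scale on queue depth independently of how many SFTP-heavy customers happen to land on any one server.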
So yeah, we're hoping that other parts of the organization pick this up and start using it for other things. We had one of our engineers use it for customer runtime profiling, to see which customers are actually running executables. And again, we really wanted to use BPF for this; in the future I'm hoping we can pull out the audit piece, put something like Jess Frazelle's new bpfd on each endpoint, and ship that information by writing small BPF programs that execute on each server. Special thanks to my team, and to Slack for writing go-audit.

Any questions? If not, I can show that we even have latency charts for the individual parts: this is just the output writer latency, this is just the filter pipeline, this is just the parser. So we can really dial in and see exactly where the latencies are happening, because at this volume, if something gets latent, it just crushes the whole thing.

I'll repeat the questions back into the mic. The question was: are we monitoring inside the container or at the host machine level? Because we're doing systemd containers, we're not truly isolated. We have one directory on each server that holds all of our customer directories, and each one of those is a container root. Our containers at this point in time, and we're actually changing our container runtime as we speak, Jesse back there is working on that and it's really awesome, are very loose systemd containers that use cgroups and kernel-level capabilities to create the boundaries and restrict things. So we're actually doing host-level monitoring. We're going to have to adapt this, because customers are going to start running inside real Docker containers; well, roughly, it's systemd with runc and a bunch of other sauce. When we get there, we'll have to make some modifications so we can add extra information about which container an event is in and be able to execute inside that container.

I think there was another question over here, about the auditor itself, the shipper. The shipper isn't containerized right now; it just runs under systemd. To repeat the question: where are we running go-audit? go-audit currently runs in a systemd container, with a systemd slice and the whole thing associated with it. We're going to have to Dockerize it for the new runtime, so eventually it will be containerized. You just have to grant permissions for the netlink sockets: to access those netlink sockets from within a container, the container has to have the capability for doing so at the kernel level; there's a small sketch of checking for that below. And that's my time, so thank you very much.
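As a closing illustration of that capability requirement, here's a small Go sketch, an assumption about how a containerized go-audit might guard its startup rather than anything from the talk, that checks for CAP_AUDIT_CONTROL before touching the audit netlink socket:

```go
// Hypothetical startup check for a containerized shipper: configuring
// audit rules over NETLINK_AUDIT requires CAP_AUDIT_CONTROL, so an
// unprivileged container will fail; better to say why up front.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func hasCap(c int) (bool, error) {
	hdr := unix.CapUserHeader{Version: unix.LINUX_CAPABILITY_VERSION_3}
	var data [2]unix.CapUserData // version 3 uses two 32-bit words
	if err := unix.Capget(&hdr, &data[0]); err != nil {
		return false, err
	}
	return data[c/32].Effective&(1<<uint(c%32)) != 0, nil
}

func main() {
	ok, err := hasCap(unix.CAP_AUDIT_CONTROL)
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		log.Fatal("missing CAP_AUDIT_CONTROL; grant it to this container")
	}
	log.Println("audit capability present; safe to open NETLINK_AUDIT")
}
```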