Welcome to Challenges in Cloud Native Forensics here at Cloud Native Security Day. Happy to be here. I'm Andrew Krug, a security geek and technical evangelist at Datadog, and up on screen is where you can reach me if you have questions or feedback about the talk. I started my career in forensics a long time ago; I actually had the pleasure of participating in a college program teaching industrial and law-enforcement-style forensics.

Today there can be a variety of reasons you have to perform forensics, with two primary domains: security incidents and legal incidents. Things such as breaches due to misconfigurations, which we see all the time, or even employment-law matters will force you to sift through logs, disk images, and memory samples from your environment. Regardless of the reason, though, I tend to think of forensics as the definition in that purple box on screen: telling stories that occurred in a specific time window, using facts that can be derived provably and repeatably.

So let's talk for a minute about what it means for a process to be provable and repeatable. We used to describe this in terms of validated tooling and keeping track of the chain of custody for pieces of evidence. Tool validation, put simply, means we can use a tool over and over, and that tool has been studied and proven to yield the same results hundreds or thousands of times in academic and stress tests, and proven not to modify the artifacts we've gathered as evidence. Think about how important that actually is: your artifacts must be the same at the end of your investigation as they were at the beginning, so we can prove beyond a shadow of a doubt that we haven't futzed with the evidence we have available.
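In practice, the "artifacts unchanged from start to finish" property is usually enforced with cryptographic hashes recorded at acquisition time. Here's a minimal sketch of that idea; the artifact file here is a made-up stand-in, not a real disk image:

```python
import hashlib
import tempfile

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 so large disk images need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in for an acquired artifact; a real disk image would be gigabytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".img") as f:
    f.write(b"\x00" * 4096)
    artifact = f.name

baseline = sha256_of(artifact)          # record this digest at acquisition time
# ... analysis happens here, working only on copies of the artifact ...
assert sha256_of(artifact) == baseline  # digest unchanged, evidence intact
```

A matching digest at the end of the investigation is exactly the "provable and repeatable" claim: anyone can recompute it and get the same answer.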
And the chain of custody is just a complete record of who checked out which piece of evidence, who may have modified it, et cetera, so that we can go back and see the complete log of what happened over the course of an investigation. This is one of the things that has notably gotten a tiny bit easier with cloud provider control plane logs like AWS CloudTrail.

Which brings us to the problem statement: the more we embrace DevOps and "cattle, not pets," the more challenged these forensic processes become. That's due to three distinct pillars that continue to challenge us in cloud native forensics. First, ephemerality: short-lived instances. Second, scale and scope: the number of workloads you might have to perform forensics on, and whether that means a single AWS account, one Kubernetes cluster, or hundreds of Kubernetes clusters. Third, technology: the very technologies we put in place to help us do security sometimes actually hinder us, and we're going to talk a little bit about that.

But first, let's talk about those short-lived instances, those ephemeral workloads. I still remember a time, not that long ago, when you put servers in racks and installed operating systems on them, and they ran for five years before being unracked. Today we have shorter and shorter lived workloads. What you're seeing on screen are figures from a Datadog report on how long different classes of workloads exist. In serverless compute, a workload might live for four minutes. Orchestrated containers might live for half a day to a day. Unorchestrated containers might live for something like four to six days. That's not very long compared to the span of years something might once have spent in production providing compute.
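The custody-record idea can be made tamper-evident in the same way CloudTrail's log integrity validation works: each entry's hash covers the previous entry's hash, so rewriting history breaks the chain. A toy sketch, with hypothetical actor and evidence names:

```python
import hashlib
import json

def _digest(prev_hash: str, fields: dict) -> str:
    payload = prev_hash + json.dumps(fields, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append_entry(chain, actor, action, evidence_id):
    """Append a custody event whose hash covers the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    fields = {"actor": actor, "action": action, "evidence": evidence_id}
    chain.append({**fields, "prev": prev, "hash": _digest(prev, fields)})

def verify(chain) -> bool:
    """Recompute every link; any edited entry invalidates all later hashes."""
    prev = "0" * 64
    for e in chain:
        fields = {k: e[k] for k in ("actor", "action", "evidence")}
        if e["prev"] != prev or e["hash"] != _digest(prev, fields):
            return False
        prev = e["hash"]
    return True

chain = []
append_entry(chain, "akrug", "checkout", "snap-1234")   # hypothetical IDs
append_entry(chain, "akrug", "checkin", "snap-1234")
assert verify(chain)
chain[0]["actor"] = "mallory"   # tamper with history...
assert not verify(chain)        # ...and verification fails
```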
So we're cycling things out quite a bit faster, and that's actually a problem for forensics because it destroys evidence. When you deploy, your containers go away; when you redeploy, your EC2 instances go away; when your auto-scaling group expands and contracts, we're throwing away valuable instance data. And according to Google's State of DevOps report, the average company deploys 626 times a year. That's almost twice a day for a medium-competency DevOps shop. That's quite frequent, and considering that the mean time to detect an incident or a breach can be weeks or months depending on how you find out about it, you could already have gotten rid of all the evidence that would help you break the case and figure out exactly how somebody got in, what they got out, et cetera. So that's a problem.

And scale adds another dimension to this as well. Most environments running orchestrated containers, or any sort of orchestration on top of AWS, GCP, or Azure, are running a lot of workloads. This is just a statistic from one single report, so take it for what it is, but the bottom line is that most environments are now large by design, and many are multi-cloud or multi-tenant, with containers and EC2 instances side by side. This makes the evidence collection and custody chain problem more prominent than it has ever been. For years now I've been doing talks on forensics in the cloud using cloud technology, because I'm convinced that the only way to do incident response and this kind of analysis in the cloud is to use cloud compute to scale out the analysis effort in the same way that we scale out workloads. At some point, though, this becomes an incredibly heavy burden if you need to start collecting data from hundreds of instances or thousands of systems in a single fleet all at once.
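Scaling the collection out the same way we scale workloads mostly means fanning acquisition across the fleet instead of walking it serially. A sketch of that shape, where `collect_artifacts` is a hypothetical stand-in for whatever per-instance acquisition you actually run (snapshotting a volume, grabbing a memory sample, shipping logs):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_artifacts(instance_id: str) -> dict:
    """Hypothetical per-instance acquisition step; a real one would call
    your cloud provider's APIs (and spend most of its time waiting on I/O)."""
    return {"instance": instance_id, "status": "collected"}

def collect_fleet(instance_ids, workers: int = 32) -> dict:
    """Fan acquisition out across the fleet; I/O-bound work parallelizes well."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(collect_artifacts, i): i for i in instance_ids}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

fleet = [f"i-{n:04x}" for n in fleet_range] if (fleet_range := range(100)) else []
report = collect_fleet(fleet)
assert len(report) == 100
```

The same pattern maps onto serverless functions or batch jobs when one coordinator machine can't keep up, which is the "use cloud compute to analyze the cloud" argument in miniature.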
And the pace of technology is such that the tooling that allows us to do forensics isn't really keeping up with the security controls that now prevent us from doing an effective job of timeline reconstruction. Back in the good old days we talked about two different types of forensics, and I want you to remember these two distinctly different types. One was what we called cold, or dead-system, forensics, where we'd actually take hard drives out of systems, image them, and carve the file system for deleted files and artifacts that would help us reconstruct a timeline. The second is live forensics: the act of getting volatile system data that would go away when you turn the machine off. And this wasn't just memory; it was also things like network information and process information, things that wouldn't necessarily be resident on disk.

For a time we thought this was going to be a massive boon to the industry, because we could crack a case so much faster using live memory samples than by carving through a disk. It turned out, though, that despite its efficacy, and the fact that it could solve problems disk forensics couldn't, it was short lived, because operating systems started adding more and more security features, beginning around 2010 with data execution prevention. Then came address space layout randomization (ASLR), which takes the address at which a given process is loaded into memory and randomly offsets it, and later kernel address space layout randomization (KASLR), which randomizes the base address at which the kernel itself is loaded. That made it inherently very, very difficult to reconstruct memory samples from systems running KASLR, and it effectively broke an entire ecosystem of tools when those security features landed.
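One way to see why KASLR broke so many memory tools: analysis can no longer assume the kernel sits at a well-known address, so it has to locate it, for example by scanning the image for a known signature. This is a deliberately toy simulation, not a real memory layout; the signature bytes and image size are made up for illustration:

```python
import random

SIGNATURE = b"KERNELSIG"  # stand-in for a structure a tool pattern-matches on

def make_image(base_offset: int, size: int = 1 << 16) -> bytes:
    """Build a fake memory image with the 'kernel' placed at a random base."""
    img = bytearray(random.randbytes(size))
    img[base_offset:base_offset + len(SIGNATURE)] = SIGNATURE
    return bytes(img)

def find_kernel_base(image: bytes) -> int:
    """Pre-KASLR tools could hardcode the base address; now they must scan."""
    return image.find(SIGNATURE)

random.seed(7)  # fixed seed so the 'randomized' base is reproducible here
true_base = random.randrange(0, (1 << 16) - len(SIGNATURE))
image = make_image(true_base)
assert find_kernel_base(image) == true_base
```

Real tools do a fancier version of this scan against real kernel structures, and every new randomization or hardware obfuscation feature makes that recovery step harder, which is the treadmill the ecosystem has struggled to keep up with.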
And there are a variety of good tools that exist; they just can't necessarily keep up with all the innovation in memory randomization. Now, if you look at hardware, vendors are actually putting memory obfuscation features into hardware, which makes it increasingly difficult to analyze memory samples. So often what we end up with, even if we do everything right, is a complete inability to analyze live system data at any practical scale, especially as memory sizes and instance sizes increase.

We really need to make the art of forensics a first-class citizen. We need to be thinking, as we build out operating systems and orchestrators, that eventually, during a security incident, someone is going to need to lawfully intercept, if you will, the data flowing through that system in a way that they can reconstruct, and that needs to be provable and validatable, just like we talked about. So if you think this is a good idea, please go plus-one GitHub issues on any of those projects I put on screen that have to do with ASLR or KASLR. Or if you are on the board of a prominent project like Kubernetes or the Linux kernel, let's have a chat and think about forensics a little bit differently. You too can be an advocate for all things forensics, and we can work together to make the world a better place. Thanks again. I'm Andrew Krug; here's where to contact me. I hope you enjoyed my lightning talk on concerns with cloud native forensics.