Welcome everyone to the session, Building Resilient Security Log Pipelines with Chaos, by Prima Virani. We're glad you could join us today. So without further delay, over to you, Prima.

Hi everyone. In today's session I'm going to talk about the popular practice of chaos engineering and how security teams can potentially benefit from it, and I'll dive straight into it. I hope you enjoy this presentation as much as I've enjoyed researching it, thinking about it, and planning the experiments. So here we go.

Specifically, I'm going to talk about chaos engineering in the context of security log pipelines, because log pipelines are the building blocks of a robust incident detection and response infrastructure. A lot can go wrong with them as we build them, and sometimes as we buy them, and I want to talk about how we can potentially make them more resilient and more robust with chaos engineering.

So let's talk about the status quo that's prevalent in the security industry today. Blue teams are highly reliant on the quality of their logs for good detections, and log pipelines are often built without reliability considerations. What this means in a real-world scenario is that a security team trying to respond to an incident can find out at the last minute that the logs they wanted to dig through, to identify where the incident started or how it's progressing, simply aren't available, so they can't dig deeper and are blocked significantly.

Log pipelines are often built purely for functionality, and quality assurance and reliability become an afterthought. That's a bit ironic, because with regular development practices the problem is usually of the opposite nature: security comes in as the afterthought, while reliability is generally well thought of in the design process. So, as we know it, a lot of these log pipelines are built with really fragile glue code and tend to have multiple single points of failure. That's how security pipelines are built today in a lot of companies, and it turns out to be especially true for small and medium-sized enterprises that don't have many resources to dedicate to security in general, let alone the development side of it. It gets very challenging.

Now imagine that one morning, as a security person, you find out that your employees' laptops are locked with ransomware. The list of affected employees keeps getting bigger as time passes, and the impact is huge: critical business systems are down, and there is no productivity whatsoever. And the log collection you were relying on to identify where this issue started, and any other trails you might need to follow, suddenly isn't available to you.
The log collection on the specific parts of the infrastructure you're trying to examine stopped about two weeks ago, and the security team didn't know about it. Just imagine the kind of chaos that introduces within the company, and within the security team in particular, because suddenly you aren't able to investigate an incident that, as a security team, you're expected to both mitigate and investigate in order to uncover the full story.

So what are some of the challenges? Log pipelines are built just for functionality, robustness is often ignored, and there's a lack of incentives for engineers to build for robustness. This turns out to be particularly true on security teams because, as I mentioned earlier, the teams tend to be very small, and there's usually a lot to do for a really lean team; even the standard security initiatives tend to get deprioritized in light of something more critical.

This especially becomes a challenge when there are burning fires all the time, which tends to happen a lot if the company is dealing with financial data, healthcare information, or personally identifiable information. Financial data in particular makes for a very lucrative target, and at all times there's someone trying to get hold of those financial assets and exploit the systems. If the overall security posture of the company is not very mature, the situation can feel like burning fires every other week, and each of those incidents can be so time-consuming that you're focused on the short term, putting out the fire that's currently burning. Planning for the long term becomes difficult, and even if you have planned for the long term, sticking to the plan becomes equally difficult.

So there's a lack of incentives for engineers to build for robustness for exactly this reason: there's already so little time to build that whatever you're building as a security engineer needs to be functional, but as soon as everybody knows it works, iterating further on it and making it as reliable as regular commercial software becomes secondary.

And then, every device and service is a special snowflake. Anybody who has dealt with the challenge of log collection knows that to collect logs from, say, ten different systems, you may need to write ten different solutions, or at least seven or eight; there's very little that can be rinsed and repeated. Unfortunately, that's because a lot of the logs we're trying to collect come from third-party infrastructure. Some logs are from Duo, for example, your multi-factor authentication provider. Some are from Okta or OneLogin, your SSO provider. Some are from Amazon, and then, if you have a combination of AWS and bare metal, a lot of your logs come from your internal infrastructure, including your bare-metal machines.
And all three or four of these systems have a different way in which you can aggregate logs from them. From Duo or Okta, for example, you might need to write your own scripts to fetch the logs from the third-party APIs. Each of those APIs has a different mechanism for authentication and a very different schema for how the logs are structured, so if you're working with a language like Golang, you might need to write structs for every single type of log you're trying to collect. (I'll show a small sketch of what such a script can look like in a moment.) And when it comes down to your regular infrastructure logs from your Linux servers, for teams to be able to collect them you need to use syslog forwarding agents or something along those lines. Ultimately, as a security team, you're trying to collect all of these logs in one single place, ideally your SIEM solution, but every single one of them goes through different nodes and hops on its way to the SIEM. That's where it becomes very challenging to keep an eye on every one of those streams and every choke point where it could potentially fail, potentially leak in some cases, or have temporary disruptions. So again, every device and service is a special snowflake of its own, and collecting logs from each one tends to be a methodology of its own.

We've talked enough about the challenges and the status quo. Now let's dive into some of the solutions for keeping security log pipelines resilient and robust at all times.

One of the most basic solutions is monitoring and alerting on log spikes and dips, that is, monitoring for any time there's a drop in log volume. For example, if you're aggregating logs from Okta into your SIEM, it might make a lot of sense to alert when basically no logs have come in for, say, more than ten minutes. You can introduce more creative ways of monitoring there as well: during regular working hours, the nine-to-fives, the traffic is high enough that even ten minutes of quietness might indicate something is off, whereas on evenings and weekends you might not see as much authentication activity, so it might make more sense to alert only if logs are absent for more than half an hour or an hour. Introducing appropriate windows like that means you're not flooded with false positives, and it's really important to monitor whenever there are log drops.
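To make that a little more concrete, here's a minimal sketch in Go of what that kind of log-gap monitoring could look like. The ten-minute and one-hour windows and the nine-to-five definition of business hours are just the illustrative numbers from above, not a recommendation, and the function and stream names are placeholders; in a real pipeline the last-seen timestamp would come from your SIEM or your collector's state rather than being hard-coded.

```go
package main

import (
	"fmt"
	"time"
)

// maxQuietWindow returns how long a stream may stay silent before we alert.
// Busy hours get a tight window; quiet hours get a looser one to avoid false positives.
func maxQuietWindow(now time.Time) time.Duration {
	wd, hr := now.Weekday(), now.Hour()
	businessHours := wd >= time.Monday && wd <= time.Friday && hr >= 9 && hr < 17
	if businessHours {
		return 10 * time.Minute
	}
	return 60 * time.Minute
}

// checkStream compares the last event seen on a stream against the allowed window.
func checkStream(name string, lastEvent, now time.Time) {
	if gap := now.Sub(lastEvent); gap > maxQuietWindow(now) {
		fmt.Printf("ALERT: no %s logs for %s\n", name, gap.Round(time.Minute))
	}
}

func main() {
	now := time.Now()
	// In a real pipeline the last-seen timestamp would come from the SIEM or collector state.
	checkStream("sso-auth", now.Add(-45*time.Minute), now)
}
```

The point is simply that the quiet window you alert on should follow the traffic pattern of the source, so you catch drops quickly during busy hours without drowning in false positives at night.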
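And coming back to the collection scripts I mentioned, here's a minimal sketch of what pulling events from one of those third-party log APIs could look like, again in Go. The endpoint path, the bearer-token header, the struct fields, and the environment variable names are all placeholders I've made up for illustration; they're not any particular vendor's real API or schema, and every provider you integrate will differ in exactly these details.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// AuthEvent is one struct per log type; in practice each source tends to need its own.
type AuthEvent struct {
	Timestamp time.Time `json:"timestamp"`
	Actor     string    `json:"actor"`
	Outcome   string    `json:"outcome"`
}

// fetchAuthEvents pulls a page of events from a hypothetical third-party log API.
func fetchAuthEvents(baseURL, token string) ([]AuthEvent, error) {
	req, err := http.NewRequest("GET", baseURL+"/api/v1/logs", nil)
	if err != nil {
		return nil, err
	}
	// Every provider authenticates differently; a bearer token is just one example.
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}

	// Each provider also has its own schema, hence one decode target per source.
	var events []AuthEvent
	if err := json.NewDecoder(resp.Body).Decode(&events); err != nil {
		return nil, err
	}
	return events, nil
}

func main() {
	events, err := fetchAuthEvents(os.Getenv("LOG_API_URL"), os.Getenv("LOG_API_TOKEN"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	fmt.Printf("pulled %d events\n", len(events))
}
```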
And then, sanity check your alert configurations. Sometimes the systems that are supposed to send the alerts about the log spikes or log drops are themselves down, and if the alerts aren't being sent out as expected, that's a whole other range of problems: you can never be sure whether even the monitoring you've implemented is accurate and up to date at all times. So it's always a good idea to sanity check your alert configurations for exactly that reason, and to involve DevOps and quality assurance in your dev practices.

That last one is one of the more obvious solutions out there, but I also understand it can be challenging to implement, especially because it can be difficult to pull in resources from what tend to be similarly lean teams; DevOps and QA are never really massive teams in most small to medium-sized enterprises. So, where possible, involve DevOps and QA in dev practices, but I wouldn't say that's always feasible. Sometimes something as simple as having the DevOps team lay out the foundational groundwork of best practices is also very useful; once they've laid that out, the onus can be on the security team to follow the practices that were agreed upon, recommended, and laid out by the experts.

Of course, all of this is barely scratching the surface, and these solutions are not nearly enough. That's where chaos engineering comes in handy. Chaos engineering, as most of you would know, is an empirical and experimental approach that addresses chaos in distributed systems at scale. It's a method where, as a blue team or defender team, we introduce controlled and planned failures into our own infrastructure at different intervals of time, and then we observe. In addition to the failures we intentionally injected ourselves, there can be failures caused by the knock-on effects of those failures, unintentional consequences of intentional actions. That tends to happen a lot in a real environment, whether dev, test, or prod, so observing the failures caused by knock-on effects is very important. And of course, observing something fail gives us a well-defined problem, where we can sit down and pinpoint exactly what's failing and exactly why, and then start devising strategies for how to mitigate those failures and address them in the best way possible.

There are many tools available out there at the moment, open source as well as some commercial offerings, which we can talk about a little later. One of those tools is Chaos Monkey, which was developed by Netflix. Netflix has been, as most of you would know, at the forefront of developing the chaos methods, and a lot of what I'm talking about today is in fact inspired by the O'Reilly book on chaos engineering that they wrote and published; I highly recommend reading it as well.

When it comes down to security in particular, and where we can really benefit from these methods, I can give you some examples of the kinds of scenarios in which they might be very useful. One of those scenarios: let's say you're collecting logs from one application that has four different services.
Now, you are monitoring for when the application's logs fail completely, but you aren't necessarily able to monitor for when the logs of one particular service within that application fail. For example, say you have application logs, authentication logs, and network logs coming from this one app, and all three streams come through with different tags and different labels. First of all, it might be a good idea to monitor for failures on each of those streams individually instead of just watching the entire firehose. And in a similar fashion, it might be very interesting to inject a failure into one of those streams and see what happens. Does it affect the two other streams that we thought shouldn't be reliant on this one, but somehow turn out to be? If one goes down and all three end up going down, then you know you have a single point of failure right there that you might need to address, possibly with the third party whose application you're collecting the logs from. That's one example of how the chaos methods can be applied to log pipelines and the logging infrastructure as we know it today.

Of course, just like any methodology, there's no such thing as a silver bullet; there are going to be limitations to every approach we take. Some of the limitations we must discuss here are around understanding the philosophy behind chaos engineering, because a lot of people mistake chaos engineering for the art of breaking. Yes, it is the art of breaking, but it's not only that; more than the art of breaking, it's the art of observing. First of all, never run a chaos experiment that you already expect to fail, because that's counterintuitive: chaos engineering should be used as a discovery method more than anything else. The systems you think you understand and have high confidence in are the ones where it can be very useful to introduce chaos experiments, because those are the systems that may surprise you, and that's where you'll actually get something out of the experiments you've run. For the systems you already know will fail, and know exactly how, there's no real need for discovery, and making them break just for the sake of it doesn't make sense.

Chaos engineering is also futile without appropriate monitoring in place, because unless you can observe your experiment and its knock-on effects, you can't learn anything; the whole point is discovery, and you can't discover what you can't even see. That's a very important thing to keep in mind. And again, because this approach involves breaking, it requires a higher risk appetite from security teams.
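Going back to the example of the one application with application, authentication, and network streams, here's a minimal sketch of what a stream-level failure injection could look like. The fifteen-minute experiment window and the little in-memory filter are stand-ins I've made up for illustration; in a real experiment you'd drop or delay the stream at your forwarder or collector, and the observation would happen in your SIEM and your per-stream monitors.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a single log record with the tag that identifies its stream.
type Event struct {
	Tag  string
	Body string
}

// chaosFilter drops every event carrying dropTag until the experiment window closes.
type chaosFilter struct {
	dropTag string
	until   time.Time
}

// forward returns the event and true if it should reach the SIEM,
// or false if the injected failure swallows it.
func (c chaosFilter) forward(e Event, now time.Time) (Event, bool) {
	if now.Before(c.until) && e.Tag == c.dropTag {
		return Event{}, false
	}
	return e, true
}

func main() {
	// Inject a 15-minute failure into the "network" stream only.
	filter := chaosFilter{dropTag: "network", until: time.Now().Add(15 * time.Minute)}

	delivered := map[string]int{}
	for _, e := range []Event{
		{Tag: "application", Body: "app event"},
		{Tag: "authentication", Body: "auth event"},
		{Tag: "network", Body: "net event"},
	} {
		if out, ok := filter.forward(e, time.Now()); ok {
			delivered[out.Tag]++
		}
	}

	// Observation is the point of the experiment: did only the targeted
	// stream go quiet, or did the siblings drop with it?
	for _, tag := range []string{"application", "authentication", "network"} {
		fmt.Printf("%-15s delivered=%d\n", tag, delivered[tag])
	}
}
```

The experiment only has value if the per-stream monitors described earlier are in place: the question you're answering is whether only the targeted stream goes quiet, or whether the other two drop with it and you've found a single point of failure.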
Part of that higher risk appetite is understanding that introducing chaos into your systems doesn't mean they will always, invariably fail, but also accepting that they very much can. That requires a higher risk appetite than usual from the security team, so this is something that works really well for a security team that already has a high level of maturity in its program, where the basics are done and it's just a matter of introducing more maturity. Teams like those would benefit highly from it, versus teams that are still in a nascent stage and don't even have the bare-bones groundwork of logging, monitoring, and alerting laid out. For the exact same reasons, it would be difficult to justify chaos experiments in a small security team. And for this exact reason, a lot of what I'm talking about is, for me personally, still very much on paper. There are companies out there that have found this idea useful and interesting enough that they've taken it up and started to develop commercial offerings around it, but personally I haven't had the opportunity to really go all out with chaos engineering within the infrastructure and environments I've been a part of, because of late I've been part of a lot of startups with very small, nascent security teams in their early stages.

In addition to introducing chaos into your log pipelines, I'd also suggest introducing it into your alerting pipelines, because both alerting and response would benefit from chaos experiments. What tends to happen is that as the alerting infrastructure of a security team starts to mature, a lot of incident response automation gets introduced into the ecosystem, because security engineers don't want to sit there repeatedly and manually answering or commenting on the exact same types of incidents every single time. Teams can benefit a lot from automating away the parts that already follow a set number of steps or set procedures. But every time you automate something, it introduces its own level of complexity and a whole range of unknown unknowns. So every time automations are introduced, it might be beneficial to introduce chaos into the testing methodology for those as well; I'll show a small sketch of what that could look like in a moment.
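Here's that sketch: one way to exercise incident response automation is to push a clearly labeled synthetic alert through the pipeline and fail loudly if the automation never acknowledges it within a time bound. The channel-based pipeline, the handler, and the timeouts here are stand-ins I've made up, not any particular SOAR or alerting product; a chaos experiment on a real pipeline would also delay, drop, or malform that acknowledgement to see what breaks downstream.

```go
package main

import (
	"fmt"
	"time"
)

// Alert is a minimal stand-in for whatever your alerting pipeline carries.
type Alert struct {
	ID        string
	Synthetic bool
}

// handler stands in for incident response automation: it triages an alert and
// acknowledges it. A chaos experiment would also delay or drop this acknowledgement.
func handler(in <-chan Alert, ack chan<- string) {
	for a := range in {
		time.Sleep(50 * time.Millisecond) // pretend to enrich and triage
		ack <- a.ID
	}
}

func main() {
	in := make(chan Alert)
	ack := make(chan string)
	go handler(in, ack)

	// Push a clearly labeled synthetic alert through the pipeline.
	in <- Alert{ID: "synthetic-0001", Synthetic: true}

	// Fail loudly if the automation never acknowledges it within the time bound.
	select {
	case id := <-ack:
		fmt.Println("automation acknowledged", id)
	case <-time.After(2 * time.Second):
		fmt.Println("ALERT: automation never acknowledged the synthetic event")
	}
}
```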
So, that's my deck for today. In conclusion: let's join the chaos. Let's try to make these experiments come alive as much as possible by leveraging the existing frameworks and tools that are already out there, and let's also explore. If you just Google chaos engineering, you'll find a whole lot of both open source and commercial offerings, including commercial offerings with a freemium model, a free basic version you can download and run experiments with.

It would also make a lot of sense to set up a lab environment to run these experiments before running them in production, or in any dev, test, or prod environment of an actively running company. And of course, I'd like to shout out Aaron Rinehart, Tammy Butow, and Nora, whose research and work in this area has inspired a lot of the thoughts and ideas I discussed today. Oh, and I'm on Twitter; if you'd like to join the conversation, we can always connect there and have some interesting discussions. So yeah, thank you so much. This was my talk today, and I hope you found it useful.

Thanks, Prima, for sharing your experience with us today.

Cool. Thank you so much.