Hi, everyone. Thanks for joining today. Today we are going to talk about DFIR in containers, and how we can do DFIR in the container world. So let's start by introducing ourselves. I'm Stefano, and I'm a threat research lead manager at Sysdig. As you can see from the picture, I'm from Milano, Italy. Here is Alberto Pellitteri. He's a security researcher at Sysdig, and he comes from Turin, Italy. For those who don't know Sysdig, Sysdig is a container and cloud security company. Sysdig donated Falco, a runtime security project, to the CNCF back in 2016. It's now a runtime security project at the incubation level, and it is in the process of becoming a graduated project. Talking about Falco, we are both Falco contributors and are also active in the Falco community. And I'm also a reviewer for Falco rules.
Here is the agenda for today. We're going to talk about DFIR, and in particular we are going to look at incident response and digital forensics. Then we are going to cover and refer to the NIST incident response lifecycle. Then Alberto will go through some DFIR container methodology, techniques, and also specific tools for DFIR in containers, with some demos. And then we will wrap up the session with some main takeaways.
So let's start talking about DFIR. As we know, DFIR puts together two different areas, digital forensics and incident response. Digital forensics focuses more on collecting and analyzing data, user activities, and other pieces of digital evidence to understand what happened on a machine, or who may be behind a specific threat, a specific attack, or whatever we recorded in our environment. All these activities need to be done following best practices and a specific methodology to maintain the chain of custody, so that the evidence is legitimate and can be used and presented in court in case of legal proceedings.
Incident response, instead, is focused on preparing for, detecting, containing, and recovering from a data breach or a security incident. In the early days, as we know, these two processes were completely different and separated, of course, because they have different goals. But the tools, processes, and methodologies are pretty similar. And now tools like EDR and XDR have evolved, and they empower the incident responder to take further actions, like proceeding with the investigation and performing other actions. So it makes sense now to put them together under the same hat and refer directly to DFIR. And when we talk about DFIR, we usually refer to the NIST incident response lifecycle steps. Here are the four main steps, which are pretty well known. Of course, today we are not covering all the technical and non-technical aspects from NIST. I left the link in the slide so that you can check out the official NIST documentation if you are interested and want to know more about it. Today we are going to cover the technical aspects of how we can do DFIR in containers, and the specific techniques that we need to know in order to do that in the container world.
And so this is the point in the presentation where we start talking about best practices. And yeah, sure, you need to follow the best practices and you need to check for misconfigurations, right? I know it's kind of boring, but we need to do that. Jokes aside, when we talk about an incident response plan, it's always important to mention that it needs to be prepared in advance. This is fundamental: attackers aren't going to ask you if you are ready to receive the attack, or tell you, I don't know, "we are in, are you aware that we are already in your environment?" So we need to be prepared for that. And we also need to keep in mind that most, or at least part, of the incident response plan is non-technical.
So there are some aspects, like the contacts and the people that need to be involved, that need to be defined in advance. People from different teams need to know specifically which people they need to contact and exactly which actions they need to take. All these aspects are more about process and are not technical, but they are something that needs to be set up. And of course, if we talk about the technical aspects, as we said, there are specific tools that we need to run. The list of required tools should be set in advance, and all the knowledge about how to use a tool, or the specific flags that you need to use, should be prepared and be there before the attack. Because of course we can't spend time during an incident understanding how a tool works, or saying, "oh yeah, I need this tool to take, I don't know, a forensic image of something," and then having to figure out how the tool works. So timing is fundamental and we need to be prepared. And of course, even if you have an incident response plan in place, you need to make sure that it is up to date. I mean, people change companies, and tools change pretty quickly, as we know. So the list of tools that need to be used should be updated frequently, and the people that need to be involved need to be updated frequently as well.
Okay, so we covered what we needed to say. Of course, DFIR is a pretty well-known practice with well-known methodologies, and today we are not going to cover the well-known tools, which of course are still valid for containers. We are not saying they're not useful anymore. Of course they are, but they are pretty well known, and we just want to focus on specific tools that can be used to do DFIR activities in containers.
This also happens in other technologies like IoT or ICS, if you think about it. There are specific tools that need to be known and used in order to perform some DFIR activities. So let's start talking about DFIR in containers. We all know containers, we are pretty familiar with them, and we like them a lot. It's super easy to ship the code that we want and deploy it in our environment. So of course we know all the advantages that containers can bring, but we also know that there are some challenges that we need to face. And especially for DFIR, we know that containers are dynamic and ephemeral. So doing activities like DFIR in such a dynamic environment can be pretty challenging, and we need to respond as quickly as possible, because some of the activities that Alberto will show now need to be done while the container is live and still there. Containers are ephemeral and super quick to destroy, and some information is removed with them. So it's really important to gather this information first and then perform all the other activities. So I'll let Alberto deep dive into DFIR and all the techniques and tools to use.
Thank you, Stefano, I'm the lower one. So for this presentation, we also want to adopt a practical approach, covering the steps of the NIST incident response plan that we have just seen. To do so, we are going to go through all of those steps, outlining the main aspects that may be useful in the containerization world, along with tools, techniques, and so on. So let's begin with preparation. Preparation is maybe one of the main aspects of incident response methodologies, because good preparation allows you not only to prevent incidents by hardening your environment, but also to put the right capabilities in place so that the organization is ready to respond quickly.
We can talk about facilities, communications, hardware, software, whatever you want, but we're going to skip all the organizational and traditional steps, like setting up a war room or planning on-call coverage and so on. Instead we're going to focus our attention on open source tools that can help the incident response team to detect red flags and facilitate attack investigation and response. The first kind of tool that I want to stress is the container runtime security tool. Falco is maybe the first CNCF runtime security tool that allows you to spot threats at runtime. You can see whenever there are threats running in your Linux environment or in your containers, and you can also inspect audit logs and the resources and requests that are made to the Kubernetes API server. So you may want to spot this kind of threat at runtime. You may also want to adopt other open source tools and CNCF graduated projects like Prometheus in order to monitor your resources and see if there are peaks of load or network changes that may be relevant for your investigations. You may also want logging solutions, because one of the main best practices that you need to enforce during the preparation step is to log as much as you can from your applications, so that you have all the information that is needed. And then with tools like Fluent Bit and Fluentd, you can forward these logs, ingest them, and parse them, so that you have all of them together with, for example, Falco alerts in one single log management platform like the ELK Stack, which is another open source project, or OpenSearch, and so on.
In order to address DFIR in containers in a realistic way, we're going to adopt Kubernetes, because I think that most of you are familiar with it, and also because it is one of the main orchestration tools used by the open source community.
And as many of you already know, in Kubernetes we run pods that embed containers, and so we can also perform DFIR on this kind of container. So let's start with the demo that we have set up for this presentation. Here we have a Kubernetes cluster with a few nodes, where we have daemons like Falco, Fluent Bit, and Prometheus in order to monitor our resources, and we forward their alerts to a single log management platform like the ELK Stack. But as you can imagine, in order to perform DFIR in a container we need a workload that can be exploited and later analyzed. So let's teleport to the past and imagine that we are in 2021. I know you're thinking that I'm going to talk about Log4j, but I'm not; I'm bored of it. So let's talk about another vulnerability that was also disclosed in 2021, the one in the Apache HTTP server. It was a path traversal vulnerability that under some circumstances also allowed remote command execution. In our cluster, we exposed our workload, our Apache HTTP server, with a Kubernetes load balancer, so that it could be exploited by attackers.
So let's now wear the incident response team hat. An incident response team is always monitoring alerts, like the ones based on Falco detections that we mentioned, in order to be alerted as soon as possible and to respond promptly to the incident. Let's jump into the ELK Stack. In this case, we have some logs related to one of our Apache HTTP pods, where we can find something related to "Launch Ingress Remote File Copy Tools in Container", which is kind of suspicious. I don't think it happens every day that curl and wget are executed at runtime in your container. So this is a red flag.
If we go ahead, we can also see some other alerts, like a shell that was spawned by the Apache HTTP process, which is kind of suspicious as well, and also an attempt to steal sensitive credentials. But focusing our attention on "Launch Ingress Remote File Copy Tools in Container", we can see that a curl or wget was executed that tried to reach an endpoint at the IP address that you can see here, 194 and so on. That endpoint was used to download a malicious payload that we will see later. So let's now go ahead and look at the logs related to the impacted pod that were forwarded by Fluent Bit to our log management platform. In the logs, we can see that, for example, there were a bunch of GET requests at the top of this slide. Then there are also some commands that were not performed successfully, and some other operations that were not permitted. And at the end, if we go ahead, we can also see a POST request that is kind of suspicious, because it exploited the cgi-bin in order to spawn a shell.
Other hints, by the way, may also come from Prometheus, as I said before, because Prometheus allows you, for example, to query the pods that you have in a specific namespace. And you can see, for example, from the memory used, that at the beginning there was a normal amount of memory used by the pods in that namespace, and at a certain point in time there was this peak of load. I mean, this is just an example, also because it means that I'm a bad programmer, since I didn't set resource quotas or resource limits on my workload. But there are other kinds of indicators, like network bytes, that may be informative for your investigation. So, let's now continue. Okay, yeah, I love The Office.
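To make the alert triage described above a bit more concrete, here is a minimal Python sketch that groups Falco alerts by pod, so that you can quickly see which rules fired on the impacted workload. It assumes Falco is configured for JSON output, with one alert per line and a `k8s.pod.name` entry in `output_fields`. The sample alerts, the pod name, and the `group_alerts_by_pod` helper are hypothetical and only there for illustration; they are not taken from the demo.

```python
import json
from collections import defaultdict

# Hypothetical sample alerts shaped like Falco's JSON output
# (one JSON object per line). Pod name and command lines are invented.
SAMPLE_ALERTS = [
    json.dumps({
        "rule": "Launch Ingress Remote File Copy Tools in Container",
        "priority": "Notice",
        "output_fields": {
            "k8s.pod.name": "apache-httpd-demo",
            "proc.cmdline": "wget http://example.invalid/payload",
        },
    }),
    json.dumps({
        "rule": "Run shell untrusted",
        "priority": "Debug",
        "output_fields": {
            "k8s.pod.name": "apache-httpd-demo",
            "proc.cmdline": "sh",
        },
    }),
]


def group_alerts_by_pod(lines):
    """Return {pod name: [rule names]} for every parsable alert line."""
    by_pod = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            alert = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON alerts
        fields = alert.get("output_fields") or {}
        pod = fields.get("k8s.pod.name") or "<unknown>"
        by_pod[pod].append(alert.get("rule", "<no rule>"))
    return dict(by_pod)


for pod, rules in group_alerts_by_pod(SAMPLE_ALERTS).items():
    print(pod, rules)
```

In a real setup the lines would come from wherever the alerts are stored, for example an export from the log management platform, rather than from an in-memory list.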
Now, this is one of the most interesting parts, because the incident response team should always be prepared for this kind of event and should be able to respond as soon as possible to the alerts that were raised. This is also because, in your orchestration environment, whenever you have a container that has been impacted, you need to respond as soon as possible because, as Stefano said before, containers are ephemeral. These containers can disappear, and you may lose information about the attack that happened in your environment. So let's now go ahead and talk about the actual response that is carried out by the response team. You want to contain, to eradicate, and also to recover your infrastructure. Contain basically means isolating the attack, because you want to avoid the affected pods or containers impacting other resources. If you are, for example, in a Kubernetes cluster, the pod that was impacted may have access to a service account. That service account may grant access to some requests to the Kubernetes API server, and so, for example, the attacker can generate a mess. And of course, you also need to ensure the availability and continuity of the services that you provide. You must also assess which resources have been impacted. For example, a compromised container may lead to a pod escape or a container escape, granting the attackers direct access to the nodes, and this of course can create other problems. What is also important for the forensics step is to collect all the evidence about the attack. So, as always, the first and main step is to snapshot the worker node volume, because this allows forensic analysis to be done later during the post-mortem phase. But you can also use other commands, like commit and push, in order to store the container image that was impacted, and you can also checkpoint the container, which we're going to see later.
And last but not least, at the end of this containment, eradication, and recovery step, you should also fix the entry point that was exploited by the attackers. So if it is a misconfiguration or a vulnerability, you need to fix it; otherwise you have to mitigate it if, for example, a patch is not available. Before jumping into the crime investigation scene, I would just like to stress that there are some useful tools and techniques that you may want to use whenever there are breaches in your containers. But I also want to stress that many of these tools continuously evolve. There are tools that you could use a few years ago with Docker, for example, that cannot be used anymore in an up-to-date Kubernetes cluster because dockershim has been deprecated. Of course, you can interact, as always, with the container engine or with the Kubernetes API server to do some operations that you're going to see in a while. But you may also want to adopt tools like container-diff in order to check the differences between two images, or Docker Explorer and Container Explorer in order to recover the state from a previously snapshotted worker node volume, so that you can see which containers were previously running.
By the way, let's start now with the isolation step. In this example, as we said, we are in a Kubernetes cluster and we know which pod was impacted. The first thing that we want to do is, for example, to label the pod and then apply a network policy in order to isolate the pod. In this way, if there was an attacker who, for example, had opened a reverse shell, they will be cut off from the pod itself, because the network policy doesn't allow any kind of ingress or egress traffic. Later, you may also want to cordon the node where the pod was scheduled.
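As a rough illustration of the isolation step just described, here is a minimal Python sketch that builds a deny-all NetworkPolicy manifest for pods carrying a quarantine label. The label key and value, the policy name, the namespace, and the `quarantine_policy` helper are all assumptions made for this example, not what was used in the demo. The manifest itself relies on standard `networking.k8s.io/v1` behavior: listing both `Ingress` and `Egress` in `policyTypes` without any rules blocks all traffic for the selected pods.

```python
import json


def quarantine_policy(namespace, label_key="quarantine", label_value="true"):
    """Build a NetworkPolicy that denies all ingress and egress traffic
    for pods carrying the given label (hypothetical label by default)."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "quarantine-deny-all", "namespace": namespace},
        "spec": {
            # Only pods with this label are selected by the policy.
            "podSelector": {"matchLabels": {label_key: label_value}},
            # Both directions are listed and no rules are given,
            # so nothing is allowed in or out of the selected pods.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


# Print the manifest as JSON; "demo" is a placeholder namespace.
print(json.dumps(quarantine_policy("demo"), indent=2))
```

The printed JSON can be applied like any other manifest, and the impacted pod is then labeled with the same key and value (for example with `kubectl label pod`). Keep in mind that network policies are only enforced if the cluster's network plugin supports them.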
Once the node is cordoned, any new workloads that are deployed in your cluster will be scheduled elsewhere, on other nodes. Having done so, you are now able to start some other forensic processes, like, for example, snapshotting the worker node volume or restricting network access to the worker node itself. You will want, for example, to remove the external IP that can be used to reach that worker node, and create an intermediate VM. But all these steps are mainly traditional ones, and our focus right now is to dig deeper into the investigation of the container itself. So let's now assume that we have enforced some of the steps that I've just mentioned, and that I'm going into the machine, into the worker node where the impacted pod was scheduled. We are now on the worker node itself, and one of the first things that we want to do is, for example, to interact with the container runtime. Of course, you must be sure that, for example, there was no pod escape or container escape, but we are going to see the eradication step later. One of the main things that you may want to do, without going through the Kubernetes API server, is to interact directly with the container engine, see the logs that we have already seen with Fluent Bit, and see which processes are running. In this example, you can see at the bottom the Kinsing malware. You can also see that we used ctr snapshot diff, which basically allows you to check the changes in the file system. So if, for example, the attackers were able to download some files, and there were changes in the file system itself, you can spot them from there. And you can see that in this case the /tmp path was involved, where they downloaded some suspicious files.
You can, of course, commit the impacted container so that you can push the image to a remote registry or store it as an archive. This committed and stored image can also be restored later on a sandboxed machine, an analysis machine that is isolated from your production environment. And in some other, more unusual cases, you may also want to use ephemeral containers, which became stable in version 1.25 and which mainly allow you to debug the container state or look at some evidence of the attack. They can be useful because they are simply a special kind of container that you can attach to an already existing pod, directly accessing the same process namespace as the impacted container. And you can see, for example, which processes are running. Of course, there are some disadvantages, because if there are threats running, they can detect that you jumped into this process namespace. So they may be dangerous sometimes, but they are useful if, for example, you have distroless images that don't have any kind of shell, tools, or utilities.
In a post-mortem analysis, you may also want to, for example, pull or extract the images that you previously committed and pushed, in order to download them onto an analysis machine. In this way you can see, for example, what the changes in the file system were. Or, and this is one of the best practices, you can take the snapshot volume of the worker node that you previously created and mount it on an analysis machine. And from the analysis machine, you can use tools like Container Explorer, or Docker Explorer if you previously used Docker, in order to check which containers were previously running. And I also have this video here where I'm showing how to use Container Explorer.
It basically allows you to check which containers were previously running in the previously snapshotted volume. From that, you can understand, for example, that there was the impacted Apache container. You can use the mount functionality so that you can mount the previously impacted container on the analysis machine. And from the analysis machine, you can run all the forensic processes and steps that you want. At the end of this video, you can see that I was able to successfully mount the file system of the previously impacted container and then see, for example, the /tmp directory that was affected in that container.
But, okay, let's go ahead. Another interesting functionality, which works mostly in Podman but not in all the other container engines, is container checkpoint. Container checkpoint allows you to freeze the container state. This means the container processes and executions, the file descriptors opened by the processes themselves, and also the TCP connections, for example, and store them in an archive. Then you can later restore the container execution in a separate environment. So this is quite useful if you also want to understand, at runtime, which executions were previously launched in the container that was impacted. By the way, this feature, as I said, works mostly in Podman, but there are other container engines that are trying to integrate it as well. I think that there is also the Kubernetes checkpoint feature, which is a beta feature right now and allows you to checkpoint a container through a request to the kubelet. It's still a work in progress, by the way.
So at the end of these containment and collection phases, of course you can do all the forensic processes that you want, but it is also important to understand whether the attack spread elsewhere.
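One of the forensic processes that fits at this point, and that supports the chain of custody mentioned earlier, is recording cryptographic hashes of the files you collect. The following is a minimal Python sketch, assuming the impacted container's file system is already mounted read-only somewhere on the analysis machine. The `hash_tree` and `sha256_file` helpers are hypothetical names for this example, and any mount path you pass in is up to you.

```python
import hashlib
import os


def sha256_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Return {relative path: sha256} for every regular file under root.

    Symbolic links and special files are skipped so that the walk never
    follows a link out of the mounted evidence directory.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if os.path.islink(full_path) or not os.path.isfile(full_path):
                continue
            manifest[os.path.relpath(full_path, root)] = sha256_file(full_path)
    return manifest
```

For example, calling `hash_tree` on the mounted `tmp` directory of the impacted container gives you a manifest that you can store next to the evidence and recompute later to show that the files were not altered.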
Containers, as you know, may be run with privileged permissions or with sensitive mounts, and this may allow attackers to escape from the container, from the pod to the host. You should always take care of the security settings, the security context, that you define in your workloads. Here I posted, for example, two kinds of security context and their analysis, which was done by kube-score, another open source project. Of course, if we are talking about pods, by default they are bound to a service account token. Most of the time, these service account tokens are pretty generic and don't allow you to do that much. But if you create a custom service account, it can communicate with the Kubernetes API server, make requests to it, create new resources, and technically also be used for lateral movement in the Kubernetes cluster. And if we are talking about Kubernetes clusters in the cloud, most of the time, whenever we create a cluster, its worker nodes are bound to IAM roles that have some privileges. If you modify those privileges, assigning unrestricted access to some cloud resources, this may allow attackers to attack your cloud resources as well. And then, of course, there is the recovery, which I think is one of the most implicit things to do at the end of an incident.
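Going back for a moment to the security context point above, here is a minimal Python sketch of the kind of check you could run on a pod spec to see whether an escape to the host was plausible. It is not how kube-score works; it is a hypothetical helper, `risky_settings`, that looks at a few standard Kubernetes pod spec fields (`hostPID`, `hostNetwork`, `hostPath` volumes, `privileged`, and an added `SYS_ADMIN` capability). The sample spec is invented for illustration.

```python
def risky_settings(pod_spec):
    """Return a list of human-readable findings for a pod spec dict."""
    findings = []
    if pod_spec.get("hostPID"):
        findings.append("pod shares the host PID namespace (hostPID)")
    if pod_spec.get("hostNetwork"):
        findings.append("pod shares the host network (hostNetwork)")
    for volume in pod_spec.get("volumes") or []:
        if "hostPath" in volume:
            path = (volume["hostPath"] or {}).get("path", "?")
            findings.append(f"volume {volume.get('name', '?')} mounts host path {path}")
    for container in pod_spec.get("containers") or []:
        name = container.get("name", "?")
        context = container.get("securityContext") or {}
        if context.get("privileged"):
            findings.append(f"container {name} runs privileged")
        added = (context.get("capabilities") or {}).get("add") or []
        if "SYS_ADMIN" in added:
            findings.append(f"container {name} adds the SYS_ADMIN capability")
    return findings


# Invented pod spec with several risky settings, for illustration only.
SAMPLE_POD_SPEC = {
    "hostPID": True,
    "volumes": [{"name": "host-root", "hostPath": {"path": "/"}}],
    "containers": [
        {
            "name": "web",
            "securityContext": {
                "privileged": True,
                "capabilities": {"add": ["SYS_ADMIN"]},
            },
        }
    ],
}

for finding in risky_settings(SAMPLE_POD_SPEC):
    print(finding)
```

An empty result does not prove that no escape happened; it only tells you that none of these particular settings made one easy.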
So you have, of course, to fix the vulnerability if possible. Sometimes it is not possible, and you want to enforce some mitigation, like, for example, deleting the workload, if that is possible given the services that you are providing or the workload that you have. Or you can restore the container, ensuring that the attack doesn't happen again in the future. Or you can enforce a playbook of actions: whenever a rule is triggered by, for example, Falco, you can run a specific playbook of actions in order to say, "hey, whenever this kind of rule is triggered, kill my container." So in order to wrap up this session, I'm now leaving the floor to Stefano again.
So thanks, Alberto, for all the information shared. Of course, the last step required in the incident response plan is post-incident activity. We took all the actions needed to contain and eradicate the threat, and we restored and recovered the environment. It is now back to normal and it's all working great. So that's fine. It's now time to do a retrospective to get the lessons learned from the incident. There are some things that probably worked, and other things that probably didn't. So it's important to understand what didn't work and why, and how we can do better. The output of this phase is literally a kind of checklist, a list of points on what we need to improve. And this is basically the input for updating the incident response plan that we have. So based on the feedback, based on what we discovered during the incident, we can update our incident response plan, make it better, and be sure to avoid the same incident in the future. Of course, we can't do much about what already happened, but we should take it as a good opportunity to do better work in the future. So let's wrap up what we said and take some main takeaways from this talk.
So as I said, don't be caught unprepared. Prepare your incident response plan as soon as possible. As we said, this is fundamental. You need to be prepared if something happens. Of course, we need to respond as soon as possible. As we said, containers are ephemeral. The container may not be there anymore, and some activities need to be done while the container is still alive. So be prepared and act as soon as possible. The other point, as Alberto said, is that containers aren't isolated by default, as we know. So things like a container escape might happen, and the attack moves to the host, to the node, or to other parts of your infrastructure. So don't just focus on the container that was exploited during your investigation, but go further, because there might be more to investigate. Be prepared to not just look at the container, but also to check all the rest of the infrastructure. Look for lateral movement, if there was any, and be prepared to go further in the investigation. It's also important to fully monitor your infrastructure. So be sure to have all the visibility that you need in order to perform all the activities that we saw before. As we have seen, there are specific tools that we need to use, and we need to know how they work and which tool we need to use. As we said, there are a bunch of different tools, but we need to stay up to date and we need to know how to use them before the incident happens. It's also important, and it's actually a best practice, to simulate breaches to test the response of the plan and of the people that are involved, to see if it's working or if you need to involve different people, perhaps, or update contacts or whatever. So check your plans. It's not enough to just create the plan and say, "okay, we have done it." We need to check it and see if it's working. If not, update it and make sure that it is working as well as you expected, and then be ready for the real incident.
And last but not least, as we have seen, in order to do this kind of activity, especially DFIR for containers but the same goes for other technologies, we need people who have a background in Kubernetes and containers but also incident response knowledge. Sometimes, of course, we don't have that knowledge inside our organization or inside our team. So it's important to hire or engage someone else who has this kind of knowledge and can do all these activities well. As we said, in digital forensics we also need to maintain the chain of custody. So it's important to do some specific activities following best practices, so that we can use this data in court or wherever it is needed. So we need to have this knowledge and we need to do the whole activity in a proper way. So I guess that's it from us. Thanks for joining. I think we have actually run out of time, but if you have any questions, we are here and we can answer them offline. Thank you very much.