My name is David de Torres. I work as an engineering manager at Sysdig, but I am also a maintainer of the open source project from KADO I.O. And I come here with my friend. I'm Mirko de Dordzi. I'm a software engineer at Sysdig, and I work on the read path of Monitor.

Well, as I said, this is an exploratory talk and we want to introduce the concept of defensive monitoring. But first, we want to give some context, because here, for sure, there are some people that work in security and other people that work in monitoring, and we want to introduce and compare some of the tools that we will use, to have some context for the talk. How many of you have used Prometheus in production, or have some knowledge about Prometheus? Well, great. I'm happy to see that. I will go quickly over this. Prometheus is a CNCF project; it is the standard for monitoring metrics in Kubernetes. And what it does is simple: it gets metrics, stores them, and lets you visualize the results. I will not get into the details of this, but what you need to know is that almost everything in your cluster exposes Prometheus metrics, especially the Kubernetes components. So, knowing this, you are interested in seeing the metrics. They look something like this: Prometheus uses a human-readable format called OpenMetrics, and you can access the metrics endpoint of any of these Kubernetes components and see something like this. This is great because you can see what they are exposing, the names of the labels, the values, whatever. But on its own this is not that useful. What Prometheus does is scrape these endpoints every 60 seconds, store the samples, and then, through a query language called PromQL, let you query them like a database: make aggregations, do math, and do a lot of different manipulations of the data to get different results. And this is something that you can see in Prometheus, for example.

So, there is another tool that we will talk about during this talk. Let's introduce Falco, for whoever hasn't ever heard of it. It's an open source CNCF project that joined the CNCF a couple of years ago. And it was born as a way to look into what's happening inside the kernel of a machine, create rules upon these events and alert. As an example, you can create an alert whenever someone opens a shell inside of a container, or, through some plugins, you can also create rules based on Kubernetes events and AWS events. For example, if someone is modifying a ConfigMap with some specific values, you can create alerts on that.

Before starting to talk about a couple of our ideas, we want to introduce this talk with a quote: when the only tool you have is a hammer, everything looks like a nail. Now, because me and David use Prometheus almost on a daily basis and this is a Prometheus talk, we should probably change it to: when the only tool you have is Prometheus, everything looks like a metric. And I think this summarizes quite well what we want to do and propose with this talk, which is to use Prometheus for something that it wasn't meant to be used for, try to leverage the information we already have whenever we monitor a Kubernetes cluster, and open the debate about how we can use this information for other purposes. In this case, security.
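To make that concrete, here is a minimal sketch of the raw material we will be working with throughout the talk: a sample in the OpenMetrics text format as a scraped endpoint would expose it, and a basic PromQL aggregation over it. The metric and label names are only illustrative, not the ones from the slides.

  # What a /metrics endpoint might expose (illustrative names):
  # HELP http_requests_total Total HTTP requests served.
  # TYPE http_requests_total counter
  http_requests_total{handler="/api",code="200"} 1027
  http_requests_total{handler="/api",code="500"} 3

  # A PromQL query over the scraped samples: per-second request rate
  # over the last five minutes, broken down by status code.
  sum by (code) (rate(http_requests_total[5m]))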
And why did we choose Prometheus? There are a couple of reasons. We can look at a feature matrix of some of the capabilities of the various tools: metrics-based monitoring, runtime security with Falco, and image scanning. And there are a couple of fields in which Falco obviously is winning. But there's an important one, we think, which is historical data and past context: Prometheus records values over time, and it allows us to query the past and see what the behavior of a system was. And having historical data allows us to do anomaly detection on this data. Another thing is that Falco has a really low-level view of your system, while Prometheus, whenever we scrape application-level metrics, can give us information about the applications. And on top of that, it's really good for monitoring resource usage and detecting possible events that are happening inside the cluster based on the usage of the resources.

Well, given this context, we want to introduce three areas where, in an exploratory way, we identified that we can use metrics to improve or to complement what Falco and other engines, or other tools like image scanning, are being used for. But first, I would like to show you some of the capabilities that we have with metrics. It is just a few slides; there is a full talk about this from one year ago at Prometheus Day. But just to give you some idea of the power of PromQL, because we will use these kinds of things later. Here, for example, we are using, in two lines of PromQL, something called the z-score. This is as simple as taking the average and adding two or three times the standard deviation, and with it we can find anomalies in the data. In this case, for example, we are finding that several pods emit this metric with one value and suddenly one of them has a very different value; we can think that this is an anomaly. This is a group anomaly. But we can also do this in time and say: OK, if in the last hour, or in the last day, or in the last week, this metric had this value and then suddenly there is a spike, maybe this is something that is worth studying. And this can be done also with PromQL. But what happens if this spike occurs every day, or every week, or every hour, because there is a cron job doing things, and it makes the CPU or the networking or whatever spike, and I don't want to detect this as an anomaly? Well, with some PromQL alchemy, we can do some seasonality profiling, and we can also do anomaly detection with seasonality. I will not get into the details of this; this is just to show you the power of PromQL and the time context that Prometheus gives.

So let's explore the first area that we identified and that we can share with you about defensive monitoring, and it is threat detection. We will give some examples. These examples are meant to be used as they are, or you can change them to use group anomaly detection or other things. We will go from simpler to more difficult ones. I will introduce the first one here: for example, checking if in the API server there have been requests denied with codes 401 or 403. These are unauthorized attempts to access the API server. But also, we can check if there has been an unauthorized attempt to access resources like secrets or ConfigMaps. And in the third one, we can see when someone has tried to modify the cluster at some level, doing a create, update, patch or delete. Why is this important? The threat we're modeling here, and trying to detect, is someone finding our Kubernetes cluster open to the world, trying to access it and fetch some data from it.
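As a rough sketch, the three queries can be written along these lines. They use apiserver_request_total, the standard kube-apiserver metric, but label values such as the verb casing can vary between Kubernetes versions, and the thresholds are arbitrary:

  # 1. Unauthorized or forbidden requests against the API server in the last five minutes.
  sum(increase(apiserver_request_total{code=~"401|403"}[5m])) > 10

  # 2. Requests touching sensitive resources such as secrets or configmaps.
  sum by (verb, resource) (increase(apiserver_request_total{resource=~"secrets|configmaps"}[5m])) > 0

  # 3. Mutating requests (create, update, patch, delete) against the cluster.
  sum by (resource) (increase(apiserver_request_total{verb=~"(?i:create|update|patch|delete)"}[5m])) > 0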
In this case, the first query just detects any type of call that hasn't succeeded, while the following ones allow us to see more precisely someone trying to fetch a secret, which could contain tokens or API secrets that might even give an attacker access to resources outside our cluster, and also someone trying to actually modify the cluster by creating some pods, changing them, deleting them and so on. Why is it interesting to do this with metrics? You could say: OK, but there are scenarios, for example, where you cannot directly access the node or the API server, like managed services or secure environments. Of course, with Falco you can do that with plugins; Falco can read the audit log, even the audit log of managed services like EKS or GKE. But with metrics we can also add this five-minute granularity, or per-minute granularity, and different thresholds to reduce the noise, because it can be something occasional, it can be an application that is misconfigured and is trying to access something every minute or every five minutes, or it can be a real attack. So with PromQL and Prometheus, we can aggregate or do some fancy things to reduce the noise.

Let's move on to the next one. In this one, what we are looking at is the Ingresses that have been created in the cluster in the last five minutes. But why are Ingresses important? Well, of course, if someone managed to get access to our Kubernetes cluster, they might create an Ingress to allow them to connect in some other way in the future and gain persistent access to the cluster.

Let's go to another one. In these two queries, we are checking certificates that are close to expiring. In the first one, we are checking the expiration date of the certificates of the Kubernetes components that are talking with the API server, and in the second one, we are checking the certificates of the applications. But why are certificates important? In case our applications are using expired certificates, someone might try to do a man-in-the-middle attack and terminate the HTTPS connection early with their own fake certificates, and then either sniff the traffic or even alter it in transit.

Here we have a set of queries that are looking at networking errors. They are almost the same. Here you can see, for example, that the error rate for the 404 error code is bigger than a threshold. In the next one, the only thing we change is the error code that we are detecting; in this case it is the 403 error. And in the next one, it is the 500 error. But why are these important, and why are these different use cases? With these few alerts, which are actually quite similar, we can detect three different threats. In the first one, for a 404, an attacker might be trying to fingerprint our cluster, detect which applications are running, maybe even be able to find the versions, and then look them up in a CVE database and use some publicly available exploits. In the second alert, the 403, an attacker might be trying to brute-force some authenticated endpoint, and we might be able to detect that. And for the 500 code, someone might be running, for example, sqlmap or some other SQL injection tool, and in case they are able to find an injection, they might be playing around and breaking the endpoint for some time, and this can allow us to detect it.
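A sketch of what one of these error-rate alerts can look like, assuming the application or the ingress controller exposes a request counter with a status-code label; http_requests_total and its labels are placeholders here, not a specific exporter's metric:

  # Fraction of 404 responses per service over the last five minutes;
  # alert when more than 30% of the requests are 404s (possible fingerprinting).
    sum by (service) (rate(http_requests_total{code="404"}[5m]))
  /
    sum by (service) (rate(http_requests_total[5m]))
  > 0.3

  # The 403 variant (brute-forcing authenticated endpoints) and the 500 variant
  # (breaking endpoints, for example with SQL injection attempts) only change the code selector.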
Another important thing is that here we set a static threshold of 0.3, but with Prometheus we have the ability to compute a variable threshold in real time, through time and group anomalies, and actually update this threshold as the behavior of our cluster changes. And this is interesting because it is something that we will use in the next examples. For example, this one. The biggest part of the query is fancy math that, if you are interested, you can look at, but if not, there is no need: it is just calculating the threshold through the z-score that I introduced in the previous slides. What we are looking at here is the inode usage of all the volumes in the cluster, and checking if any of them has a higher rate of usage. In this case, maybe an attacker is trying to do a symlink attack, brute-forcing file symlinks, or maybe they are DDoSing us, and we can detect this with this alert.

This is a similar one, but in this case, imagine that you have a Deployment, or a StatefulSet, or a DaemonSet that has one pod per node, and you have a lot of pods around the cluster. What we are doing here, with some black magic in the purple part of the query, is making groupings. So what we are saying is: take all the containers that have the same name across the whole set of pods that should behave the same, and let's check if one of them is using more CPU than all the rest. Normally we would expect a Deployment to have more or less the same CPU across all the replicas, and if one of them has higher CPU usage over a long period of time, it might be an indication of a crypto miner being installed. Of course, usually replicas are behind a load balancer, so at any given moment one of the replicas is going to handle a single request and its CPU, or even its memory, might jump. Using Prometheus, we can reduce the noise and smooth out the curve through the rate function and only check the average over the last five minutes. So if the CPU is higher for a prolonged amount of time, it might be an indication of something going wrong.

Now we want to introduce two scenarios that cover a similar threat but are implemented in different ways. In the first one, we are doing something similar to the CPU example that we just saw. Imagine that you have a deployment with 20 replicas, 50 replicas, or whatever, behind a load balancer, and they are receiving and sending data. Normally, they should all receive and send roughly the same amount of data, but in this case what we are doing is saying: OK, let's detect if one of the pods, among the 20 or 50 that I have receiving the same requests, is sending more data than the rest. This is a group anomaly. As we saw before, there are other kinds of detection, and we can play a different game and say: OK, let's see if, in each of the pods, during the last hour the response size was this big and suddenly I have a spike this big. And in these two scenarios, what are we detecting and why is this important, Mirko? So both of these alerts are detecting possible data exfiltration. In the first case, maybe someone manages to get access to the database through some service or exploit, and they might be dumping the entire database, or at least some tables.
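A sketch of the group-anomaly version, using the cAdvisor metric container_network_transmit_bytes_total; the workload label used for grouping is illustrative, since in practice you would have to derive it from the pod name or by joining against an info metric:

  # Flag a pod that transmits far more than its peers in the same workload:
  # current rate above the group average plus three standard deviations.
  # Note: the addition on the right binds tighter than the comparison.
    sum by (workload, pod) (rate(container_network_transmit_bytes_total[5m]))
  > on (workload) group_left
      avg by (workload) (sum by (workload, pod) (rate(container_network_transmit_bytes_total[5m])))
    + 3 * stddev by (workload) (sum by (workload, pod) (rate(container_network_transmit_bytes_total[5m])))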
And this allows us to see any spikes in the network traffic, and the second alert actually allows us to detect responses which are abnormally big, which might indicate someone fetching more data than they are supposed to. For sure there are more use cases, for sure there are more metrics that we can check, but this is an exploratory talk and we wanted to open some minds. Any of the examples that we showed you can be tweaked: we can use them with static thresholds like we saw in the first examples, or we can turn them into group anomalies or time anomalies, or we can use regression functions if we want to get fancier.

Let's move to another topic that we identified under the umbrella of defensive monitoring that can be interesting. Do you want to tell us about that? Yeah. So, sorry, we skipped a slide. Essentially, what we are trying to do is take the data we already have in Prometheus, because most of us are already monitoring with Prometheus, take some of the metrics, and manipulate the data to gather more information than we have right now. By default, kube-state-metrics, which is the most common exporter, exports a metric called kube_pod_info. It's a special metric because the value itself doesn't carry any information; it's always 1. The important thing is the labels of the metric. This metric is emitted per pod, and it contains labels with the pod name, the namespace, and who created it: whether it was a ReplicaSet, a DaemonSet, or a Deployment, and also the name of that owner. And the power of Prometheus is that we can take this metric and look into the past by five minutes, an hour, two hours, or even two weeks, and see how this data has changed. If we take this data, process it through some scripts and feed it to some front-end diagram-generating tools, we can actually draw the topology of our cluster as a graph. And depending on how we collect the data, because we might not be using kube-state-metrics, we might be using some agents or some other exporters, we might be able to get a different amount of granularity for this tree.

And why is this possibly useful for security? Well, whenever we monitor our Kubernetes cluster, we're gathering probably millions of metrics, and these metrics can be quite difficult to understand, even if we aggregate them at a really high level. So we can take this kube_pod_info, process it, and essentially spew out a diff of our cluster over time. This is an example with a single cluster and two workloads. In one of the workloads we have four pods, and in the other three. And we can see that in one of the workloads two pods have been deleted, which might be normal because we downscaled our deployment, while in the other workload a pod has been created and one of the pods has been changed. This can be useful, for example, for a blue team wanting to understand what happened after an attack: what was changed, what was created, what was deleted, and how the attacker went through the process of exploitation. And we don't have to stop at Kubernetes-level metrics; we can also use service-level metrics. For example, if we're running a service mesh like Istio or Linkerd, most of these service meshes emit metrics about the requests that are being issued, and most of these metrics contain information about the source and the destination, either at a pod level or a service level.
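As a sketch of the kind of queries those scripts can start from: a diff of kube_pod_info over time, and the edges of the request graph built from Istio's standard metrics (istio_requests_total and its source/destination labels are Istio-specific names; other meshes label them differently):

  # Pods that exist now but did not exist one day ago, with their owners;
  # a rough "diff" of the cluster over time based only on kube_pod_info.
  count by (namespace, pod, created_by_kind, created_by_name) (kube_pod_info)
    unless
  count by (namespace, pod, created_by_kind, created_by_name) (kube_pod_info offset 1d)

  # Edges of the network graph: who talked to whom over the last hour,
  # and how many requests flowed over each edge.
  sum by (source_workload, destination_workload) (increase(istio_requests_total[1h]))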
And as we did with the Kubernetes info, we can take these metrics, process them and spew out a diagram, now representing not the workload topology but the network topology of the cluster. This is only a screenshot, so we cannot hover over the nodes, but if we could, we would see that the triangle in the center represents a Kafka cluster, and we can see the master talking with the followers. And this is a ZooKeeper cluster right here, and we can see the leaders talking to each other. This can be really useful because we can see at a glance who is talking with whom. And if, for some reason, ZooKeeper is talking with Mongo or Postgres or some other service that it shouldn't be talking to, it might indicate an attacker getting access, for example through an RCE, executing queries against the database and exfiltrating the data through some other service.

Well, as I promised, we are presenting three different areas that we identified for defensive monitoring. The last one is who is watching the watchers: monitoring Falco. The first idea you can have about monitoring Falco is monitoring resources: making sure that Falco has enough memory, enough CPU, and that it is not being throttled. But let's go a little bit further, because Falco can expose metrics. Not directly, but there are two ways to expose Falco metrics. As you can see here, the first one is using the Falco exporter. The Falco exporter is a great tool that you install alongside each Falco instance, so you have one per node, and it exposes Falco metrics directly. There is an alternative option, which is using Falcosidekick, and it has some advantages. The first one is that it is just one per cluster, so the resource footprint is much lower. It also exposes the same kind of metrics as the Falco exporter, plus extra metrics for the outputs it sends, and it allows you to add extra labels to the metrics, in case you want labels for different environments or whatever you need.

Why is this important? What can we do with this? Well, again, as with all these examples, this one is a simple threshold; we want to focus on what you can do with the Falco metrics, but you can also use anomaly detection, time anomaly detection, whatever. For example, we can check how many times a given rule has been triggered in the last five minutes. This can be useful to reduce noise: as we said, maybe you have something that is misconfigured and is trying to do things from time to time and you want to remove that noise, or you want to see what the patterns of possible attackers are, when the attempts are made, at what time of the day, on which day of the week. You can do these kinds of things with this. Also, you can ask how many critical events you are having, and even break them down by severity to compare how many low, medium, warning and critical events you have. This can help you build your reports. You can also use time anomalies and group anomalies to see which rule is triggered more than the others. Also, with Falcosidekick, you have the ability to check if one of the outputs is broken, so it allows you to identify when you are not getting notified when you should. And this is who is watching the watchers as well: why am I not getting notifications? Maybe nothing is happening, or maybe the channel is broken.
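A few sketches of this last idea; the metric and label names below follow the conventions of the Falco exporter (falco_events) and Falcosidekick, but they are assumptions and may differ between versions, so check your own /metrics endpoints:

  # How many times each rule fired in the last five minutes; a simple
  # static threshold to cut the noise of occasional, expected triggers.
  sum by (rule) (increase(falco_events[5m])) > 5

  # Breakdown of events by priority over the last day, useful for reports.
  sum by (priority) (increase(falco_events[1d]))

  # Who is watching the watchers: alert when one of the Falcosidekick
  # outputs starts failing (metric and labels assumed here).
  sum by (destination) (increase(falcosidekick_outputs_total{status="error"}[15m])) > 0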
So, I hope you enjoyed the talk. I hope we opened some minds and raised some awareness about why metrics can be a complement to the security tools that we are using. We would like to hear your feedback and your ideas, and to explore this new area of defensive monitoring. Thank you so much.