Hello everybody, and thank you so much for joining my talk, a 10-step guide for integrating security metrics into your observability stack, at Prometheus Day 2022. My name is Anaïs Urlichs. I'm the open source developer advocate at Aqua Security; before that I was working as a site reliability engineer at a cloud native startup in the UK. I also have a four-month-old puppy that is keeping me quite busy and prevents me from traveling right now. Here she is. So I'm really excited to present to you over the screen instead.
Last year my manager at the time and I gave a talk about our observability stack: the setup of our super clusters and all of the tools and operators that we used to get insights out of those clusters and out of the tenant clusters, so that we could act upon them. At the time we talked very little about any security-related tooling that we were using or integrating with. When I joined Aqua, I thought to myself that site reliability engineering and cloud native security are actually closely related, and that you can easily integrate cloud native security into your observability stack. That's what I'm going to show you in these next 10 steps.
When you're integrating with a new area of the cloud native space, with a new set of tools, you will experience different emotions along the way. It might all make sense at one point, and in the next moment it's all very frustrating and nothing seems to work. But I promise you there's always a finish line where things look nice and work within your environment, within your cluster. In the next slides I'm going to walk you through cloud native security scanners and how to integrate them into your monitoring stack, and along the way I'm going to outline some of the dos and don'ts.
Now the first step is understanding your need.
Your need will be different depending on the size of your team: whether you're an individual contributor, the maintainer of an open source project, or working in a large-scale team that has a dedicated security team also working on the security of your applications. Here is a list of needs that the Wise engineering team wanted to solve for by developing their security service platform integrations. First, they wanted to assign ownership of vulnerabilities, meaning vulnerabilities cannot just get ignored; they have a clear owner. Second, they wanted a global view of the security state of services: not everything broken down into individual reports on specific applications, but a global view across them. Next, they wanted dashboards for different users and requirements, so that different people within a team can look at different components of the security state of the services. And last, they wanted to overcome the difficulty of using different UIs. A lot of times when we integrate with new tools and platforms, we end up having to use a new UI. The problem is that we have to open those new UIs separately and introduce completely separate workflows, and a lot of times we end up not opening those extra tabs but sticking to what we already know. Here's the link to the blog post. I highly suggest the read; it's really detailed and well explained.
So once we understand our need, the second step is to choose a cloud native security scanner. If you follow me on Twitter you might have come across this graphic of open source security scanners across the cloud native space. Since I want to integrate the security scanner into our observability stack, which runs as Kubernetes resources within our cluster, we're going to focus on in-cluster security scanners. The main tools there are Trivy and Kubescape.
I'm going to use Trivy because I'm biased, but you can use any tool as long as it produces metrics from its security reports and allows you to export them. The additional benefit of running a security scanner within your cluster is that if everything is a Kubernetes resource, you can use the same processes across your stack. So if we're using an in-cluster security scanner, we can integrate it with all of our other tools, such as our observability stack.
That leads us to step three: setting it up and making sure everything is running properly. That's probably the most difficult step, and it will often require lots of trial and error. Here's the process. First, identify the best installation option. I go with Helm by default. There are other options, such as installing the manifests directly via kubectl; kubectl and Helm are the most popular options. Next, you have to decide on the configuration that you want for those Helm charts, for those applications. You can see here a screenshot of my demo repository, which you can also find linked at the end of the slides, where I have stored some of the values files that I passed into my Helm charts upon installation. Then you have to test the custom configuration. A lot of times things don't work the way they are described in the documentation; even the demo examples in the documentation usually don't work. So you have to test things out and make sure everything actually works together.
So here's an overview of the cluster. We have our application namespace; that could be anything, right? Then we have our monitoring namespace with Prometheus and Grafana running. I'm a huge fan of kube-prometheus-stack, which installs pretty much everything at once, which is great. You can obviously also use additional tools; I'm going to highlight Loki towards the end as well.
But our main focus in this talk is really on Prometheus and Grafana and integrating our security scans with them. Our security scanner is the Trivy Operator. It runs as a normal Kubernetes operator, like you're used to, inside of your cluster, and it monitors all of your resources, or at least the resources that you want it to monitor, and runs security scans on them. For instance, if you spin up a new container image, the Trivy Operator will notice that and scan it for vulnerabilities. Here's what you will see in the trivy-system namespace: a normal deployment, replica set, and service running. That will produce metrics: metrics on your vulnerabilities; metrics on your exposed secrets (are there any exposed secrets within your cluster?); and metrics on your RBAC configuration and on any misconfigurations of your running resources.
Now, this is obviously lots and lots of metrics, and really difficult to filter through. The more applications you have running within your cluster, the more difficult it will be. That's where we need Prometheus and then nice dashboards in Grafana. I could filter for specific metrics through Prometheus. In this case, in step four, I went ahead and set up a dashboard in Grafana. I took a dashboard that's out there in the Grafana dashboard repository. There are lots of different dashboards for Trivy; we also have an official Trivy dashboard that looks similar to this one. You can see here that we have several different types of vulnerabilities, and that's our first filter point. We can filter, for example, by critical vulnerabilities. We have a total of 175 vulnerabilities. This part of the dashboard just shows the vulnerabilities.
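To make the "filter by severity" idea concrete, here is a minimal Python sketch that sums vulnerability metrics by severity from Prometheus-style text. It is only an illustration: the sample text is made up, and the metric name `trivy_image_vulnerabilities` and its `severity`, `namespace`, and `image_repository` labels are assumptions about what the operator exports, so check them against the metrics endpoint in your own cluster.

```python
import re
from collections import defaultdict

# Made-up sample of Prometheus text exposition output (not real scan data).
# The metric and label names are assumptions for illustration.
SAMPLE = '''\
# HELP trivy_image_vulnerabilities Number of container image vulnerabilities
# TYPE trivy_image_vulnerabilities gauge
trivy_image_vulnerabilities{namespace="app",image_repository="library/nginx",severity="Critical"} 3
trivy_image_vulnerabilities{namespace="app",image_repository="library/nginx",severity="High"} 12
trivy_image_vulnerabilities{namespace="monitoring",image_repository="grafana/grafana",severity="Critical"} 1
trivy_image_vulnerabilities{namespace="monitoring",image_repository="grafana/grafana",severity="Low"} 20
'''

# Matches lines of the form: metric_name{label="value",...} 42
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse labelled samples into (name, labels, value) tuples.

    Comment lines (# HELP, # TYPE) and lines that do not match are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def totals_by_severity(samples, metric="trivy_image_vulnerabilities"):
    """Sum the values of one metric, grouped by its severity label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric:
            totals[labels.get("severity", "Unknown")] += value
    return dict(totals)


if __name__ == "__main__":
    totals = totals_by_severity(parse_samples(SAMPLE))
    for severity, count in sorted(totals.items()):
        print(f"{severity}: {count:.0f}")
```

In a real setup Prometheus does this grouping for you; under the same naming assumption, a query along the lines of `sum by (severity) (trivy_image_vulnerabilities)` would give the per-severity totals that a dashboard panel shows.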
We can look at other parts of the dashboard for the other reports, but ultimately this is already a lot easier to consume than looking at the metrics directly.
So the next step, step five: avoid vulnerability hell. In this case, I don't have much running within my cluster, and I already have 175 vulnerabilities. The more you have running, the more vulnerabilities you will probably have reported in your security scans. I took this screenshot from Alex Jones on Twitter, saying, "I just give up and die." He had, in his cluster or in whatever resources he was scanning, over 500 vulnerabilities. In the screenshot they are divided by severity, which is already great. But ultimately, when you're faced with such a large number of vulnerabilities, you probably just want to run away screaming and never look at it again. This is what we have to work on, so that it doesn't become this awful experience where the first time you run a scan you see that you have lots and lots of vulnerabilities, and then you never look at the dashboard again and just ignore it. And that's actually one of the problems with security scanning: in the end, you don't need security scanning to deploy your application. You can deploy an application and have happy customers even if you have highly critical vulnerabilities within your cluster, within your application.
So here are some possible strategies to avoid vulnerability hell. The first one is to ignore all but critical vulnerabilities. Just look away. No, this is a great way to get started, because ultimately you have to get started somewhere, and that's usually the most difficult part: where do I get started? Once you have taken that step, you can expand and look at other security issues as well. But start with the critical ones.
Next: don't scan everything at once.
There's really no need to scan all of your resources at once. Scan the most used, most critical resources within your infrastructure and application stack first, and then expand from there. Then, filter by known vulnerabilities that have a fix available. If you have over 500 vulnerabilities within your resources, but, let's say, half of them don't have a fix available yet, you don't have to pay attention to them, right? The next thing is to filter vulnerabilities by team and by application and make vulnerabilities context specific. Unless you're the sole maintainer of a project, there's no need for one person to look at all the vulnerabilities and all of the security issues by themselves at once. You can be responsible for a specific part, for a specific type of security scan, not everything.
Step six: what are metrics without alerts? It's easy to ignore vulnerabilities, and that's why we want to set up alerting. This is an example screenshot of me setting up an alert in Grafana that notifies me about new critical vulnerabilities within my cluster. You can also set up alerts with Alertmanager; there is an example in the demo repository that I'm using, so you can check that out as well. But ultimately, since it's easy to ignore vulnerabilities, always make them scream at you until you fix them. That's the best way, since otherwise you can just look away, right? So make it uncomfortable for yourself.
That leads us to step seven: correlate metrics. It's great to have security metrics within your Grafana dashboards and your existing monitoring stack, because while you look at security metrics you can also look at the metrics of your deployments, for instance. So you can look at your vulnerability reports, see "oh, there are lots more vulnerabilities, what has happened there?", and correlate that with your other dashboards.
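Both the step-six alert on new critical vulnerabilities and the step-seven question of "what has happened there?" come down to noticing when counts go up. Here is a minimal Python sketch of that comparison, assuming you already have critical-vulnerability counts per namespace from two points in time. The function name, the namespaces, and the numbers are all hypothetical; in practice a Grafana alert rule or an Alertmanager setup would do this for you.

```python
def new_critical_alerts(previous, current, threshold=0):
    """Return one message per namespace whose critical count rose.

    previous and current map namespace -> number of critical
    vulnerabilities at two points in time. A namespace missing from
    previous is treated as having had zero. Only increases larger
    than threshold produce a message.
    """
    alerts = []
    for namespace, count in sorted(current.items()):
        increase = count - previous.get(namespace, 0)
        if increase > threshold:
            alerts.append(
                f"{namespace}: {increase} new critical vulnerabilities (now {count})"
            )
    return alerts


if __name__ == "__main__":
    # Hypothetical snapshots, for example taken before and after a deployment.
    before = {"app": 3, "monitoring": 1}
    after = {"app": 5, "monitoring": 1, "payments": 2}
    for message in new_critical_alerts(before, after):
        print(message)
```

Keeping the namespace in the message is deliberate: it tells you which deployment dashboards to open next when you want to correlate the increase with what changed.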
The important thing here is that you have to define those workflows for yourself, for your use case, for the way that you like to analyze dashboards for your application. This will look different for everybody. You might not find this really useful; somebody else might find it very useful, right? So it's really about trial and error: figure out what works for you to make the processes more understandable.
So, step eight: some additional tips, some additional things that you can do. Some of them I've already mentioned, some I haven't yet. The first is to assign ownership; really make it somebody's responsibility. I'm not talking here about shifting left and just making everything the engineers' responsibility, because that doesn't work very well for security. Make sure that different people within your team, if you have a team, are responsible for different parts of the security of your stack. You might want dashboards that show the overall security state of your infrastructure and services, but ultimately break it down into different components and assign ownership for those components.
Don't introduce too many new tools, or too many new processes and workflows, at once. A lot of times people see those shiny new CNCF tools and think, "oh, we need all of them right now; we need to integrate them to be on top of everything." You really do not, right? Start small, start with tools that make sense to you, and maybe just implement security scanning in your CI/CD pipeline first, if you don't want to have it in cluster right away or to that extent. You can also use a combination of both, but don't change all of your workflows and integrate it everywhere straight away. Try to understand the way you integrate tools first, and then expand from there.
Utilize existing workflows, platforms, and processes. It's really self-explanatory.
People don't like to get used to new things. A lot of times new things end up not being looked at again, being ignored in the long term, and failing to be useful. That's why you want to utilize your existing workflows as much as possible. If somebody loves to work with Prometheus and Grafana, let them work with Prometheus and Grafana and similar tools, right? And then optimize based on what works for your team. Again, the initial setup might be the same for everyone, but everything else will be completely different. A lot of times when people have questions about the Trivy Operator, about Trivy and related tools, I can't give a straightforward answer because it will be so different depending on everybody's setup. So I really encourage you to explore different options in the setup, and also look at what other people have been doing: for example, how Wise set up their security scanning, and other projects as well. There are lots of projects that talk about how they've integrated with Trivy, so you can find lots of amazing resources.
Then the last step: don't stop at security scanning. You can also integrate with other security tools such as Tracee. Tracee uses eBPF to expose events at the OS level, so you can integrate other security tools into your stack in the same way. Obviously, that comes later on, maybe as the next step.
Lastly, here are some additional resources, including some that I mentioned. First of all, "Our application security journey" from the Wise engineering team; then the Aqua Open Source YouTube channel, where we have lots and lots of tutorials; and the Trivy repository and the Trivy Operator repository on GitHub. They are completely open source. You don't have to sign up, and you're not sending us any information related to your scans. And the last thing is the demo project that I used. You can also find us on Slack.
We have an open source Slack channel where people talk about Trivy and other tools. We are also going to be at KubeCon, so do find our booth and the open source team running around. Thank you so much for listening. This was really amazing. Thank you so much for having me. I hope you have an amazing day, an amazing rest of the conference, and an amazing KubeCon. Hope to see you at the next KubeCon. Bye-bye.