Welcome back, everyone, from lunch; we'll start with the first session post lunch. I'm very happy to welcome Nilesh. Nilesh is going to give a talk on securing Kubernetes: best practices and effective strategies. He has over eight years of industry experience, and I think we'll be getting a lot of insights from him. Thank you.

Thank you, and hi everyone. So I'm here to talk about securing Kubernetes and the effective strategies and best practices that you can apply. I'm not from around here; I'm from Sri Lanka, I flew in yesterday, and I also run a similar community back home called Cloud Native Sri Lanka. So let's look at the Kubernetes attack surface. Before I do, I just want a show of hands: who runs Kubernetes, as in manages Kubernetes clusters? That's a few of you, right? What do you use, OpenShift or Rancher, or just kubeadm? Rancher? All right.

Okay. So this is actually relevant to a lot of people. People who run managed Kubernetes in the cloud, GKE, AKS, EKS, only have to manage their application vulnerabilities, but for people who run their own clusters, the first entry vector is your control plane components. You have the API server running, you have the etcd API; all of your control plane components are there, so you have to protect those. The next one is application vulnerabilities. If somebody hacks into one of your containers, they can execute a remote shell, talk to your services, and mess up your cluster quite badly. Then, say your VMs are on a public network, connected over the internet or something like that. Then the kubelet API is another place you have to look at and protect, because the kubelet manages all the containers, the Docker daemon, and everything that runs on the node. You can do quite a lot of damage if you get kubelet access. Then finally, access to your virtual machines. If somebody gains SSH access or something similar into your virtual machines, they can again reach your services, because the network is still there, so they can do quite a lot of damage on that front as well.

Okay. So let's start with protecting your applications first, because that applies to everyone here, and then let's look at how to manage the cluster-level components. The first thing when you are doing application security is securing your CI/CD pipeline. That's the first thing you've got to do. In a CI/CD pipeline there are a few stages: at a very high level there's a source stage, then build, then test, and then deploy. Across all these stages you have auditing and monitoring. At build and test, you have static security testing. Across all the stages again, you have vulnerability testing. And at the deploy stage, you have runtime security. I'll just talk about auditing and monitoring for a second, because I don't have a slide on that. At the source level, it's basically developer best practices, raising proper PRs and doing all of that, which enables you to do proper auditing and monitoring. At the build and test levels, it's the way you run your static scanning, log the results, and create Git issues if there are problems, or if you ignore issues, the whole process that you follow. And at the deploy level, it's basically the logs of your runtime security.
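Before moving into application security, a quick sketch for the kubelet attack vector mentioned a moment ago. This is a minimal, hedged KubeletConfiguration fragment; how it gets onto the node depends on your provisioning (kubeadm, RKE, and so on), but the fields shown are standard kubelet settings that turn off anonymous access and the legacy read-only port:

```yaml
# KubeletConfiguration fragment (kubelet.config.k8s.io/v1beta1) for hardening
# the kubelet API; delivery to the node depends on how you provision it.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false      # reject unauthenticated requests to the kubelet API
  webhook:
    enabled: true       # validate bearer tokens against the API server
authorization:
  mode: Webhook         # delegate authorization (SubjectAccessReview) instead of AlwaysAllow
readOnlyPort: 0         # disable the legacy unauthenticated read-only port
```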
So, static security testing. The first one I'm going to talk about is image vulnerability scanning. I believe all of you know how images work. An image has a base layer, and each image is built with layers on top of it. So if at least one layer is compromised or has a vulnerability, it carries over to the other layers, as you can see from the example. By exploiting a vulnerable image that is running, you can do quite a lot of damage: privilege escalation, remote shell access, information leaks. For example, I've seen certain companies bake credentials into the image rather than mounting them; I've seen that quite a lot. I see some of you smiling, maybe you're doing that: don't do it. And DDoS attacks and all of that, right? So, to avoid information leaks, scanning alone is not going to help you; it's a developer practice that you have to follow: don't bake credentials into images. But other vulnerabilities you can quite easily scan for and fix using tools like Clair, Trivy, Snyk, and a few others. Also, a rule of thumb: start with small images like Alpine, or official images; they have the smallest set of vulnerabilities. And there are scenarios where a vulnerability has not been fixed yet. In my company, what we do in such cases is create a Git issue, ignore the finding, move forward, and fix it later if it's not critical. But it depends on the processes, the risk assessment, and everything else your particular company has.

Right, the next one is code vulnerability scanning. Basically, scan your entire codebase. There are tools like SonarQube, et cetera, which run quality gates, scan for hard-coded passwords and things like that, and help you figure out whether there are vulnerabilities in the packages you have imported into your code, and how to fix those. Again, this goes with the security practices and policies your company has. There are certain things you will obviously have to ignore and move forward with; there are scenarios where you can't do that and somehow have to fix it before you push.

Then finally, configuration scanning. There are tools like Checkov, Kubesec, Conftest, et cetera, which scan your configuration files, like the YAMLs you apply via Argo CD. They test those against a bunch of policies or best practices and tell you, for example, that you haven't set a security context on your deployment and should do so as a best practice. So they allow you to enforce security standards, and in your enterprise you can have your own set of rules and apply them, so that there are no surprises when you deploy YAMLs to the cluster.

So I thought I'd talk about the Kubernetes admission controller here. This is how it works if you are using something like Open Policy Agent, with which you can say you're only allowed to run images from a certain registry and nothing else. The admission controller has two webhooks, a validating webhook and a mutating webhook. What the mutating webhook does is, according to a given logic and a bunch of annotations, actually change the content of the YAML before it gets applied and executed. The validating webhook, on the other hand, only inspects your YAML and then decides whether it should go forward and get applied, or get denied. And these plug into third-party policy controllers like Open Policy Agent.
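Rewinding to the image-scanning stage for a second, here is a minimal sketch of a CI gate. It assumes GitHub Actions and the aquasecurity/trivy-action action; the registry and image names are placeholders:

```yaml
# Hypothetical CI job: build the image, then fail the pipeline on HIGH/CRITICAL findings.
name: image-scan
on: [push]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t registry.example.com/app:${{ github.sha }} .
      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: registry.example.com/app:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: '1'       # non-zero exit fails the job when findings exist
```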
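And for the registry-restriction example, a hedged sketch using OPA Gatekeeper. It assumes the K8sAllowedRepos ConstraintTemplate from the open-policy-agent/gatekeeper-library is installed; registry.example.com is a placeholder:

```yaml
# Constraint: admit only pods whose images come from the given registry prefix.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-repos
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      - "registry.example.com/"   # placeholder registry prefix
```

Anything with an image outside that prefix is then rejected at admission time by the validating webhook.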
Yeah, so that's how the Kubernetes admission controller works. And if you do employ something like Open Policy Agent, this is how it would look once you enforce certain rules on Kubernetes configurations.

The next one is container hardening. How do you harden your containers? Pretty basic stuff. Removing Bash and the shell is one of the best ways to stop people from exec'ing into your container and doing things. If you look at the Kubernetes API server and the other control plane images, you can't really shell into them, because Bash and the shell have been removed. The next one is making your root file system read-only. This makes sure that your container is stateless and that no intruder can come and write things into your file system. And if you're running something like NGINX, which actually does write to certain paths, then you use emptyDirs: you mount them at those specific paths, which allows writes there, while the base container image file system stays read-only. Then finally, run as a non-root user. Certain images do need privileged users; for example, if you're running Falco, you need a root user. But other than that, for normal applications like your Java app, your Python app, et cetera, which are non-system apps, make sure to run them as non-root, so that even if somebody gets in and attacks your container, they are restricted within it and cannot get to the VM level.

Now, say the image is already built and you don't have control over the image; all you control is how you run it. There are a few things you can do there as well. You can use a startup probe to execute a shell command that removes Bash from your container; that's quite easy to do. Then there are the securityContext settings: you can set runAsGroup, runAsUser, and runAsNonRoot. And also, as I mentioned earlier, enforce these rules using Open Policy Agent. So those are the strategies you have for container hardening.

Next, container runtime security. This is for when we've done all our scans and everything is fine, but at runtime there is malicious code running that we don't know about, maybe Bitcoin mining or something like that, which we haven't caught at the static scanning level. A little intro to how containers work. As you can see, the physical hardware sits below, then we have the Linux kernel, and then we have Linux containers (LXC) running; that's the virtualization layer, and then you have containers running on top. All of these containers talk through syscalls to the VM's kernel to execute processes and commands. So if there's a vulnerability in your kernel, a container can actually exploit it and then attack other containers on your VM. In a single-tenant scenario, this is not a big deal. But if your cluster is, say, multi-tenant, if you have outsourced certain namespaces to third-party software vendors or whoever else is building on your infrastructure, this could be a problem. That's where container sandboxing comes into play. Sandboxes like gVisor, Kata Containers, et cetera isolate your container processes in such a way that the kernel cannot be exploited to attack other containers.
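Pulling the hardening settings above together, here is a minimal sketch of a pod spec; the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                        # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true                      # kubelet refuses to start the container as root
    runAsUser: 10001                        # arbitrary non-root UID
    runAsGroup: 10001
  containers:
    - name: web
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        readOnlyRootFilesystem: true        # base image file system cannot be written to
        allowPrivilegeEscalation: false
      volumeMounts:
        - name: cache
          mountPath: /var/cache/nginx       # the one path that genuinely needs writes
  volumes:
    - name: cache
      emptyDir: {}                          # writable scratch space, discarded with the pod
```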
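And for the sandboxing option, a sketch of wiring gVisor in through a RuntimeClass, assuming its runsc runtime is already installed on the node:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                   # gVisor's OCI runtime, must exist on the node
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app            # hypothetical name
spec:
  runtimeClassName: gvisor       # run this pod's containers inside the gVisor sandbox
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
```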
If container sandboxing is not an option, the other option you have is AppArmor or seccomp profiles. Using either of them, you can restrict the processes or syscalls hitting your kernel: you can say these are the syscalls that are allowed, and anything other than that is not allowed. There are standard default seccomp and AppArmor profiles available, so you can use those to run your normal applications. Then there are tools that continuously monitor what happens among your runtime processes. One such tool is Falco, from Sysdig, which is open source and comes with a default rule set as well. Using it, if somebody executes a remote shell or writes into the file system, it automatically raises an alert and you can act accordingly. So those are the strategies for managing container runtime security.

Next, a general rule of thumb for secrets: don't use environment variables; always mount them as files. And if you are managing secrets, it's recommended to use an external vault like HashiCorp Vault, or, in the clouds, the secret managers they already have available. And as a matter of developer discipline, do not log sensitive information. I've been with organizations where there were PII and other issues because they logged tokens and so on when an error was thrown. So yeah, that's discipline you have to keep.

Then this is for if you're running multiple tenants on your Kubernetes cluster: isolating your network. One mechanism is namespaces, but I think a previous talk already covered that namespaces are a very weak isolation, so you have to harden them. One way of hardening is network policies. Using network policies, you can quite easily say the containers in this namespace cannot talk to the containers in the other namespace; it's sort of like a firewall. And if you have API gateways, you can make sure that even internal services go through the API gateway with authentication. Or use mTLS with a service mesh; I think Cilium now has mTLS with eBPF. But here's the thing: just because you installed Linkerd or Istio and enabled automatic mTLS doesn't mean you are secure. mTLS basically identifies who is talking to whom, that's it. You have to attach network policies too, so that you can govern and say only this can talk to that. mTLS just encrypts your network traffic, which is good, but you need the whole solution. That's it for isolating your tenants on the network.

Here's a small example of how network policies work. As you can see, there's an allow-all ingress policy on the simple-app namespace, and requests come in to the web server. Within the namespace, we have added network policies so that the web server cannot talk to the database: the web server can only talk to the Python backend, and the Python backend can talk to the database. Because the web server is the publicly exposed one, if somebody hacks into it and gains shell access, they can only attack the Python backend; they cannot attack the database directly. So that is how you do network isolation.
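Here is a minimal sketch of those two policies. The pod labels (app=web-server, app=python-backend, app=database) are assumptions about the demo:

```yaml
# Ingress to the backend is allowed only from the web server.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-only-from-web
  namespace: simple-app
spec:
  podSelector:
    matchLabels:
      app: python-backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web-server       # only the web server may reach the backend
---
# Ingress to the database is allowed only from the backend.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-only-from-backend
  namespace: simple-app
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: python-backend   # the web server cannot reach the database directly
```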
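And circling back to the seccomp option from the start of this part, a sketch of opting a pod into the runtime's default syscall profile; names and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-default          # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault       # the container runtime's default syscall allowlist
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
```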
Right. So the next one is protecting your control plane; I think I'm about on time as well. So, hardening your Kubernetes cluster. I heard from some people that they have created clusters using Rancher, and I think Rancher does enable audit logging by default. But just to make sure: I think in the default setup they don't mount the audit policies and then mount the logs back onto your VM so that you can look at them later. That's very important. When you enable audit logging via RKE or anything else, you have to make sure you give a path to write the audit log to. And the path alone is not enough: you have to mount a hostPath. You basically provide a path on the host and mount it into the API server so the log is actually retained; even if the API server dies and restarts, the logs are retained on your master node. That's very important, because I've seen a lot of enterprises where, when I ask, they say, oh, we have audit logs enabled, and I say, can I see them? And when you get onto the VM, there are no audit logs to be seen. They have actually enabled audit logs, but they're being written inside the API server container, not on the VM. So you have to mount a hostPath volume so that the logs are retained.

The other one is using CIS-hardened node images. Even on Azure or Google, if you're running your own clusters rather than the managed services, they have CIS-hardened images in their marketplaces; use those to spin up your nodes. And on-premises, I think there are CIS-hardened Ubuntu images and so on available, so you can use those.

Then another one is encrypting etcd. I don't think Rancher encrypts etcd by default, I'm not sure, but it's a good practice for you to encrypt etcd as well. Say somebody hacks into your VMs, or somebody from Mission Impossible comes into your enterprise, right? They hack into your VM, but when they look at etcd, it's encrypted. Of course, if it's Mission Impossible they'll decrypt it, but in the real world I think they'll find it hard.

And I already mentioned this: mount your secrets as files, not as environment variables. Because if they're environment variables, what happens is, if somebody hacks into your VM, they can actually list the environment variables via the process; there's a way to do it. But if they're files, you can't do that, and that's why it's recommended to mount your secrets as files. And also, don't keep everything in Kubernetes Secrets alone; use something else, like HashiCorp Vault, or even Bitnami Sealed Secrets is fine.
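To make the audit-log point concrete, here is a hedged fragment of a kubeadm-style kube-apiserver static pod manifest. The paths are placeholders; the point is the hostPath mount that keeps the log on the node across API server restarts:

```yaml
# Fragment of /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm-style);
# both the policy file and the log directory come from the host.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
        - --audit-log-path=/var/log/kubernetes/audit/audit.log
        - --audit-log-maxage=30            # days to retain rotated audit logs
      volumeMounts:
        - name: audit-policy
          mountPath: /etc/kubernetes/audit-policy.yaml
          readOnly: true
        - name: audit-log
          mountPath: /var/log/kubernetes/audit
  volumes:
    - name: audit-policy
      hostPath:
        path: /etc/kubernetes/audit-policy.yaml
        type: File
    - name: audit-log
      hostPath:
        path: /var/log/kubernetes/audit    # survives API server restarts
        type: DirectoryOrCreate
```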
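And for etcd encryption at rest, a minimal sketch of the EncryptionConfiguration the API server reads via its --encryption-provider-config flag; the key material is a placeholder you must generate yourself:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]           # encrypt Secret objects in etcd
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>   # placeholder
      - identity: {}                 # fallback so data written before encryption stays readable
```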
And if, let's say, me, Nilesh, I have the cluster role of reading secrets, if I do a cluster role binding to me, then I can actually read secrets of all namespaces. But if I do a role binding with a namespace, attached, I can only read from that namespace. So there's no way in Kubernetes you can do, like Nilesh can only read these namespaces. There's no way. If you want to do something like that, you have to do role bindings per namespace. That's how you do it. And this is an example. I just gave an example too, so I'll just quickly go through. John has a role binding. So there are two secret roles, read secret role, and read write secret cluster role. And then on full namespace, John has the read secret role bind into John, so he can actually read secrets in the full namespace. But if you look at Jane and Admin, Jane is connected to the cluster role with a namespace bound role binding. And Admin has a cluster role binding to the cluster role. So as you can see with the red markers, that Admin can actually read all the namespaces, whereas Jane can only read and write into the full namespace. So that's it. All right, so we are at the end of my presentation. So final thought is, in Kubernetes, it's not new now, but for a lot of security engineers and et cetera, it's still new. They are figuring it out. And there are vulnerabilities exploited every day. And if you are running cloud-managed Kubernetes, you only have to worry about your application security. Everything else is managed by the cloud control plane. And if you do run with proper best practices and proper developer self-discipline, you should be all right. So that's the end of my presentation. Thank you so much. Any questions? Right. What level should we use, PSP, or what are your thoughts? So her question was, I think, pod security policies versus open policy agent. As far as I know, with Kubernetes 1.23 or 2.2 onwards, pod security policies were deprecated. And they are also enabling you to work dynamically with something like open policy agent to have some set of rules and then go through that. Because I believe pod security policies are a bit limited on scope. That's why they deprecated it. So now the Kubernetes community, Kubernetes is also recommending you to use something like OPA to have set of rules and then use their gatekeeper to enforce those with the webhooks that I explained in the slide. All right. Any other questions? Yeah. Day nodes, master nodes, and the EDSD will be there. The synchronization between the applications and the ramp with request comes, load will be very high. So in that case, when we do for each request is there an encryption mode. So it will increase the load of performance in the grid. So how we can increase the input mode if the performance is very important for the EDSD type? Right. So interesting question. So his question was, when you have multiple EDSD nodes and when you enable encryption, and then if you have a lot of load coming in and if performance is your core requirement, how do you do it, whether there's a problem? So I personally, there are limits in EDSD as far as I know as well. And of course, encryption and decryption would give you a performance, a slight performance one as well. So I actually managed the Sri Lankan government election Kubernetes clusters. So we did encounter that as well, because on election nights there are a lot of requests coming in and polling and all of it. But here's the thing. It's a trade-off, either it's performance or security. 
All right, so we are at the end of my presentation. My final thought is: Kubernetes is not new anymore, but for a lot of security engineers and others it's still new; they are figuring it out, and there are vulnerabilities exploited every day. If you are running cloud-managed Kubernetes, you only have to worry about your application security; everything else is managed by the cloud's control plane. And if you do run your own, with proper best practices and proper developer self-discipline, you should be all right. So that's the end of my presentation. Thank you so much. Any questions?

Q: What should we use, pod security policies, or what are your thoughts?
So her question was, I think, pod security policies versus Open Policy Agent. As far as I know, from Kubernetes 1.21 onwards pod security policies were deprecated, and they have since been removed. The community is also enabling you to work dynamically with something like Open Policy Agent, where you hold a set of rules and requests go through them. I believe pod security policies were a bit limited in scope; that's why they were deprecated. So now the Kubernetes community is also recommending you use something like OPA to hold a set of rules, and its Gatekeeper to enforce them with the webhooks I explained in the slides. All right, any other questions?

Q: There will be data nodes, master nodes, and etcd, with synchronization between them. When a lot of requests come in, the load will be very high, and encrypting each request will cost performance. So how can we handle that when performance is very important and etcd is encrypted?
Right, interesting question. So his question was: when you have multiple etcd nodes and you enable encryption, and you have a lot of load coming in and performance is your core requirement, how do you do it; is there a problem? As far as I know, there are limits in etcd, and of course encryption and decryption do give you a slight performance hit as well. I actually managed the Kubernetes clusters for the Sri Lankan government elections, and we did encounter this, because on election nights there are a lot of requests coming in, polling and all of it. But here's the thing: it's a trade-off, either performance or security. In the government scenario, it was security for us, so the performance issue was a trade-off we had to endure. Sometimes those are trade-offs; there's no magic solution for things like that. One more question.

Q: About these security settings, the read-only file systems and all of that: even if I set them, how can I monitor them across all my nodes? Say I'm running a big cluster, or multiple clusters. Are there tools that show what open security issues exist? Because maybe some teams implemented them and some didn't; how can I manage that?
Yeah. So this is where the build-versus-buy question comes into play. There are tools that allow you to do that, but you've got to pay for them. I don't know whether I should name vendors, but I think I can talk about Sysdig, right, the CNCF one. Sysdig is one such tool; then there are others, NeuVector and a few more, that do it for you. But if you were to do it yourselves using only open source, you'd probably need a good platform team, with Kubesec, sorry, Falco set up, Open Policy Agent set up, and continuous scanning. Because say I scanned my image today, it passed, and it's now running on the cluster, and then you haven't done a deployment for a week or a month. Remember the Log4j incident? That was a vulnerability that had existed in Log4j for a while. So you have to continuously run static scans against your stuff as well. But Sysdig will automatically do the live monitoring.
Q: So in Sysdig, can we see live monitoring? Will it automatically keep looking, continuously?
It does, as far as I know. Don't take my word for it, check it out, but as far as I know, they do.
Got it. Okay, thank you.
Okay. Thank you. All right. Thanks a lot. Thank you. Thank you so much.