I'm so happy to be here at KubeCon. This is my first talk after almost three years of not giving any talks or participating in any conference, so it's like a new start for me in a way. So I'm happy, and still nervous. Today I'm going to present the top ten Istio security risks and mitigation strategies. You might have heard of the OWASP Top Ten for web applications. This idea came up because I found a need in the cloud native space for something specific to cloud native, because cloud native architectures have specific problems in specific environments.

So before I start, let me introduce myself. I'm Jose Carlos Chavez. I am originally from Peru but I am based in Barcelona. I am an open source enthusiast and I participate in some open source projects. I'm also an OWASP co-leader for the project called Coraza, which is a web application firewall for the zero trust times. I'm also a Zipkin core maintainer, so if you have heard of distributed tracing, I'm there as well. And I'm a loving father, as you might think.

Before I continue, I want to spend a couple of minutes talking about how all this started. Originally my idea was to build an awareness list of security risks for the cloud native, or rather the Istio, landscape, but in the process the CNCF Security TAG took notice and there is interest in this. So it started as a list for Istio, but in the end we are going to make a top 10 list for the cloud native ecosystem, including Kubernetes, Istio, and service mesh in general. So yeah, it's growing.

So without further ado, let me start. We will first talk about what security risks are, to understand what we are talking about. Risk is something that is hard to define, or at least hard to make concrete, right? You want to evaluate likelihood versus impact. Something that is unlikely to happen but has a huge impact is still a real risk.
Something that is likely to happen but whose impact is low is not too important a risk. So what we usually measure is how easy it is for an attacker to carry out an attack on our system, right? Is it easy? Is it just triggering a call? Is it doing a DDoS attack? What are the skills needed? It's not the same if I write a script that launches goroutines to send HTTP requests as if I have to get into a server, open an SSH tunnel from that server to another server, and then start the attack from there. And then how cheap is it? Because triggering an attack from my laptop is not the same as triggering an attack from a cluster of multiple computers working together. So that's something we should take into consideration when we evaluate the risks.

It's also important how sensitive the data or the systems affected by the attack are. If I have resiliency patterns, let's say I have a CDN in front, I don't care so much about whether they are attacking my website or not. But if I am doing a database call on every request, then I'm going to pay more attention to this. Then there is how valuable and sensitive the target data is. If someone is trying to attack my database or steal data, then I have to evaluate whether this data is worth it. Imagine an attacker is trying to steal my logs from storage. Do I have PII in the logs? If the answer is no and I have a backup, it's fine. Well, not fine, but it's not as important as if I had PII in my logs that I haven't redacted, because then they are going to get really valuable information and I have to disclose the breach. And then, how hard is it to recover from that attack? And then we have mitigation strategies, right?
Which is basically how we deal with risks. This is about being honest about what is going to happen. You can accept the risk: okay, this risk exists, and let me assume it in favor of something else, as a trade-off. You can avoid the risk, because it's not in my control to fix it, so I will just avoid it, maybe in a silly way like putting a firewall in front so that, theoretically, I don't have to care about what happens internally. You can control the risk: I cannot get rid of the risk, but I can control its effects. You can transfer the risk: as I said, if I put a CDN or something else in front of my application, then I basically transfer the risk to an upper layer. And then you can watch and monitor the risk, which is basically: I accept that this happens, and I will just monitor it in case it gets out of hand.

So the question now is, is Istio secure by default? Why do I care about security risks in Istio? Istio is a service mesh, but it's also software that includes a lot of security features. So should I really worry about it, or is Istio secure by default because it's in the cloud and backed by Google and major vendors, so it's safe to use? Well, security is definitely something that service mesh adopters look for. 79% of service mesh adopters, according to a CNCF survey, were looking for security. And 22% believed that the most important feature Istio was bringing to their applications was security, according to a Solo report. The problem is that whenever you adopt a service mesh, there are a lot of complexities that you have to deal with, because now you have a lot of authorization policies, you have pods, you have namespaces, and your architecture is completely different.
And dealing with that complexity usually makes you push security down to a second-class concern. Then you just react when something happens, which could be too late. A report from Red Hat said that 31% of the respondents admitted they had suffered a customer loss or a revenue loss due to a security incident in the past year. So attacks specifically targeting cloud native are on the rise and really affect your revenue.

So before we look at the risks in Istio specifically, let's see what exactly security means in a service mesh. Security is a combination of multiple protection mechanisms across multiple layers. Security is the opposite of life, right? In life less is more; in security more is more. The more you protect, the safer you are. It might not be enough, but still you make your best effort. And you have a lot of redundancy, because you need different fronts so that if one falls down you have another one to keep protecting your system.

Specifically, in an Istio deployment, we have the underlying infrastructure, which could be the cloud, an on-premise server, or a bare metal server. On top of that, you have the Kubernetes platform, which is the orchestrator of the pods, where you define the pods, the services, the applications. On top of that, you have the Istio service mesh, which is the orchestrator of the network: how services communicate with each other, which service is supposed to communicate with which, and all that. And then you have the application. So you have four layers to protect. That's why it's so complex to protect your deployments, whereas an attacker just needs to find one single vulnerability, one single entry point, to get into your system and start attacking you from there.
So we should also look at who the threat actors are for these attacks, because as soon as you identify the actors, you can start taking action. First you have the internal attacker, which is someone, probably without the highest privileges, trying to perform actions that go beyond their level of permission or authorization, pushing the boundaries of what actions they can perform. Then you have the contributors to Istio, people who are supposed to be nice and try to build the best product possible. But sometimes they could attempt to include malicious software that gets deployed in your system and then take advantage of it. They are probably the least likely, but they are a threat actor.

Then you have the contributors to third-party dependencies, which is more fun, because these are people who individually build libraries, not necessarily with the purpose of being used in Istio. I was reading about an attack where people were creating GitHub users whose names were one typo away from a well-known, well-respected library. The attack consisted of opening pull requests that maliciously used that library, with only one typo of difference, so no one would detect it. Then it gets merged into Istio, or not even into Istio but into a dependency of Istio. Then it gets deployed, and the attacker can take advantage of that, just because of a typo that nobody would suspect. So that's also interesting.

And then you have the untrusted users: people with the lowest level of privilege, probably outside users, who try to run attacks and perform actions in your system, trying to find a vulnerability that lets them perform actions or get to the host. So all these actors are expected to attack you, try to get into your system, and try to perform actions.
But if you think about who is truly the main threat actor when it comes to deploying a system as complex as Istio, which is of course designed to achieve complex tasks, you end up at the user, because misconfiguration is one of the biggest problems in security. This news is from yesterday. I was not surprised, because I was preparing this talk, but it's interesting how the human factor is still a big player in this game.

So without further introduction, let's go into the list. By the way, this list hasn't been ordered by occurrence or frequency. It's ordered by how I balanced the impact and the feasibility of each attack, based on talking to professionals, my experience at Tetrate, and all that. Later I will mention the survey we are conducting to get more data.

So first things first, let's talk about insecure communication. It poses a significant security threat because you can suffer different kinds of attacks when your communication is not encrypted or not secured. You can have on-path attacks, which is basically man in the middle: someone sits in the middle of a communication listening for information, and if it's not encrypted, they just get everything they want. You can have spoofing, where a server pretends to be something else; because the traffic is plain, they can listen to previous responses and then respond accordingly. Credential stuffing and brute force attacks, because nothing is encrypted, so you can try as much as you want and nothing stops you. Phishing, malicious API requests, etc.

You might be wondering why insecure communication is a problem, because Istio comes with mTLS by default and you have all these great features for securing communication. In security you have this dilemma between something being usable and something being safe to use, right?
And you have to strike a balance, because something could be really usable but really insecure, precisely because what makes it usable is that there is no barrier to usage. Or something could be very secure but unusable, because you need to deal with policies that you shouldn't have to deal with when you are getting started. So Istio's permissive security setting is useful because you can onboard legacy servers and you can try out the communication among components, but then traffic can be either plain text or encrypted. You don't enforce encrypted traffic and hence you don't enforce security. A strict security setting forces all traffic to be secure, and then your legacy systems won't be able to onboard into the mesh, creating a barrier to usability. This is very common, because although it's really easy to enable mTLS in Istio (it's just a PeerAuthentication policy that you deploy either in a namespace, or in the system namespace to enforce it mesh-wide), you will have problems onboarding old systems.

So one mitigation strategy is to enable mTLS for everything, when that is possible. If not, because you have legacy systems that you still need to onboard into the mesh, you can enable permissive mode but then use an authorization policy to restrict plain-text traffic, so you restrict who you accept traffic from.

Moving to the second one, we have unsafe authorization patterns. This comes from the old days when people were writing firewalls, and how the concepts of allow list and deny list appeared. When you are writing policies and rules, what you value the most is being deterministic. You want to know exactly what you are going to accept, rather than declaring what you are going to deny and then not knowing exactly what you are going to accept, right?
With a deny list there is no deterministic answer on what is possible and what is not. And this is a really common pattern. Even on GitHub, you can find a lot of policies that are more of a deny list than an allow list, so it's not explicit who you are accepting traffic from. One mitigation recommended by Istio is to use default deny for everything, because default deny means that whatever I have not declared I accept traffic from, I'm not accepting traffic from, so I'm safe in that sense. Then allow with positive matching, meaning that I declare exactly who I am accepting traffic from. And then deny with negative matching, which is basically the same thing but declared the opposite way; logically they are the same kind of condition. I am denying, but with a negative condition. Those give you deterministic and safe policies, and then debugging a security incident is much easier.

Moving on to the third, we have weak service account authorization. One of the most important principles in security is the principle of least privilege. A user should only be entitled to do the minimum amount of things that let them do their task. They don't need more permissions than that. We have many examples in Istio. The first and typical one is the init container. Whenever you have a new pod, that pod has an init container that sets up its network rules. That means it needs the NET_ADMIN capability to do so. And that poses a security risk, because whoever gets inside the pod will have that capability. Then you can bypass the outbound traffic policy by impersonating the Istio proxy user, right?
Some containers have this, let's say, convention or feature where you declare the user running the container by a user ID, and that ID will match the same ID on the host. That's why one recommended mitigation is to not run containers as root, because then an attacker can mount a folder that belongs to root on the host, which they don't have permission for, and from the container they will be able to access those files just because the user ID matches. The same happens with the Istio proxy. People can impersonate the Istio proxy by using its user ID and get the permissions of that user ID.

The third is the usage of first-party JWTs, which was very popular in the past. Not anymore, but there are still deployments that have this. Basically, whenever the pod needs to contact the control plane, it uses a JWT. The JWT is mounted into the pod, and the sidecar is going to use it, but any other container in the pod could eventually use it too, because the first-party JWT doesn't have an audience. One mitigation strategy is, of course, to use the new third-party JWTs, which restrict the usage of the JWT to a specific audience, so only the sidecar can use it. A mitigation for the init container would be to use the Istio CNI plugin, which removes the requirement of the privileged NET_ADMIN capability for the pod. It just happens at the Kubernetes level, so you don't have to worry about that or grant permissions for it.

This also exposes the fact that, although Istio can attempt to be as secure as possible, it depends on the underlying platform, which is Kubernetes. You cannot make Istio secure if you are not also looking at Kubernetes and how the policies are declared there. Later we will see that there are also things you cannot achieve with Istio but need to achieve with Kubernetes.
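The three pod-level concerns in this section (the NET_ADMIN capability, running as root, and reusing the Istio proxy user ID) can be illustrated with a small audit function. This is only a minimal sketch, not an Istio or Kubernetes tool: `audit_pod_spec` and the `example` pod are hypothetical, the pod spec is assumed to be available as a plain dictionary, and the value 1337 is the default UID of the `istio-proxy` sidecar, which the talk does not mention explicitly. It only looks at explicitly declared `runAsUser` values, so an image that defaults to root would not be caught.

```python
# Hypothetical helper for illustration; not part of Istio or Kubernetes tooling.
ISTIO_PROXY_UID = 1337  # default UID of the istio-proxy sidecar
ISTIO_CONTAINERS = {"istio-proxy", "istio-init"}


def audit_pod_spec(pod_spec: dict) -> list:
    """Return human-readable findings for one pod spec given as a plain dict."""
    findings = []
    pod_sc = pod_spec.get("securityContext") or {}
    for kind in ("initContainers", "containers"):
        for container in pod_spec.get(kind) or []:
            name = container.get("name", "<unnamed>")
            sc = container.get("securityContext") or {}
            added = (sc.get("capabilities") or {}).get("add") or []
            if "NET_ADMIN" in added:
                findings.append(f"{name}: adds NET_ADMIN capability")
            # A container-level runAsUser overrides the pod-level value.
            uid = sc.get("runAsUser", pod_sc.get("runAsUser"))
            if uid == 0:
                findings.append(f"{name}: runs as root (UID 0)")
            if uid == ISTIO_PROXY_UID and name not in ISTIO_CONTAINERS:
                findings.append(
                    f"{name}: runs with the Istio proxy UID {ISTIO_PROXY_UID}"
                )
    return findings


# Made-up pod: a classic init container plus an app container reusing UID 1337.
example = {
    "initContainers": [
        {
            "name": "istio-init",
            "securityContext": {
                "runAsUser": 0,
                "capabilities": {"add": ["NET_ADMIN", "NET_RAW"]},
            },
        },
    ],
    "containers": [
        {"name": "app", "securityContext": {"runAsUser": 1337}},
        {"name": "istio-proxy", "securityContext": {"runAsUser": 1337}},
    ],
}

for finding in audit_pod_spec(example):
    print(finding)
```

A check like this could run in CI against rendered manifests. It complements the mitigations named above (the Istio CNI plugin, third-party JWTs) and does not replace them.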
So going to the next one, we have the well-known broken object level authorization, BOLA. Istio provides authorization policies to perform checks on HTTP headers, on the path, on some Kubernetes metadata (where the request originated and which service it goes to), as well as validating JWTs. One of the bigger problems with this kind of authorization policy is that it cannot access the JWT fields. Putting it in context, Istio is not issuing the JWTs. You have a third-party service that creates the JWTs for you, and then you use them in Istio. The problem is that Istio cannot know every single possible field out there, which depends on the provider. So Istio only accesses the issuer, and providers usually put more metadata about the user into other fields of the JWT, which Istio doesn't have access to or doesn't understand. So think of granular policies: imagine you are in an HR system and you want to ensure that only the team manager can create employees in their team. That kind of granularity you cannot get with authorization policies, because you don't have all the concepts you might need.

Another problem is that policies get out of sync with the architecture. There is this interesting concept called Conway's Law, where your architecture is a reflection of your organization. This is something similar. In a microservices architecture, everything changes so often that you cannot keep track of every change unless policies and services are based on the same source of truth. You write a policy, and tomorrow the service changes, the permissions change, the path changes, the API changes, and you basically don't have a way to keep track of that: permissions, groups, users, privileges.
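To make the HR example concrete, here is a minimal sketch of the kind of per-request, object-level check it calls for: the decision depends both on who the caller is and on which object (which team) the request touches. Everything named here is an assumption for illustration: the `Request` type, the `/employees` path, and the `managed_teams` claim are hypothetical, and the JWT is assumed to have been verified already by whatever sits in front of this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical, simplified view of an incoming HTTP request."""
    method: str
    path: str
    claims: dict  # claims of an already-verified JWT
    body: dict    # parsed JSON payload


def can_create_employee(req: Request) -> bool:
    """Allow creation only if the caller manages the new employee's team."""
    if req.method != "POST" or req.path != "/employees":
        return False
    # "managed_teams" is a made-up custom claim; real providers name it differently.
    managed_teams = req.claims.get("managed_teams", [])
    target_team = req.body.get("team")
    return target_team is not None and target_team in managed_teams


manager_claims = {"sub": "alice", "managed_teams": ["payments"]}
allowed = can_create_employee(
    Request("POST", "/employees", manager_claims, {"name": "Bob", "team": "payments"})
)
denied = can_create_employee(
    Request("POST", "/employees", manager_claims, {"name": "Bob", "team": "search"})
)
print(allowed, denied)
```

The point of the sketch is that the rule compares a caller attribute with a field of the object in the request body, which a static match on paths and headers alone does not express.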
So, mitigations. First of all, all access decisions have to be based on the principle of least privilege. Everything should be decided per request, and these static authorization policies cannot resolve per request, because they are static and generic; decisions should be context-based and identity-based. One recommendation from NIST is to use rich policy models like NGAC, which is the NIST standard for permissions, or OPA, for example, which is well-known software. With these you can express policies in a more granular way, with more information from the JWTs, and you can make more complex assertions like: I belong to this group, so I am entitled to do certain things. One interesting thing about NGAC is that it's basically a graph. You can model the permissions, the users, and the groups as a graph, and then it's really easy to understand what happened. For example, when you are debugging an OPA policy, you struggle to understand: okay, I know this was rejected, but why exactly was it rejected? With NGAC, you can trace the graph and say, okay, this was rejected at this point.

Okay, supply chain vulnerabilities. This is probably the only risk we are mentioning that is not related to misconfiguration. Istio is an open source project based on many open source components and third-party code, like Envoy and Prometheus. On top of that, Istio runs on Kubernetes, so there are a lot of balls to juggle. So the risks in a typical Istio deployment are not only in the Istio components but also in the images that you are deploying. Some of the risks are image integrity: how do I know that the image I am using is exactly what was produced by the author? Image composition: every layer in the image has its own security risk, right?
Because you are downloading software, you are granting permissions, you are creating users. And known software vulnerabilities, of course, which you are not necessarily able to patch easily.

Some mitigations are image scanning: you can scan the images when you are building your application in CI. There are a lot of tools, like Snyk, with which you can scan your artifacts looking for vulnerabilities. Image composition analysis with a software bill of materials: the software bill of materials is like a receipt of the things that are in your image, so you can analyze them and assert whether the image is secure or not. Image signing, where you can check that the image is exactly what you expected it to be. You can have a curated registry, which is probably a composition of all these mitigation strategies, something like Artifactory where you can regularly run checks. One of the interesting things about these mitigation strategies, and why they are complementary and overlap, is that with your own image or artifact registry, although you are safe today, tomorrow you might not be, so you have to keep running the analysis.

And then a web application firewall, which is a way to protect things that are broken and that you cannot fix easily. Think, for example, of the Log4j vulnerability. When it happened, it took some time until people could actually fix it in the library, then some time until you could include the fix in your application, and then some time until you could deploy that into your mesh. So the way to quickly avoid that risk is to patch at the network level with an application firewall: all the query parameters that match this regex, I'm going to block. That way I avoid the risk until I can properly fix it.

Ingress traffic capture limitation. This is interesting because, although Envoy supports some sort of UDP traffic, the Istio proxy doesn't support UDP traffic.
So all the UDP traffic is going to bypass the proxy and go directly to the upstream. Also, inbound capture is disabled by default on some ports that are used by the sidecar. As a mitigation, to control the UDP traffic you need to use Kubernetes network policies for ingress, so you can restrict where the traffic you accept comes from. Network policies are like firewall rules. You can implement them at the namespace level. They can be pod-based, namespace-based, or IP-based, and you can have several policies and apply them all together.

Egress traffic capture limitation. This is similar. Istio cannot securely enforce that all traffic goes through the sidecar or the egress gateway. So technically you could get into a pod and call Google or whatever. One mitigation strategy is of course to use a network policy for egress. That's one. Another is egress restrictions, where you use the outbound traffic policy to say: okay, only those services that I know are in my registry. Another, more interesting one is to set up alerts on Linux system call events. You can use tools like Falco, which constantly monitor Linux system calls. The common attack is that I get into a pod, I install curl, and then I curl a URL to fetch something. Falco is monitoring the system calls and will alert you: okay, someone installed curl in a pod; someone did a curl to a URL in a pod. That way you can monitor this risk, because there's nothing you can do besides monitoring.

Okay, security observability and monitoring failures. Security observability and monitoring are critical components of any system, not only Istio. Not having security observability is a problem not only in terms of how you react to an attack but also in how you understand the attacks. One problem you have is the log level paradox.
When you don't need the log, the log is too verbose. When you need the log, the log is too quiet. That's usually what happens. Then there are insufficient or inadequate audit logs, where you don't have enough information to understand what happened. Let me quickly explain this scenario: you keep seeing, for example, failed attacks in your audit log and you say, oh, my firewall is working. But at some point you just stop seeing them and you say, oh, maybe the attacker gave up. But what actually happened is that the attacker has bypassed your firewall. So you need to understand what happened before, and that information will help you understand what is happening now that the attacker is no longer causing errors. And then there is lack of context, because sometimes we don't have enough context about what was happening, and when, for an audit event.

Mitigations: ensure access logs and error logs are emitted. You can enforce that in CI. You can even impose a JSON schema on your log output, so you can verify that it has the fields you expect and you can search, aggregate, and understand everything. Log data should be encoded and redacted correctly, because it could contain PII. There are plenty of tools that record the request payload and the response payload into the access logs and the audit logs, and those can contain PII. If there is a breach, that information is exposed. You should ensure that high-value transactions have an audit trail so that you can follow what happened. Correlation is very important. And establish an incident response and recovery plan. That's also important. You really need to know how you're going to react and have a plan in place.

And number nine, vulnerable Istio versions.
Well, as with any software, using an older version of Istio is a security risk because of known vulnerabilities: it can be susceptible to DDoS attacks, to CVEs, to programming bugs that let you bypass Istio policies, to cryptographic failures. One mitigation is, of course, to use a compliant Istio distribution, for example the Tetrate Istio Distribution, which is FIPS compliant. Track the CVE databases and the Istio security bulletins so you're aware of the latest threats and security problems in Istio. And then of course, use a web application firewall, which is merely avoiding the risk. The problem with the firewall is that it runs inside Istio. It could run before authorization, before authentication, but it is still part of Istio. So if the attack happens before that point, the firewall cannot help you.

And then finally, number 10: what is your security risk? As I said, we are going to conduct a security report and a survey about the most common security incidents that adopters of Istio and service mesh in general have, and with that we can curate the list so that it serves everyone. We are also looking for people willing to participate in the elaboration of this list. So come by the Security TAG. We're going to be happy to have you there.

And finally, conclusions. Most of the security risks are related to configuration mistakes, to humans. You should always prefer being explicit about your rules over capabilities showing up automatically. No single component or function will be sufficient to secure your system by itself. You need a combination of many things to defend the system across all layers of your infrastructure. And policies have to be defined on the assumption that the attacker is already inside the network. You can always keep defending the castle from this side of the river.
You also have to have the means to defend yourself once they are inside the castle, right? Thank you so much. Thanks for coming.