I'm the head of engineering at ControlPlane, we're a cloud native security consultancy. We're hiring at the moment, so if you want to work on cool problems with cool people, come and have a chat with us, and if you've got cool problems that you need solving, come and have a chat with us too. I'll hand back to James, but thank you for joining us this morning. Cheers, Rick, and like I said, we are at booth SU57 and we'll be around for the rest of the day, so please do come talk to us; we're all friendly and would love to chat. So what are we going to talk about today? This is a narrative talk; we're going to take you on a journey. First of all, we're going to start off by trying to understand together what zero trust means. It can be a bit of a buzzword, so we're going to understand it starting from a threat modelling perspective. Once we have developed our zero trust philosophy, we're going to use it to build a high-level architecture. Once we have a high-level architecture, we're going to create a detailed threat model, dive into the details and iterate on our controls. We are going to prototype this, so there is a public repo which you can clone and spin up yourselves. We'll give you the link at the end, but the message is really: prototype early to understand how the components of your system fit together, so you can think like an attacker.
This talk will focus on zero trust for workload-to-workload communications. It's worth mentioning that we're not going to touch on zero trust in the supply chain, because that's probably a whole suite of talks on its own. So, what is threat modelling then? We all threat model all the time, intuitively. When I went to a piano bar last night and the pianist started playing Rick Astley, I had to decide: do I join in or not? I thought the risk of me not having a voice today was too great, so I did a little threat model, I derived a control, which was "don't sing, you idiot", and everything was good. But if we do this in our personal lives, why would we not do it with our IT systems? Threat modelling systems is all about identifying and enumerating threats and vulnerabilities, formalising this in terms of a risk management framework, and escalating risks as part of that framework once they've been quantified. Threat modelling gives us loads of benefits: we identify security flaws early, we save time and money by doing this, and we understand complex risks which we couldn't understand otherwise. The key message we want to get across is that everyone should threat model; it's not just something for security teams. Really, the people who should be threat modelling are the people who have developed the features, the engineers and developers who understand the code best, because they can put themselves in the position of an attacker most easily. Threat modelling is an iterative process following a four-question framework. First of all, we ask: what are we building? We draw architecture diagrams and data flow diagrams; we understand at a high level and at a detailed level what our system looks like. Once we have this, we ask: what can go wrong? This is where we put ourselves in the position of an attacker, think up nefarious scenarios, and brainstorm these using techniques such as STRIDE and attack trees, which we'll do for our example system in a few slides' time.
Once we've done this, we need to devise mitigating controls; that's the next step. We need to minimise our residual risks. Finally, threat modelling is iterative: we need to be constantly asking, are we doing a good job? Are our controls effective? Do we have effective automated tests which tell us whether our controls are working? Let's start to make sense of zero trust via a very high-level threat model. This is a very simple diagram: we've got a user interacting with a workload, two workloads communicating, one workload persisting and interacting with stored data, and another workload calling cloud provider APIs. There are two keys to deriving zero trust principles from this diagram. The first is that we're not specifying anything about these workloads; they could be running anywhere. Workload one could be running on a Kubernetes EKS cluster, for example, and workload two could be running on a VM on-prem. The next key is to define our trust boundaries. I'm sure lots of you have heard people say things like "workload one and workload two are within our protected network, therefore of course they can communicate by default". We say no, that is not the way to think about things. Shrink your trust boundaries down as small as possible, and never trust, always verify. We're going to threat model using STRIDE. Spoofing: can I pretend to be someone or something which I am not? Tampering: can I tamper with information flows and compromise the system that way? Repudiation: can I do something and then say I didn't do it, Bart Simpson style? Information disclosure: can I exfiltrate data to parties who should not be privy to that data? Denial of service: can I take the system down? This is going to be very important when we're talking about workload identity: we need a highly available mechanism for distributing those identities. Finally, elevation of privilege: can I escalate my privileges and do something I should not be able to do?
Let's derive our high-level architectural principles from this very simple threat model. A really easy way to do this is to draw a table: threats down the left, high-level architectural controls on the right. We'll go through this line by line. Spoofing first of all: user impersonation. Like I said, we're focusing on workloads in this talk, so we won't go into too much of the cool stuff you can do in the zero trust space around users. However, we need to be cognisant of the controls there: user authentication and authorisation best practices, and establishing the provenance of our users. Workload spoofing, however, is where we come to the first real topic of the talk. If we want to do this well, we need the concept of a cryptographically verifiable workload identity. We're going to use a framework called SPIFFE, and the production-ready implementation of the SPIFFE workload identity framework called SPIRE. We'll show this in the prototype later on when Rick gives a demo. For workload spoofing as well, we want strong integrity protection on the client side and the server side of any communication, so we're going to use mTLS. We're going to make things easy for ourselves by using the Istio service mesh for the two workloads in our example prototype. mTLS obviously helps us with tampering risk as well, altering information in transit; maybe not so much the mutual part, but the encryption part, and I guess the integrity checks as well. An attacker could tamper with stored data, so this is where we need strong authorisation policies everywhere, and you can see another high-level control later on is policy as versioned code. Policies always link back to organisational business requirements, therefore we need to be able to keep track of who owns a policy and how a change to a policy is made, and the way to do this well is policy as versioned code. You can hear some good talks by Chris Nesbitt-Smith on this exact topic.
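To make the mTLS control concrete, here is a minimal sketch of the kind of Istio configuration that enforces mutual TLS for a whole namespace. The namespace name is hypothetical, not taken from the demo repo:

```yaml
# Reject any non-mTLS traffic to sidecar-injected workloads in demo-ns.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: demo-ns
spec:
  mtls:
    mode: STRICT
```

With STRICT mode, plaintext connections between workloads are refused, which speaks directly to the workload spoofing and tampering rows of the table.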
Repudiation risks: this is where we want to tie cryptographically strong identities back to the things workloads are doing, so we maintain audit logs. When it comes to policy, our demo is going to use a general-purpose policy engine, Open Policy Agent (OPA), so this control is all about maintaining decision logs and making sure that those decision logs can't be tampered with. Exfiltrating data would be an example of information disclosure, so again it comes back to policy, in this case egress controls and network policies. Preventing workloads from communicating could be a denial of service risk: if we're using SPIRE and I as an attacker can take the SPIRE server down, how are workloads going to get their identities? We need to build something highly available. Our demo later will not be a production-ready demo, so this would be further work we would have to do, and we would scope that work by doing more detailed threat models of the sorts of denial of service attacks that could take place. Finally, a compromised workload could pivot: workload one should not be able to hit a specific endpoint, and least-privilege authorisation policies, again, are the way to enact this. Now we've got our high-level principles, it's over to Rick to build an architecture. I'm going to try and satiate your desire to see diagrams and code and demos without distracting too much from the fact that this is a talk primarily about threat modelling. In our fictitious example here we've got workloads, so what kind of requirements have we got? We've got an external-facing service, and we want to make sure that we're using TLS to expose that service. We've got services communicating, and we want to use mutual TLS, so we need to be distributing keys, certificates and trust bundles so we can verify those.
Both of the workloads here are going to be accessing various AWS services, so they need temporary AWS credentials, and they're going to get those using web identity federation: we take a JWT and use STS to exchange it for temporary AWS credentials mapped to IAM roles that grant them access to the things they need to do. On the left there we've got a very simple service which is going to be making direct API calls. At the bottom there is SPIRE. I'm going to keep it high level, but you can think of SPIRE as your identity vending machine: you can get your X.509 identity documents or your JWT identity documents from SPIRE. The good thing about it is that the workloads don't need to know anything about their own identity; they can be a bit amnesiac. They go, "hey, who am I?" to SPIRE, and SPIRE says, "you are service X". So on the left there, the workload is going to get an X.509 certificate that it can use to expose the API over TLS, and it's going to get a JWT and exchange that for AWS credentials to get access to the S3 bucket, retrieve a file and then present that to the user. In that example, because we're using the client APIs, we're writing all of that ourselves. If we move over to the other side, we're moving to the service mesh type approach, so we can remove some of that coding complexity from the developers. We've got the Istio service mesh there, and Istio can retrieve its X.509 material from SPIRE using the Secret Discovery Service (SDS), so we've got that plugged in. To the right of Istio we've got an OPA sidecar. That's periodically getting a JWT from SPIRE and writing it to a shared volume, so that the OPA sidecar can get temporary AWS credentials to download the policy bundles from the S3 bucket and make the policy decisions. You can think of OPA in this case as basically a yes/no engine: can I do this? Yes or no. We could do some of the authorisation decisions just with Istio itself.
OPA gives us a bit more flexibility: it allows us to make external calls to bring in additional data and things like that. Keep that in the back of your mind, because we'll come back to it in a second and it's a bit important. So we've got Istio, we've got OPA, we've got SPIRE, we've got Kubernetes, and, just to make sure our talk got accepted, we're using as many CNCF tools as we can, so we're going to use Kyverno over there to inject the sidecars into our workloads. That's our example architecture. As James mentioned, what we want to do is prototype early. Why do we want to do that? It helps us understand how the technologies that we're using work, how they integrate with each other, and it helps us think about what can go wrong. As James mentioned, we're going to open source the repo and make it available for you to kick the tires and play around with this, and in order to do that we want it to be simple, reasonably cheap and fast to spin up. So instead of using a managed cloud provider Kubernetes cluster, we're just going to use a local kind cluster and some S3 buckets. The downside of that is the OIDC discovery piece: in order for AWS to verify the JWTs, it needs to be able to access the discovery document and the key set. Normally, if you're running a public Kubernetes cluster, you would expose an OIDC discovery service publicly. To make the demo work, we're just going to use an S3 bucket and ship the discovery document and the keys up there. So let's have a quick check on the example on the left-hand side and make sure it works. That is not a given: it's been working fine all week, but I came in this morning, went to deploy stuff, and at some point yesterday AWS made a change to the default policies on S3 buckets. You cannot make them public by default, and they've removed the access control lists.
So as I sat there thinking I'll just check this works before we go for the demo: error, error, error, all panic stations. So can you all cross your fingers for me? No, seriously. Okay, let's deploy example one. Remember, this is just a simple web service exposing an API over HTTPS; it's going to get a file from an S3 bucket and serve that back to users. There's the warning AWS sent out in December that at some point in April, probably the day before you do your presentation at KubeCon, they will break things for you. Looking good so far... and now it's just, why is this not working? No, there we go. That's a relief. Can I get a woo? Thank you very much. Okay, so that's working; I'll take a woo and a phew. Let's have a quick look at the code, because I know you want to see some. The SPIFFE client libraries have some useful stuff in them: we create an X509Source here and then use the TLS server config helpers. We drop that into the server there, run up the server, and it gets its X.509 certificates directly from SPIRE. We create a JWT source and then pass that into our handler, and this creates an AWS configuration. Then we've got a custom credentials provider here: we pass in the JWT source and exchange the JWT for temporary credentials to create an S3 client, which allows us to download the object from the bucket and serve it up to our caller. And we can see that the certificate has been issued by our SPIRE server. The resolution is awful, but there we go: in the URI SAN we can see it's using the spiffe:// scheme, you can see it's using the ControlPlane trust domain, and we've issued this workload identity to our S3 consumer. So now we have our detailed architecture and a prototype, we can draw data flow diagrams, and data flow diagrams are essential to help us threat model.
What we'll do is apply STRIDE, like we did for the high-level model, to each individual communication, but now we can look at these network communications in a great deal more detail. So much detail, in fact, that the diagram doesn't fit nicely on a slide, so don't pay too much attention to the detail; the GIF is just there to show our generic workflow. We've got a user at the top coming through ingress. If we follow the purple line down, with the SPIRE components in the middle, we see the user hitting a workload via the workload's Envoy proxy, and a decision being made by OPA; this also applies to the workload-to-workload communications. We see the workloads pulling policy from the policy management plane; in our case the policy is just hosted in an S3 bucket. And then the workload is calling out to a cloud provider API, as Rick has just shown us. So now we have our data flow diagram, let's start threat modelling the detail. A really good way to do this is by drawing attack trees. Kelly Shortridge has shown us how to do this; there's a really cool app called Deciduous where you can make these yourself. We're just going to use Graphviz, which is what it uses under the hood, and draw some fairly basic trees. We're going to use this key, where green nodes are ORs, blue nodes are ANDs, the grey ones are just terminal nodes with no logic underpinning them, and the red ones are out of scope, things like the zero trust supply chain topics we said were out of scope for us today. So let's build an example tree. We're going to walk through one path of the tree, focus on one risk, and show how we would build a control and iterate on it. Let's start with a bad outcome: an attack tree always starts with something bad that the attacker wants to do.
This will always come down to an attacker wanting to compromise one of the unholy trinity of cybersecurity properties: confidentiality, integrity or availability. Let's take a confidentiality example, where what our attacker wants to do is leak sensitive data. We're not specifying what this data is; they just want to exfiltrate sensitive data. You can leak data either by sniffing traffic in transit or by exfiltrating stored data: those are green nodes, so it's an OR. We can think of a number of ways to do this, but we're just going to focus on exfiltrating data. Like Rick said, you might want to use this sidecar model with OPA. Why wouldn't we just use Istio's standard authorisation policies, for example? Well, maybe we have really complex decisions to make, where we want to pull in external data and make authorisation decisions based on that external data. In that case, OPA will have access to a data bundle. So why don't we try exfiltrating data from the OPA container itself: make HTTP calls from the OPA container to an attacker-controlled service and exfiltrate data from our database that way? What would an attacker need to be able to carry this out? They would need an outbound path available to them, and (this is a blue node, so it's an AND) they would also need to be able to deliberately misconfigure the policy to spit out data in this way. What would they need to deliberately misconfigure the policy? They would either need to be able to change the code, meaning they would need to gain unauthorised access to the policy repo and somehow merge their code, and we can imagine quite a few controls in that space. So let's think about another way they could do it: they might just have misconfigured write access to the bundle storage location, which in this case is just an S3 bucket. Why would they have write access to this? They of course shouldn't.
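The path just walked can be sketched in Graphviz DOT, which is also what Deciduous renders with; the node labels below are ours, summarising this one branch rather than reproducing the full tree:

```dot
digraph exfil_path {
  node [shape=box];
  leak   [label="Leak sensitive data (attacker goal)"];
  exfil  [label="Exfiltrate via OPA container (OR branch)"];
  both   [label="AND"];
  path   [label="Outbound network path available"];
  policy [label="Deliberately misconfigure policy"];
  write  [label="Misconfigured write access\nto S3 bundle location"];
  leak -> exfil;
  exfil -> both;
  both -> path;
  both -> policy;
  policy -> write;
}
```

Rendering this with `dot -Tpng` gives one branch of the larger tree shown on the next slide.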
We have a control in place here, which is IAM and the principle of least privilege: only the policy pipeline should be able to write to that storage location. However, we never want to be one misconfiguration away from catastrophe. Maybe something is wrong with the pipeline, maybe something is going drastically wrong with our system, we need to make an emergency change, and someone gets emergency write access just to push an updated policy. We never want to be one misconfiguration away from disaster; that's the message. What we do with our tree is repeat this lots more times and draw out a full tree, and it would look something like this. We've walked one path of it; the details here aren't important, and you can look at this in our repo. We're not going to go through the other branches of the tree today; this is just to show you that the real tree will be way bigger than this. So what will we do about it? Now it's time to design our controls. We'll take a simple table-based approach like we did before: we write our more detailed confidentiality threats on the left now and our more detailed controls on the right. Again, let's not go through this line by line, because you can see it in our repo and in our PDF; we're just going to focus on the attack path we walked through before. If we look at the bottom threat on this table, overwrite policy bundle, like we said, we've got two key controls here: cloud provider RBAC with least privilege, and auditing as well, so automated audits and the like. Maybe that's a separate control, but here we have it as one. And then you see C14, policy bundle signing and verification. Instead of just relying on our IAM configuration being correct, let's add an additional control and say that our policy pipeline should have access to a signing key, and OPA should only be able to load bundles which have been signed by this key. So here we have defence in depth.
When building controls, some will need further architectural work. If we did look into the details on that bigger attack tree, we would have seen a threat which is compromising the SPIRE datastore. If I could do this, maybe I could add fake registration entries and convince one workload that I am a legitimate workload that should be able to talk to it, when in reality I'm a malicious workload. However, to threat model this in detail, we need to make design decisions about how SPIRE is going to access this stored data, so what we would do at this point is draw a lower-level attack tree and do a detailed threat model. Going back to our example attack path: a malicious internal actor exploiting misconfigured IAM to overwrite the policy bundle. We have a control, but we have not yet talked about the implementation. How are we going to do this? Are we going to use the default tools available to us, or is our risk profile a little more cautious, so that we want to be additionally secure and design a custom control? This is where I hand back to Rick. Okay, so this is the final architecture, and as we mentioned, we want to threat model, prototype, threat model again, and keep iterating. As James mentioned, we don't want people to be messing with the bundles, so where we start is: OPA allows us to sign these bundles. The bundle publishing pipeline can have access to a private key and sign the bundle, the OPA sidecars can have access to a public key and verify the signatures, and then we know those bundles are good and nobody has messed with their integrity. So I did that and said, right, I'm done now. James said, yeah, I'm not sure about that: now you've got a key distribution problem. You've got to be passing this private key around safely, you've got to make sure you're distributing the public key, and how often are you going to rotate those?
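On the default-tools side, OPA's built-in bundle signing is driven from its configuration file. An illustrative fragment follows, where the service name, key name, bundle path and key material are all assumptions, not values from the demo repo:

```yaml
# OPA refuses to activate a bundle whose .signatures.json does not
# verify against the configured public key.
services:
  bundle_store:
    url: https://example-bucket.s3.amazonaws.com
keys:
  bundle_signing_key:
    algorithm: RS256
    key: |
      -----BEGIN PUBLIC KEY-----
      ...public key PEM goes here...
      -----END PUBLIC KEY-----
bundles:
  authz:
    service: bundle_store
    resource: bundle.tar.gz
    signing:
      keyid: bundle_signing_key
```

This is exactly the point where the key distribution problem appears: the public key lands in every sidecar's configuration, and the private key has to be kept safe in the pipeline.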
So he points out that you can create your own custom signing implementation, and he goes, can you do that? I'm like, yeah, yeah, I can do that. (Narrator: he couldn't do that.) But where do you even start? The great thing is that in the OPA contrib repo there is an example: the framework is all there for everything you need to create your own custom build of OPA with your own signer and verifier plugged in, but they haven't done it for you. So what does the signature even look like? We have a look at the OPA docs. The signature is a JWT; that's handy, I know about those from the identity side of things. So it's just a JWT in an array of signatures: you've got the standard key ID, the algorithm used for the signing, and then the payload itself is just a list of files in the bundle, each with a digest and the algorithm used to compute that digest. I reckon it might be all right now; I can probably pull this one off, and we're just implementing interfaces. For signing, we create a new bundle signature with the algorithm and the key ID that come in via the configuration here, and the list of files provided for the bundle. We create the message to sign: that's the header, base64url encoded, then the payload, base64url encoded, joined with a dot. Then we sign that, base64url encode the signature, stick another dot in, bundle it all up together and chuck it out; job's a good'un. The verify side is a similar kind of thing: we've got the signature coming in, we parse it, figure out which key was used to sign it, verify it, and pass back the list of files. Once we've verified the signature and the list of files within it, we pass that back, and OPA knows everything's good. So that wasn't as hard as I thought. Let's simulate the bundle publishing pipeline.
You can see we're just using the alias for a KMS key there, signing the bundle and pushing it to S3. So basically we've introduced AWS KMS keys for the signing. The OPA sidecar already had a JWT and could exchange it for AWS credentials to get access to the bucket; now the bundle publishing pipeline needs access to the key for the Sign operation, and OPA itself needs access to the Verify operation. So let's drop back into the risky land of demos. I've just deployed a couple of workloads talking to each other; let's check that. Awesome. Istio is getting its X.509 certificates from SPIRE there, so we're looking reasonably good. We've got a couple of workloads, and we've got some OPA policies allowing access on some endpoints and not on others. We'll just send some requests through, and we can see that OPA is allowing and denying requests based on those policies. That leaves us in a reasonable space. So yes, the demo is going to be made available; bear in mind that there's still a little bit of work to do this morning based on AWS's fantastic changes at some point yesterday. Thank you for that. Yep, we do have a summary, but we can go straight to questions. Just to really quickly summarise: all of your systems can use threat modelling; threat landscapes change and technologies change, so keep threat modelling and do it iteratively. Obviously in today's world, zero trust grows more crucial day by day. We've shown you SPIRE, we've shown you Istio external authorisation, we've shown you OPA; we've shown you some tools that you can use, and we've shown how custom controls can be built as well. The last step in the threat modelling process, like I said, is: did we do a good job? We've done it today, but you can give us feedback by scanning the QR code. So thank you, and any questions? Okay, well, like we said, booth SU57 if you want to talk in more detail; we'll hang around for a bit now, so let us know if you want to chat.
Thanks for coming and thanks for your support.