We'll start off going through the existing lightweight threat model we did for Flux. Flux is a GitOps tool. Let's see if I can multitask. It's the original GitOps tool. We're talking about taking static configuration, typically YAML manifests for Kubernetes, and putting them into Git. That's a good start. Yes, GitOps: taking a declarative configuration, pushing it into Git, and having a deployment tool reconcile cluster state with the state of Git. So at a very high level, that's what GitOps is. Flux is a tool from Weaveworks that was the first tool to educate the market, if you like; the first instance of a tool that does this kind of thing. And, Marko, would you ping me the link? Thank you very much, you're way ahead of me as usual. And we will whiz through some introductory slides to talk about how this all works. So the goal with the lightweight threat modelling in general is to facilitate the advancement of software through the CNCF, through the graduation process in particular. When a project is submitted to the CNCF at the beginning of the process, if the project has security side effects (so it has cluster admin credentials, or it's used to enforce something with a guardrail, or it's an observability tool, again with heightened privileges), TAG Security is asked to have a look at it, to model what could go wrong. It is a wide, not necessarily deep, look to make sure that there are no foot guns and people have all their limbs at the end of the deployment process, and in order to achieve that, TAG Security has a detailed self-assessment.
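To make the GitOps idea just described a bit more concrete, here is a minimal sketch of a reconciliation loop. This is a hypothetical illustration in Python, not Flux's actual implementation (which is written in Go): desired state comes from Git, observed state from the cluster, and the controller computes only the changes needed to converge.

```python
# Hypothetical sketch of a GitOps reconciliation loop: desired state lives
# in Git; the controller repeatedly diffs it against observed cluster state
# and applies only the changes needed to converge the two.

def reconcile(desired: dict, observed: dict) -> dict:
    """Return the changes needed to move `observed` towards `desired`."""
    changes = {}
    for name, manifest in desired.items():
        if observed.get(name) != manifest:
            changes[name] = manifest   # create or update drifted objects
    for name in observed:
        if name not in desired:
            changes[name] = None       # prune objects removed from Git
    return changes

# Example: Git declares two Deployments; the cluster has drifted.
desired = {"app": {"replicas": 3}, "cache": {"replicas": 1}}
observed = {"app": {"replicas": 2}, "old-job": {"replicas": 1}}
print(reconcile(desired, observed))
# {'app': {'replicas': 3}, 'cache': {'replicas': 1}, 'old-job': None}
```

A real controller runs this loop continuously, so the cluster is eventually consistent with Git rather than instantaneously so.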
This self-assessment goes to the maintainers and asks them a specific set of questions: how do people use this, what do you think could go wrong? Then we collaborate with people in TAG Security working groups and people from the project to build out this self-assessment, which then goes back to the TOC and is used to help determine the maturity of the project. Sometimes a project can't graduate from one stage to the next without making remediations based upon the recommendations that come through from these things. This is quite a lengthy process. The assurance required to make sure that these projects are doing what's described on the tin, and nothing egregious, can take some time, because we are a fully volunteer-led organisation. The goal of this new lightweight threat modelling process is to reduce the latency for these projects. We're still doing the self-assessments because they're incredibly valuable, they're deep, and they give us enough of an understanding to actually advance and recommend these projects. The lightweight process is to quickly and efficiently get feedback to the project and help to guide from a high level, again looking for breadth and not depth, and ideally to do this within a single hour-long session. Those are lofty and potentially unachievable goals, but it is a moonshot. This began in TAG Security issue 903; let's pop that link in as well. The process is based on the Mozilla Rapid Risk Assessment framework. We had some working sessions where collaborators from TAG Security got together, balancing the different types of threat modelling approaches and the way ControlPlane does this, and in fact, if you look at the issue details, a lot of people contributed some really useful different perspectives at light speed. We democratised this. We wanted an approach that would attract contributors. We didn't want to do something esoteric and unusual that nobody would then want to be involved with.
We looked at various different approaches. We have, for example, the STRIDE approach, which comes out of Microsoft; this is Adam Shostack's typical approach. We also looked at the Mozilla Rapid Risk Assessment. There are other permutations; there's the PASTA way of threat modelling. Ultimately, all of these things have the same goal: understand the system, ask a set of pertinent questions related to its security, identify a path through remediation, and then do the whole thing again, ad infinitum, ad nauseam in some cases, and identify when we want to repeat that process. Is it to do with a graduation event, in this case? Is it to do with a new feature release? Is it to do with a dependency tree shake-up? Whatever it is, it can be arbitrarily decided. So, that's the 50,000-foot view: a lightweight framework for sandbox and incubation. We looked at all the different approaches that people suggested on the issue, and Trail of Bits, a security company out of New York who did the first Kubernetes code-level security review, put together this questionnaire format based on the Mozilla Rapid Risk Assessment. It didn't entirely fit our requirements, so we modulated it very slightly, but broadly it was excellent and it was fantastic to build on. What is the Rapid Risk Assessment framework? It comes out of Mozilla, and it's used to build Firefox. Specifically: I'm a Rust developer doing my thing in Firefox, and I want to ship a new feature. I fill out a Rapid Risk Assessment, and I go to security and I say, this is what I want to do, can you help me threat model it? So, for example, let's say we're going to ship secure tabs. Mozilla shipped cookie isolation for tabs a long time ago now, I guess. There are privacy concerns. There are potentially information leakage concerns. There are functionality concerns: if we just block access to everything, then while we can't be tracked, we also can't authenticate anywhere or persist a session.
And so, the assessment asks: what's the feature, what are the side effects? It ultimately democratises the process, again, of understanding what could go wrong and how we will fix it. The risk there, of course, is that it is specific to feature delivery in a very specific context, which is a browser, which is complexity in itself. So, this is the foundation that we've built on. Again, the modulations and changes are to focus on a component of a cloud-native system rather than a browser. That, again, is the 50,000-foot view in toto, and there we go, that's the slide I perhaps should have advanced to before waffling on it. The rapidity is such that we aim to do it very quickly. Keeping things at a high level gives us enough of a macro view to fix the foot guns without, for example, performing a code review. So, there is no code review in this process; it is concise and readable, easy to update. As software is constantly under development, so should the threat model be updated in tandem with it. The risk impact and data dictionary are helpful to a new maintainer, who can come and say, oh, it's been threat modelled; how do I understand how things are classified and what the sensitivities are? And yes, then it finishes with the recommendation. So, I think that means we're ready to have a look at what we did. We started with Flux, as they say. I can't actually read the screen from here; I think it's probably that link that I've mis-pasted. So, Flux is a deployment tool for Kubernetes. This concept of GitOps is: take a static state defined in Git (that's not the right one, apologies), and use something with cluster admin credentials or some highly privileged deployment credentials. I'm really struggling to click the right link, I'm so sorry. That is the one that we want. I'm struggling to see it. Do you know what, I'm just going to move it to this screen. There we go. Okay. Thanks for your help, Marko.
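The risk impact and data dictionary mentioned above can be pictured as a simple lookup table that a new maintainer consults. This is a hypothetical sketch with invented entries, not the actual Flux data dictionary:

```python
# Hypothetical data dictionary: each data item the system handles gets a
# classification and its CIA (confidentiality/integrity/availability)
# requirements, so anyone can look up the sensitivity of anything at a
# glance. The entries below are invented for illustration only.

DATA_DICTIONARY = {
    "git-pull-token":     {"classification": "secret",   "cia": ("C", "I")},
    "rendered-manifests": {"classification": "internal", "cia": ("I",)},
    "prometheus-metrics": {"classification": "public",   "cia": ()},
}

def sensitivity(item: str) -> str:
    """Render one dictionary entry as a human-readable summary line."""
    entry = DATA_DICTIONARY[item]
    needs = ", ".join(entry["cia"]) or "none"
    return f"{item}: {entry['classification']} (CIA requirements: {needs})"

print(sensitivity("git-pull-token"))
# git-pull-token: secret (CIA requirements: C, I)
```

The point is not the code but the shape: a shared, easily updated taxonomy that keeps classification decisions explicit as the threat model evolves alongside the software.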
Much appreciated. All right. So, just to refocus: Flux allows us to deploy stuff into a cluster in an automated fashion. Instead of a developer having cluster admin credentials and doing a kubectl deploy, we've moved to a pull model. It's an inversion of control. It takes credentials out of a build server and instead gives Git pull credentials to something inside a cluster. So, the security boundary ceases to be write access to production; instead, it's write access to a Git repo. Then we use branch protection. I wrote a white paper called Hardening Git for GitOps for Weaveworks a number of years ago, and the model is incredible; I appreciate it a lot. So, Flux was the first project that we chose to assess in this way. I'll just pull this over here once more. So, what do we do in order to run through this? This threat modelling Rapid Risk Assessment is basically a script. We've scripted standardised threat modelling notes and an introduction, and then we sit with the team on an hour-long call, run through the documentation, and try to understand exactly what their intent and intention for the project is. So, in this case, and again I feel like I should probably just maximise this, okay: I've got a project, it's a GitOps deployment tool. Project data classification: why is it critical? Because it holds cluster admin keys. And what's the worst that could happen? Maybe someone gets remote code execution into the container; they have cluster admin keys, and then they can enumerate secrets, deploy whatever they want, run privileged containers, and pop the nodes, all that good stuff. Okay. So, first of all, we introduce the project with its connections and protocols. These are the threat modelling notes up here.
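Since the session really is "basically a script", it can be sketched as an ordered list of heading/question pairs that get walked through with the maintainers. This is a hypothetical illustration; the section names are loosely based on the ones mentioned in the talk, not the exact TAG Security template:

```python
# Hypothetical sketch of the "scripted" Rapid Risk Assessment session: an
# ordered list of heading/question pairs, walked through live with the
# maintainers, capturing answers (or gaps to follow up on) in place.

RRA_SCRIPT = [
    ("Overview", "What is the project and how do people use it?"),
    ("Data classification", "Why is the project critical? What does it hold?"),
    ("Worst case", "What is the worst thing that could happen?"),
    ("Subcomponents", "Are there subcomponents or shared boundaries?"),
    ("Communications", "What protocols does it use, and are they encrypted?"),
]

def run_session(answers: dict) -> list:
    """Pair each scripted question with the team's answer, flagging gaps."""
    notes = []
    for section, question in RRA_SCRIPT:
        answer = answers.get(section, "UNANSWERED - follow up after the session")
        notes.append(f"{section}: {question} -> {answer}")
    return notes

notes = run_session({"Data classification": "Critical: holds cluster-admin keys"})
print(notes[1])
# Data classification: Why is the project critical? What does it hold? -> Critical: holds cluster-admin keys
```

Standardising the script is what makes the hour achievable: the facilitator spends the time on answers, not on deciding what to ask.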
So, this is the first stage of any threat modelling process: gathering as much metadata, information and documentation about the project as possible, so that everybody has a common security lexicon and knows how to describe the various components of the system. We already had the genesis of this in the self-assessment, and so what we can see here, just about, is Flux's runtime behaviour and model. At the top are the internal and external interfaces: we have a bucket, so an S3 bucket; a Git repository; communication services; a container registry; and external event receivers. This is a data flow diagram, and the reason there are red boxes around things is that those are trust boundaries. Trust boundaries are something in threat modelling that are slightly intangible. First of all, what's trust? It means that we like something sufficiently to give it some level of (I'm going to make a self-referential, recursive definition) trust. So, when data is travelling between two different parts of a system, a trust boundary might be a step up, from a lower to a higher data classification, or it could be the reciprocal step down. It could be the fact that two components are written by the same team: if somebody ships a malicious update into one, well, it's the same team anyway, so they're probably in the same trust boundary. It could be an external interface from outside of the system: even though the data is still at the same classification, we're passing through uncharted waters, almost. So defining those trust boundaries is a little bit ethereal and slightly subjective, but it's a required first step, because again this is a very abstract process and it gives us a common way of discussing and describing things. So, in this case it's relatively self-evident. The external surface there is where Flux asks: where is this Git config stored?
Probably in Git; sorry, the YAML config, probably in Git, but of course it could also be in a bucket, et cetera, and the other things we have in there. Then there is the set of Flux controllers in the middle, just about readable. Those controllers are what reconcile the state of the repository, or the YAML, with the observed state of the cluster. So it's a reconciliation loop that is constantly, if I can remember, observe, act; I feel like there are three parts to it. There's definitely observation and acting, and I can't actually remember what the third one is, but nevertheless it's observing the state and reconciling the expected state with the observed state. It's eventually consistent, of course, because the cluster is a distributed system which is constantly changing. So at a very high level we can see there is obviously a lot of complexity to Flux, so moving through this in an hour is inherently its own challenge. So what do we do next? How do people use this project? Okay, we're deploying with GitOps to Kubernetes; it's the only component making changes to a cluster. This is a super important GitOps principle: it helps us to reason about the security of the system, because there's a limited number of actors who can do anything to mutate the state. And there is some new functionality in Flux, which is this experimental Terraform controller. The important part here is that we've identified something in the threat modelling process that potentially relates to the threat model, but it's out of scope. Scoping, from a threat modelling perspective, is how we attain an outcome in a reasonable amount of time. If we're going to threat model Flux, do we also threat model the container runtime? Do we threat model the node that Kubernetes is running on? Do we threat model the data centre that the node runs in, et cetera, et cetera, ad infinitum? So in order to actually achieve an outcome from a threat modelling process, we de-scope as much as possible, and that laser focus gets us to a conclusion that we can do
something with, as opposed to just "threats, threats everywhere" in a Buzz Lightyear style. So in this case, the experimental Terraform controller wasn't anything that was in the headspace of the team; it's extraneous, to some extent, to the core functionality, so we de-scoped it. Easy decision. So, how does the project work? At this point you can see these are just really simple heading, question and answer: it's Socratic threat modelling. Ooh, maybe that's what we should call it. How does the project work? Well, we're looking to get that state, pull it from somewhere, and apply it into the cluster. Are there any subcomponents or shared boundaries? This yielded some interesting results. We have these controllers; Flux is actually at v2 now. The v1 of Flux I'd used extensively, and I discovered a whole lot of things about the developments and changes to Flux that I had no idea about. It can pull in Kustomize code, so it can run ad hoc transforms as Kustomize does. It can do Helm, so it can just pull Helm charts, render and apply them in process. Helm also has a load of deployment hooks, which are extensibility points and potentially places that malicious software could run. The source controller, of course, is where the source is being pulled from; we saw those external interfaces. And there are the image controllers: there is also this concept of Flux deploying things and writing a Git tag back to the repository that it's pulling the code from, and that tag works as a pointer, sort of a deployment offset. So in case of disaster recovery, if the whole cluster is a smouldering hole in the ground, it can just redeploy from the same repo, and it's a reasonably deterministic thing. That needs write credentials back to the source repository, which is otherwise read-only. The security side effects of these small functionality decisions are what we're hoping to elucidate with this process. So, what communications protocols does it use? At this point it's worth deviating slightly to talk about what we do at control
plane, which is: we perform these threat models and assessments for every activity we engage in. We do it for all our customers. It is the way that we rationalise, quantify and justify the security controls that are going to cost money, and that people would like some assurance actually work and are optimal for some degree of requirements. The way we do this is with a large matrix, and in fact, Marko, would you mind just pulling up, or just linking me through to, the threat modelling training? I should have thought about this ahead of time, I apologise. We do this in a data classification matrix: we look at all these flows and we classify them, and again it's a common taxonomy and language that makes it easier to visualise security, let's say. What we've done here, for the sake of rapidity, is just list out a load of bullet points, but there are other, less quick ways of achieving this as well. I will also mention that ControlPlane have a free threat modelling Kubernetes course that we do for O'Reilly, so if you're on the O'Reilly platform, I think we're streaming the next one at some point in the next couple of months; we will make some noise when we do, and all of the collateral that goes with it is open source and free as well. So for the communication protocol taxonomy, we care about what these specific parts of the system are doing and how they communicate. Why does that matter?
Because there are different potential problems with each style of communication. If we're just using TCP, well, hopefully it's actually bundling some sort of encryption around the thing it's doing. Is it plaintext, is it TLS? Are we using insecure ciphers? Are we using old SSH versions? Et cetera, et cetera. So that's what we're looking to determine with this reasonably exhaustive classification exercise, just whizzing through those: source, Helm, Kustomize, notification. The point of doing this, again, is to build this kind of abstract problem space in the minds of the people undertaking the exercise, so that when it comes to what the potential problems could be, we're thinking: actually, there is a metrics endpoint, I wonder if that's authenticated; if it's Prometheus, is someone going to check that it's not leaking some sensitive information, people dumping environment variables into those things? That kind of nefarious, attacker-driven concept. Yes, I waffle a bit more here, and then an excellent point was made. The question is: how can we be confident in the completeness of the model when balancing that with the speed and compression of one hour's time? Ultimately, that is the risk balance; we can't, is the answer. The compromise is such that, and it's an apples-to-oranges, unfair comparison, but for the point of demonstration: TAG Security are also going through an Argo threat model assessment. Argo is decomposed into four different parts, we've got four different work streams, it's all volunteer-led, and it's taking us a lot longer than we'd like. So we are performing that exhaustive, deep introspection as well, but this is really meant to say: what's the most effective and lightweight thing, just referring back to the title. We are intentionally balancing that risk. Specifically, to your point of how do we know that these are correct: the project did fill in
the self-assessment beforehand, and they shared the documentation with us in advance, so I'm maybe glossing over the fact that there was preparatory work done by the people who attended. We also had highly skilled maintainers who have been on the project for a long time, so they were just bang, bang, bang, bang. But as we go through this, it is a question-and-answer exercise, so there were points where things were missed, and we said, oh, but what about this, and then it was expounded on slightly. So there is definitely fallibility introduced by the process. It's funny that you say that; I think as security people, the buck stops with security. If you make a mistake, it might have PII-related implications. Yeah, that's a really good point. This is done, I think, in the spirit of a blameless pre-mortem idea. One of the things that I try to prefix to any of the hacking demos or those kinds of things that I do is: thank you to all the maintainers who've put in all the effort to give us all this incredible tooling. It's a really good point; there can be emotion involved for the people involved, and actually being the subject of one of these things can feel a bit like an audit sometimes. But I suppose I associate an audit with more of a sort of looming dragon of compliance, whereas this is: we're trying to help, guys. That's really interesting; I'm taking notes for things to blog about. Yes, okay. So, running through, and hopefully getting as close to a reasonable degree of certainty as we can with these communications protocols, we then start getting into the meat of the problem. Thinking about, and actually decomposing things down to, Linux namespaces is quite a useful way for me to think about this. So, what's in the
process namespace? Well, that's the runtime, that's behaviour. What's in the mount namespace? Well, maybe the secrets are on the disk. Those secrets might cross over if some of the mounts are actually pulled into the application for usage; maybe they're encrypted on disk, and so there will be decryption keys somewhere else that we need to use. We've got the network namespace: what are the inputs and outputs coming into this, back to the communications? So, I haven't scrolled down; the question is, where does the application store data? We're thinking about the data that we've classified previously: when it exists on disk, what's its form, and is it breachable in terms of the CIA triad? Does it need to be confidential, does it need to maintain its integrity, does it need to be available all the time? Sometimes the answer is no, strangely enough. Do we need certain types of data all the time? Well, not if we've got a fallback. Metrics, for example: you can just fire them over UDP, and if you've got a saturated network connection, you probably want the metrics to be dropped before application traffic, that kind of thing. In this case, what does Flux actually do? Well, it doesn't persist anything, which is useful, so immediately we can de-scope some of those confidentiality pieces, but some data is stored locally. Where would I tamper with the data? At this point we're into that nefarious attacker mindset, and this really is the value of this process, because as developers we like the happy-path stuff: we like to get the functionality out, and security can be seen as a roadblock in those terms. So, saying to a maintainer: if I was going to try and cause trouble for your users, what would I do? Well, man-in-the-middle on the source controller, because it doesn't serve TLS. We know this because we have seen here that we've got an inbound connection serving artifacts on HTTP. So that process has triggered some of the initial thinking to
start building out the actual model. The team were aware of this already; there's an RFC in flight, but we look to capture everything so that we have as holistic a view, I hesitate to say the word, but as wide and expansive a view as possible within the constraints of the scoping that we've chosen. Excuse me, I'm just going up and down. Anything else here of note? That Kubernetes CVE is a won't-fix, so again we raise it here: actually, do we want to de-scope it? If it's not going to get fixed upstream, it seems a bit redundant; we could generate a whole document of bugs-or-features from that kind of perspective, but we do have a suggested control there. And again, another man-in-the-middle. The CRD explosion here is really quite significant, but we're not storing sensitive information. Then we've got a final question about hard multi-tenancy: if we've got a Flux cluster controlling multiple other clusters, well, then potentially the threats there are around organisational responsibilities and who's actually managing those things. So we are just whizzing through; let's see what else we've got. Sensitive data and where it's stored: we've got credentials here, and if those credentials are exfiltrated, we've got escalations, potentially not only for things within the cluster. Any workload identity integrations, where we can exchange those credentials for cloud credentials of some description, may lead to what ultimately becomes an account compromise or an organisational compromise, just from someone breaking out of a container. These routes to escalation are often possible. They do have, as we can see, some useful secrets integrations with SOPS, et cetera. Okay, data storage with the CRDs: do we have encryption on the things that we value? No. So, as a recommendation, of course, we would ask the project to seriously consider encrypting things, because it protects against somebody compromising a node or sniffing network traffic,
and reduces reliance on an encrypted CNI of some description. Okay, then we get into the data dictionary, and I think we might be pressed for time if I don't go slightly faster. So yes: the data, the classification and any thoughts. At this point we've gathered enough information to start addressing some of these high-level risks; again, because of the speed at which we are going, the depth of these is not guaranteed. These are scripted areas of controls that we're interested in, based on what the audit working group selected. When we say controls, we mean a logical section of an application or system that handles a security requirement; a system may have authorisation requirements, et cetera. We then detail the control families that we're interested in, and these are very similar to what the cloud-native controls from the Trail of Bits audit for Kubernetes look like, but with a focus on the usage of the system and not the code that authors it. This really is the instructive difference: most traditional threat modelling is feature-based, looks at code, and considers how things are built; what we look to achieve specifically with this framework, and how ControlPlane does this slightly differently, is the end-user usage, runtime behaviour, and integrations and infrastructure usage of the application. So we're looking to secure something that's more in the realms of a cloud misconfiguration than a kernel-busting CVE. It gives a slightly different kind of value, and they're both entirely useful and complementary processes, but it's worth noting that they're both called threat modelling. Hi, Nicola. Okay, so deployment architecture: what does our runtime look like? Sorry, I should probably just close Signal for the purposes of demonstration. There we go. How do we deploy these things at runtime? Why is that important? Because a vulnerable application can be deployed with a locked-down security configuration that makes it safe to run. You may not want to do that by default, but
if the choice is between running a trading system in production with a vulnerability or not trading, then probably the answer is obvious. We'll still have to move through this quickly: networking, cryptography, multi-tenancy isolation, secrets management, storage, authN and authZ, audit logging and security tests, and we can skip entire groups of families if we so desire. So for each control we want to know: what does the project do, and what's the data classification? Then we get into the attacker mindset: what does the attacker want to do here, how would they attack it, are there specific mitigations (for example, can we just stick some sort of firewall or inspection in between), are there availability concerns, and have there been similar vulnerabilities in the past? It is a case of history repeating; it's deeply instructive to go back and look at how these systems have been attacked previously, and that helps to derive controls. I think we've probably almost hit time, so I will just whizz through the end of this into our threat scenarios. These are the things that we found in the assessments, the theoretical threats. What about deploying into the flux-system namespace; what would happen if someone gained access to those credentials? What about imagePullPolicy? This is an old-school issue: basically, can we retrieve other tenants' images from the local container cache, via Docker or whatever it used to be? And what about shared secure multi-tenancy? Multi-tenancy in Kubernetes is really difficult; the DNS, for example, is just designed to show you everything: you can see all the services, environment variables pointing to everything else that's deployed as a service in the cluster. Anyway, what about network policies? Okay, we get into the multi-tenancy lockdown, and it goes a little bit deep, but again, kind of to the point, these are things that the security engineers ideated in an hour, and it's just bang: what do you think about this? Yeah, that's potentially a problem; it
gets captured, and we move on to the next. The important thing, generally, when threat modelling is not to conflate the threats and the controls in the same breath, because the STRIDE process delineates "what are we building", "what could go wrong" and "what are we going to do about it" as three distinct parts of the process. This is because it can be difficult to keep the high-level view while also thinking, oh, but one of my favourite controls is X, for example. So we throw that entire approach out with the Rapid Risk Assessment: everything really is done in such an intense period of time that controls are suggested in flight. So, anything else? This is all public, of course, so everyone is welcome to jump in and have a read in more detail. And we went through and recommended our controls; as I say, these conflate a little bit here in places. So, for example, here we're going straight into recommending a specific admission controller, where we wouldn't typically do that, and as you can see we jammed some of those in further above. Then we go through and actually recommend the controls that we think the project should apply, to remediate the DNS poisoning (that's the won't-fix, but we do have a mitigation for it), and write out official recommendations. And that was an hour's work. Huge thanks to everyone who was involved with it; it provided enough value to corroborate some of the security decisions the project had already made and to generate a few extra issues. I really went on for much longer than I intended going through that, so rather than dragging things out, we will draw it to a close. Any questions or comments before we do? So, the question is: would specific questions on container security context or best practices be useful? Potentially; I would put those under the deployment architecture, pod and namespace configuration section, probably, as you can see on the left-hand side. Yes, we do have a section for
it; there, they said everything is fine, we don't think there are any problems with the deployment. So I guess, yes, potentially. The quality of the output is really based upon who turns up to contribute at the time, so if there is somebody who has specific kernel API level knowledge, they are going to be asking the questions that lead to those answers. So yes, potentially, but I think balancing that, yeah, I hope that is all right, and I am interested in what specific questions you might add there as well. Yes, so I guess it would potentially be a question in the previous section; yeah, I see what you mean now, so to actually put it into sensitive data, that's a great shout. I think that is worth taking back to the group and suggesting, thank you. Any more? Okay, thank you very much for your attention, everybody. That was a light-speed, lightweight version.