OK. Hello, everyone. I'm sorry if I sound a bit out of breath. We just did a talk six minutes ago and had to run from quite far away, so apologies. Welcome to our talk on the Argo CD end user threat model. My name is James Callahan. I'm a principal consultant at ControlPlane, and I'm here with my esteemed colleague, Torrin. ControlPlane is a cloud-native security consultancy established in 2017. We are specialists in Kubernetes, cloud and container security, and we do this for highly regulated organisations. We're over 50 people now across the UK, northern Europe and Australia, and we are still expanding. So what will we talk about today? First of all, we'll start off with a quick introduction to what threat modelling is and why it's a good idea to do it. We always think about our scope when we threat model, and it's important to say that this threat model report was commissioned by the Linux Foundation, and it's not an audit of the Argo CD project itself. It's not a red team engagement or anything like that. What it is focused on is end users and people deploying Argo CD in production. "What are we building?" is the first of four fundamental threat modelling questions that we must ask, so Torrin will later give you an architecture overview of our sample multi-tenant implementation. Data, of course, is the thing that we care about protecting. It's the heart of threat modelling, so we need to understand our data and build a data dictionary so we can understand the impact of a compromise of any of those data elements. How do we model it? We do this through data flow diagrams. Then we start to catastrophise, ask the question "what can go wrong?" and codify these catastrophisations in attack trees. Finally, we ask the question "what are we going to do about it?". This is where we devise our recommendations. These are the end user aspects of what you can do to harden our example infrastructure.
Finally, just to note that threat modelling is iterative. We always have to ask ourselves: are we doing a good job? Is the threat environment changing? Are our assumptions sound, and are our controls working? So, a threat modelling overview. Threat modelling is a systematic approach we can take to our IT systems, where we codify what can go wrong and devise mitigations which end up reducing risk. Threats can lead to risks, risks can be quantified, and we will always be left with some element of residual risk. It's the purpose of threat modelling to reduce these residual risks to a level which is acceptable in our risk management framework and in line with our risk appetite as an organisation. It aids in finding and addressing security risks based on attack chains and attack trees, which Torrin will show some good examples of later on, and the controls, most importantly, will be quantifiable. We can understand the impacts of implementing or not implementing certain controls, communicate this to assurance and compliance teams, and ensure that we are within that risk appetite. Finally, and most importantly, we can enhance developer retention, because people don't spend lots of time re-implementing the same fixes and fixing things which are broken, basically. So why threat model? Because we identify security flaws early on. We should threat model as early as possible, but if you haven't started at the start, now is the best time. We can save time and money. Again, we avoid complex redesigns and lots of new tickets to fix things which are broken. We identify data flows and complex risks which we might otherwise miss. And then finally, yes, we focus security requirements and enhance developer retention. The threat modelling process: we touched on it in the first slide, but we follow a simple four step process espoused by Adam Shostack and co.
The first question, "what are we building?", involves understanding our data, our use cases, our adversaries and our architecture. Artifacts that can help us here are data flow diagrams, system architecture diagrams and information flow matrices. All of these things can be put on the table, looking at the system from different viewpoints, and they often elucidate threats which might otherwise be missed. "What can go wrong?" is the next step. This is where our little world is exploding. We are catastrophising. We are using threat intelligence sources as inspiration. We can use brainstorming techniques such as STRIDE to make sure we're not missing any threats, and we can put all of this together in attack trees to see how an attacker would get from some level of access or some credentials to actually doing something which matters and results in data being compromised. "What will we do about it?" is the next question, and this is where we devise our controls. Everything, as I said before, has to link back to a risk management strategy, and we have to devise controls which mitigate our risks within the context of this strategy. Finally, "did we do a good job?" This is the iterative aspect. We can do things like map controls to attack trees to see if any branches are not covered. We can write automated tests which test not the implementation of a control but its actual, genuine effectiveness in a representative system. Finally, we can revisit our threat model regularly, for example if the threat environment or threat landscape changes. There are many reasons why we might want to revisit our threat model. In terms of outputs, threats are the key. Threats can be codified in attack paths. Attack paths lead to attack trees, and attack trees are the means for devising our controls. Finally, this is an example of an attack tree, but we won't go through this one in detail. Torrin will go through a much more detailed example in a few slides' time.
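As a rough illustration of mapping controls to attack trees to find uncovered branches, here is a minimal Python sketch. It is not taken from the report or from any tool: the `Node` class, the function names, and the example nodes and controls are all hypothetical, loosely based on the attack chains described in this talk. A path counts as covered if at least one step on it has a control attached.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One step in an attack tree (hypothetical structure for illustration)."""
    name: str
    children: list["Node"] = field(default_factory=list)
    controls: list[str] = field(default_factory=list)  # controls that block this step


def attack_paths(node, prefix=()):
    """Yield every path from the starting node down to a leaf (attacker goal)."""
    path = prefix + (node,)
    if not node.children:
        yield path
        return
    for child in node.children:
        yield from attack_paths(child, path)


def uncovered_paths(root):
    """Return the names along each path where no step has any control."""
    return [
        [step.name for step in path]
        for path in attack_paths(root)
        if not any(step.controls for step in path)
    ]


# Example tree: node names and controls are assumptions made for this sketch.
reality = Node("Reality", [
    Node("Phish developer SSH credentials", [
        Node("Reach ops cluster via bastion host", [
            Node("Read Argo CD service account token", [
                Node("Deploy malicious workload to tenant cluster"),
            ]),
        ]),
    ]),
    Node("Obtain initial admin password",
         controls=["Disable local admin user"],
         children=[
             Node("Abuse unrestricted default project",
                  controls=["Restrict or remove default project"],
                  children=[Node("Deploy malicious workload to tenant cluster")]),
         ]),
])

for path in uncovered_paths(reality):
    print(" -> ".join(path))
```

In this sketch the phishing branch is reported because none of its steps carries a control, which is the kind of gap the "did we do a good job?" question is meant to surface.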
All we will say here is that all the attack trees in this report use Deciduous, a tool by Kelly Shortridge. We can do things like introduce controls, so the blue security controls, as indicated here, can possibly break parts of the attack tree and make some paths non-traversable. Attack trees really help with visualisation. We can annotate nodes to provide extra context. We can even quantify things with likelihoods and impacts, and feed attack trees into an overarching risk assessment. With that, I would like to hand over to Torrin for a bit of project background.
Awesome, thank you very much, James. To get started here, we need to understand what our scope is and what we're actually trying to threat model. For our project, as mentioned by James, we're not doing a security audit or review of the Argo project itself. Instead we are focusing on end user security considerations for actually deploying and managing Argo CD. To do this, we have created a sample deployment architecture in multi-tenant mode, set up between an operations cluster, or ops, and multiple, in our case three, tenant clusters that it is synchronising application state to. In scope for our project is that architecture that I previously mentioned, as well as the Argo CD interactions with the cluster components that it works with directly, and also the Gitea source control management for actually managing the source repos for each of the applications. Out of scope, we are not looking at a security assessment of peripheral Kubernetes components, nor the development and building of the actual applications that are being shipped by Argo CD. For our case, we are setting up the deployment architecture with demo apps, but we're not actually looking at an assessment of the applications that are shipped themselves.
On top of that, we're not performing a penetration test or any sort of offensive security analysis of the system, and we're not looking to provide a hardening guide towards specific data classifications, whether that's PCI, PII or HIPAA. So what is the project overall? For those interested, we have the demo architecture on our public repository, although right now it is in the process of becoming open source. So stay tuned for that in the coming days and weeks, so you can actually set that up yourself. What that architecture looks like is Argo CD deployed to a private EKS cluster, with an ops cluster that hosts the Argo CD control plane, and I'll get into the architecture specifics in a moment, but then it's synchronizing to three different tenant clusters within that EKS cluster. So for those interested, feel free once that's open source to try it out and try your hand at deploying it, as well as even trying your hand at developing some attack trees and enumerating some threats. For our threat modeling exercise, our published report is on the Argo project GitHub repository right now, so feel free to check that out as well if you want to see the detailed run-through of not only what we performed as far as security analysis, but also the controls and mitigations that we have provided as recommendations for what we found. For our findings here, and the key threats and recommendations, we are going to demo some attack trees and show how they enumerate some of the threats and map them out in a specific exploit chain later in this presentation. Those are mapped primarily on critical, high, medium and low severities to provide a risk strategy for actually prioritizing and categorizing them. So, a few key assumptions to start with. Tenants are strategically assigned app teams within the same org, and currently, for our deployment's sake, we have tenant zero, tenant one and tenant two to simulate the multi-tenant mode deployment of Argo CD.
The Argo CD deployment itself is deployed using Terraform modules. Argo CD is running its control plane on the ops cluster and is synchronizing state to the tenant clusters, but is not actually installed on the tenant clusters themselves. It is maintaining a synchronization between the ops cluster and the Gitea source control for those repos. For repository organization, we have a Gitea server hosted on the ops cluster itself, and we have one CD repository per application team, with a separate folder for the manifests stored within that repository. For the synchronization process, we have auto-sync set up for each of the tenants, and that is triggered based on commits to main. So whenever commits are pushed to the main branch of each of those app repos hosted on Gitea, auto-sync will be triggered and the changes pushed downstream to each of our tenant clusters to update and synchronize that state. So what are we building here? Let's look a little into the architecture that we're working with before we can actually understand the scope and see how we threat model our system. As I mentioned previously, we are deploying this on an AWS EKS cluster which is hosted in a virtual private cloud. As far as access is concerned, and if you run through the demo itself, you'll be able to SSH through a secure bastion host that actually allows you to issue commands from your local machine, whether that's through the API directly or through kubectl, to interface with the clusters, see what's going on and investigate. For the setup, as I mentioned, we have three different tenant clusters, tenant cluster one, two and three, which are being synchronized from the ops cluster itself, which holds not only the Argo CD control plane but also the Gitea server for source control management, where we are actually storing the application repositories themselves.
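To make the auto-sync setup more concrete, here is a hedged sketch of what an Argo CD `Application` tracking the `main` branch with automated sync could look like, built as a plain Python dict. The field names follow the Argo CD `Application` resource, but the helper name `build_application`, the repository URL, the cluster address, the `argocd` namespace and the `prune` and `selfHeal` values are assumptions for illustration and are not taken from the demo repository.

```python
import json


def build_application(name, project, repo_url, path, dest_server, dest_namespace):
    """Return an Argo CD Application manifest as a dict (illustrative helper)."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumes Argo CD runs in a namespace called "argocd" on the ops cluster.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": project,
            "source": {
                "repoURL": repo_url,
                "targetRevision": "main",  # track commits to the main branch
                "path": path,              # folder holding the manifests
            },
            "destination": {"server": dest_server, "namespace": dest_namespace},
            # Presence of "automated" enables auto-sync; the flag values here
            # are assumptions, the talk does not state them.
            "syncPolicy": {"automated": {"prune": False, "selfHeal": False}},
        },
    }


# All names and URLs below are hypothetical placeholders.
app = build_application(
    name="app3",
    project="tenant1",
    repo_url="https://gitea.example.internal/tenant1/app3.git",
    path="manifests",
    dest_server="https://tenant1.example.internal",
    dest_namespace="app3",
)
print(json.dumps(app, indent=2))
```

Building the manifest in code rather than by hand makes it easy to generate one consistent `Application` per team repository, though the demo itself may define these resources differently.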
So in multi-tenant mode, as many of you may be familiar with from deploying Argo CD, we have set it up between our three tenants, running the tenants as AppProject resources and the applications as Application resources in Argo CD. I've provided a sample of the actual AppProject and Application resource manifests, and you can see how we're actually setting the destinations here. In the AppProject itself, the destination is the tenant zero cluster and all namespaces, providing Argo CD with cluster-wide access for synchronizing the state of those applications. In our other clusters, so in tenant two and tenant three... sorry, in tenant one and tenant two, those are being synchronized at namespace level only and not cluster-wide. You can also see the selection of the source repos that we're actually creating in Gitea. When you run through the demo, you can create these over the UI or through CLI commands, whichever way you prefer, and they are linked through the Application itself, defining which ones should be synchronized to each application. So, an overview of what this looks like. Many of you might be intimately familiar with the UI offered by Argo CD, but you'll see that we have our app one, two, three and four applications, each of them healthy and synchronized to the ops cluster itself and the Argo CD control plane hosted within it. Here you'll be able to update and configure the manifests and the configuration for the apps, as well as the AppProjects, to play around with the demo.
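The difference between a cluster-wide and a namespace-scoped `AppProject` can be checked mechanically. The following Python sketch flags projects whose destinations allow every namespace or whose source repos are unrestricted. The `spec.destinations` and `spec.sourceRepos` fields are part of the Argo CD `AppProject` resource; the function name `audit_app_project`, the project names, the cluster addresses and the repository URLs are hypothetical stand-ins, not values from the demo.

```python
def audit_app_project(app_project):
    """Return human-readable findings for overly broad AppProject settings."""
    spec = app_project.get("spec", {})
    name = app_project.get("metadata", {}).get("name", "<unnamed>")
    findings = []
    for dest in spec.get("destinations", []):
        # A "*" namespace gives Argo CD cluster-wide reach on that destination.
        if dest.get("namespace") == "*":
            target = dest.get("server") or dest.get("name")
            findings.append(f"{name}: all namespaces allowed on {target}")
    # A "*" entry in sourceRepos lets the project deploy from any repository.
    if "*" in spec.get("sourceRepos", []):
        findings.append(f"{name}: any source repository allowed")
    return findings


# Hypothetical manifests mirroring the two setups described in the talk.
tenant0 = {
    "kind": "AppProject",
    "metadata": {"name": "tenant0"},
    "spec": {
        "sourceRepos": ["https://gitea.example.internal/tenant0/apps.git"],
        "destinations": [
            {"server": "https://tenant0.example.internal", "namespace": "*"},
        ],
    },
}
tenant1 = {
    "kind": "AppProject",
    "metadata": {"name": "tenant1"},
    "spec": {
        "sourceRepos": ["https://gitea.example.internal/tenant1/apps.git"],
        "destinations": [
            {"server": "https://tenant1.example.internal", "namespace": "app3"},
        ],
    },
}

for project in (tenant0, tenant1):
    for finding in audit_app_project(project):
        print(finding)
```

A check like this could run in CI against the manifests in each team repository, so that a wildcard destination is a deliberate decision rather than an accident.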
After we understand the architecture overview, we need to understand what our critical data assets are: what we are actually concerned with protecting, and what a simulated attacker or a malicious threat actor might be interested in compromising. To do this, we create what is called a data dictionary, which essentially acts as a blueprint of your critical data assets. To model your dictionary, you need to answer a few questions. What data is crucial to your system functionality? What data or secrets are used to secure your system? In many cases this is, by default, Kubernetes secrets or some sort of API token. And then, what data, if misused, could alter or disrupt the system itself? All important questions in their own right, which allow you to fine-tune and focus on the actual critical data assets that are part of your system, in our case Argo CD and Gitea synchronizing app state. Once you identify this, you can begin to categorize each of the identified assets. In our case we use the RAG approach, or red, amber, green, which associates red with high, amber with medium and green with low risk levels for our data assets. This is not the only approach we took for this threat model exercise. I'll do a quick run through this, but for those interested there is a detailed run-through in the Argo project end user threat model report. These are some of the enumerated critical data assets for Argo CD: for example, the Kubernetes service account token for Argo CD, as well as the initial admin password for actually setting up different users and configuring Argo CD from startup. We also enumerate local user credentials, which are used for actually accessing and manipulating the applications and app projects within Argo CD. Then we assign them their RAG classification, or their high, medium and low risk priority, and we put that against the CIA triad. So the impact on confidentiality, integrity and availability of
end user data is then mapped to our RAG approach, which is put into our data dictionary and allows us to begin the threat modeling exercise and start moving into attack trees. But before we actually get into attack trees themselves, we still need to model how data procedurally moves through the system, and to do this we create data flow diagrams. This answers the next question, of how you model your system. To begin, you want to start with an architecture overview, but once you understand the system you're actually creating and modeling, you need to understand how data is actually moving and flowing through it. To do this, we've created two levels of increasing granularity of the system itself. You'll notice that the dotted green line that runs through the system is denoting the trust boundaries, so the namespace trust boundaries as well as the cluster trust boundaries that need to be transcended for data to move in and out of your system and interact with the different components, as well as, in our case, with the different applications. Here we have Argo CD hosted in the Argo CD namespace and Gitea in the Gitea namespace, both within the operations cluster, and to move out of the cluster itself we need to transcend first the namespace and then the cluster trust boundaries to synchronize to each of the tenant clusters, which we'll see in the next data flow diagram. Level one gets a little bit more granular, so it shows each of the components as well and how they are operating within Argo CD and the tenant clusters. After transcending the operations cluster boundary, the auto-sync will trigger into each of the tenant clusters. In the case of tenant cluster one, that is cluster-wide, so Argo CD has access to the entire cluster and is able to synchronize across namespaces, in our case the app one and app two namespaces. For the other two tenant clusters, those are namespace-scoped, so it's two specific namespaces, in
our case for application three and application four, that Argo CD actually has permission and is directed to synchronize. Once we understand how data is moving in our system and we understand our architecture overview, we can get into the catastrophisation, as James mentioned, by creating attack trees. So what can go wrong? What are our doomsday scenarios? What is a potential threat actor looking to exploit in our system? To begin, I'm going to show one of our larger attack trees for this specific use case, which maps out five high priority threats, in our case in different out-of-the-box configurations, as well as different threats we have enumerated from this threat modelling exercise. A key note before I get into one of the specific attack chains here is the reality step. You start at the left hand side here with reality, where the attacker is, and typically you'll break this out into an external and an internal attacker. Internal might look something like your infrastructure provider, your cluster operator or a system administrator, whereas an external attacker is someone on the outside looking in on your system. There's a lot going on here, so we're going to break this down into a specific attack chain and see how an external attacker might move through the system. From an external attacker's standpoint, they are looking at an eventual takeover of a tenant cluster on our simulated infrastructure. To do that, they first need to enumerate some credentials, and a critical asset for doing that is the Argo CD service account token, to be able to maliciously configure or otherwise enumerate secrets within Argo CD's deployment itself. Moving along the chain, you can see that after the attacker first enumerates the kubeconfig credentials, they somehow need to access the secure bastion host. That might look like socially engineering a developer to get their credentials to SSH into the system itself, or some sort of phishing effort. Once they do
that, they can access the network itself and can begin to issue malicious commands on your tenant clusters. At this stage the attacker has cluster admin permissions. After lifting the kubeconfig credentials and stealing the SSH credentials of, in this case, the app developer, they can move further and transcend the trust boundaries in your system, and are now operating on your ops cluster. From here they will look to maliciously configure your downstream tenants and try, in our case, to deploy a malicious deployment to those tenant clusters. Once the service account token is compromised, a list of information about the Argo CD deployment can be enumerated. This looks like something like the initial admin password, which is not deleted by default, although the project suggests deleting it, as well as other secrets and tokens stored on your clusters. From here the attackers now have cluster-wide access, so they can move from your ops cluster into your downstream tenant clusters and begin to attempt a malicious deployment into them. In this case the end goal for the attacker is a full cluster takeover, which is possible through the steps that we have outlined and run through here: after reading the cluster-wide secrets enumerated on the ops cluster, they can move and transcend to the tenant cluster itself. So this is a larger example of a mapping of a few of our high-priority threats, but after this I'm going to take a more granular look at one of our other attack trees. This attack tree looks at the tenant cluster takeover due to the unrestricted default project of Argo CD. Starting out, you have your reality, and from reality the attacker can begin to enumerate some of the initial passwords, specifically the initial admin password of Argo CD, to maliciously configure Argo CD. As we are running a little low on minutes, I will gloss over this one, and then we'll look at some of the security threats and
recommendations that we have enumerated from this. So what can we do about it? What are our controls, what are our mitigations that come from this threat modelling exercise? Some security threats and considerations follow a few key questions here. First, who can Argo CD access? In this case, we're looking at giving Argo CD namespace-specific or namespace-scoped access as opposed to cluster-wide access, limiting the permissions of service account tokens in Argo CD's deployment itself, and using workload identity, which could look something like SPIFFE and SPIRE, and there are a lot of other implementations as well, to connect to the tenant clusters themselves securely. On the second question, who can actually access Argo CD? This gets into our multi-tenant deployment, as well as disabling some of the out-of-the-box configurations, specifically the local admin user that's created with admin level access to Argo CD to create other users and modify your AppProject and Application resources; protecting the credentials themselves using an external key management service; using SSO to enforce MFA, if it is applicable and is not overkill in your use case; and limiting user accounts based on RBAC. This can be done natively through Kubernetes as well in our demo deployment. And then finally, how does Argo CD actually manage applications, so in our case AppProjects and ApplicationSets? The first recommendation here is actually deleting the default Argo CD project, as it has a lot of insecure out-of-the-box configurations that are very useful for setting up but should be deleted, or at least controlled if it is not deleted to begin with, as well as limiting the repos and clusters that AppProjects can manage, to limit the blast radius of an attack if it were to be realized. And finally, admins, whether that's an application admin or a system administrator, should be the only ones to actually manage your ApplicationSets. This gets into the
principle of least privilege here at the end. There are a few high priority findings. Given we're short on time here, I won't run through each one of them. The Argo CD end user threat model report is on the Argo project repository, and it goes through each one of these and gives you some recommendation strategies on them. So, to the last question, on how we can improve: we first need to understand whether we did a good job, in order to iteratively improve. Through the exercise that we performed on Argo CD, we enumerated 19 threats, six of them being high priority, based on the STRIDE approach, and provided recommendations and mitigation strategies for each of them. Key to note here is the actual mapping of the high priority threats on our two attack trees and implementing controls to break those attack chains. So, a quick thank you to the Linux Foundation and the Intuit team for helping us with this, and thank you everyone for listening in.