Hi everyone, my name is Moshe and I'm the founder of Flanksource. At Flanksource we're building an internal developer platform for GitOps fans. I've been a GitOps fan since before GitOps even became a thing: I spent quite a bit of time in the OpenGitOps working group helping to define the principles of GitOps, and before that quite a lot of time in the Kubernetes community, especially in the Cluster API group. Today I'm going to talk a little bit about our journey at Flanksource building a GitOps-enabled SaaS control plane. Let's get started.

When we looked at building a SaaS control plane, the first decision we needed to take was what type of architecture to build it on, and there are really two options: a shared model and a dedicated model (the dedicated model is often a hybrid, with some dedicated infrastructure and some shared infrastructure). The characteristic of a shared architecture is that everything is shared: costs are quite low, but your customizability and flexibility are quite limited. In a dedicated SaaS architecture, where you're actually provisioning infrastructure for every tenant, costs are quite high, but you can achieve high levels of isolation and offer a much wider range of services than you can with purely shared infrastructure. Mission Control has a lot of orchestration capability and a lot of scripting, and that made it an easy decision for us to use a dedicated SaaS architecture. We're going to look at how we went about building that.

Before we do, a quick recap on what GitOps is and what it really means from a control plane perspective. First and foremost, GitOps is declarative, and this is the foundation on which everything in GitOps is built. Rather than applying operations to your control plane imperatively, whether through a management UI (click-ops) or a CLI (CLI-ops), everything that you do on that control plane and on tenant infrastructure should be declared up front. Once it's declared, that declaration should be versioned and immutable. Git is the most common way of doing this, but you can use a database; there are some challenges, which I'll get into, when using a database as a GitOps store, but it's certainly possible. The third attribute is that all changes must be pulled automatically, and this is where tools like Flux CD and Argo CD come in. These will also reconcile continuously: provisioning the infrastructure once and then forgetting about it isn't good enough. You need to continuously check that infrastructure so that any drift between the state store and what's actually running is eventually reconciled, and you get an eventually consistent system that doesn't have any snowflakes in it.
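To make the last two principles concrete, here is a minimal sketch, in Go, of what a pull-based, continuously reconciling control loop looks like. The `fetchDesired` and `fetchActual` functions are hypothetical stand-ins for reading the declared state out of Git and inspecting the real infrastructure; none of this is code from any tool mentioned in this talk.

```go
package main

import (
	"fmt"
	"time"
)

// TenantSpec is a toy declarative description of a tenant.
type TenantSpec struct {
	Name     string
	Replicas int
}

// fetchDesired would read the declared state from the Git repository;
// here it is a hypothetical stub.
func fetchDesired() map[string]TenantSpec {
	return map[string]TenantSpec{"acme": {Name: "acme", Replicas: 2}}
}

// fetchActual would inspect the running infrastructure; also a stub.
func fetchActual() map[string]TenantSpec {
	return map[string]TenantSpec{}
}

// converge applies whatever change moves actual toward desired.
func converge(desired, actual map[string]TenantSpec) {
	for name, want := range desired {
		if got, ok := actual[name]; !ok || got != want {
			fmt.Printf("reconciling tenant %s -> %+v\n", name, want)
			// create or update the tenant here
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok {
			fmt.Printf("pruning tenant %s\n", name)
		}
	}
}

func main() {
	// The loop never stops: drift introduced out-of-band is corrected
	// on the next tick, which is what gives eventual consistency.
	for range time.Tick(30 * time.Second) {
		converge(fetchDesired(), fetchActual())
	}
}
```

The important property is the unconditional loop: the system keeps converging whether or not anything "changed", so there are no snowflakes.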
So what is a control plane, and what are the capabilities that we're looking for from one? First and foremost, provisioning: you want to be able to provision infrastructure in an automated way. This provisioning also has a very long life cycle: you might provision some infrastructure up front, but ultimately you're going to need to upgrade software versions, perhaps upgrade the underlying infrastructure, maybe migrate workloads between clusters, so provisioning is a life cycle rather than a once-off event. Cost is a really big factor: especially if you have trial users and you need to provision infrastructure for them, it can be a challenge to keep costs low, and resource placement and capacity planning all come into how you manage costs on your tenant infrastructure. With all that customer data you need high levels of security, and if you're provisioning infrastructure you probably want to reduce the amount of lateral movement attackers could make if they do breach a single tenant, so you want to isolate your tenants. And finally, you're probably doing SOC or ISO compliance certifications, which give you a long list of evidence and processes that you need to keep and follow.

So what does a non-GitOps control plane look like, and what are some of the challenges? We'll start with a typical setup using Clerk, a user authentication and tenant management SaaS, which we use at Flanksource. A request comes in and hits a database, where it normally gets stored; it might go into an event queue for an event-sourcing system, but more than likely it's going to hit some database of sorts. From that database, your provisioning service goes and creates the resources for you. This could be a combination of Ansible, Terraform, and CloudFormation; you might have some scripts in there, or a home-baked framework for it. Once that initial provisioning is done, you normally need a management UI on top of it, and this management UI will speak to your provisioning service, and maybe some other services, to get the state of infrastructure and tenants so that day-two operations can be performed.

Then you come to an audit, and the auditor is going to ask you a long list of questions about your database, and those questions align closely with the versioned-and-immutable requirement for your declarations. You could make a database into a GitOps-compliant data store, but it requires you to implement change tracking and WORM audit trails and to apply low-level ACLs; you need a very mature database footprint, which is non-trivial to implement. And even once you have a very mature and secure database implementation for your data, you'll ultimately get those big customers or incidents where you need to make a minor tweak to something, and there's no obvious way to do that unless you've bolted it into the management UI: there are no escape hatches.
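To give a feel for what that versioned-and-immutable requirement costs you in a database, here is a minimal sketch, in Go, of change tracking via an append-only revisions table: desired state is never updated in place, every change is a new row. The schema and function names are hypothetical, and a real implementation would still need WORM storage guarantees and database-level ACLs on top of this.

```go
package tenantstore

import (
	"database/sql"
	"time"
)

// RecordRevision appends a new immutable revision of a tenant's desired
// state instead of updating a row in place. Placeholders assume a
// Postgres driver; the schema is hypothetical:
//
//	CREATE TABLE tenant_revisions (
//	    tenant   TEXT        NOT NULL,
//	    revision BIGSERIAL   PRIMARY KEY,
//	    spec     JSONB       NOT NULL,
//	    author   TEXT        NOT NULL,
//	    created  TIMESTAMPTZ NOT NULL
//	);
//	-- plus a REVOKE of UPDATE/DELETE so rows can never be rewritten.
func RecordRevision(db *sql.DB, tenant, specJSON, author string) error {
	_, err := db.Exec(
		`INSERT INTO tenant_revisions (tenant, spec, author, created)
		 VALUES ($1, $2, $3, $4)`,
		tenant, specJSON, author, time.Now().UTC(),
	)
	return err
}

// CurrentSpec reads the latest revision; the full history stays
// queryable for audits.
func CurrentSpec(db *sql.DB, tenant string) (string, error) {
	var spec string
	err := db.QueryRow(
		`SELECT spec FROM tenant_revisions
		 WHERE tenant = $1 ORDER BY revision DESC LIMIT 1`,
		tenant,
	).Scan(&spec)
	return spec, err
}
```

All of this is machinery that a protected Git repository gives you out of the box.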
So when we look at building a SaaS control plane that's GitOps-enabled, what we're really trying to do is build isolated control loops, because control loops are the thing that makes Kubernetes so wonderful and amazing, and I'll show you what that means. Given a Kubernetes Deployment, you'll have a controller that looks at that Deployment and goes out and creates a ReplicaSet. This operation is one isolated control loop: the ReplicaSet isn't necessarily aware of the Deployment, the ReplicaSet controller doesn't depend on the Deployment controller, and vice versa; they operate in isolation of one another. If the Deployment controller is down, you can manually go and create ReplicaSets yourself, and the ReplicaSet controller will function just fine. And you have this chain of isolated control loops that work across the board: the ReplicaSet controller, another control loop, hands a Pod off to the scheduler, the scheduler hands the Pod off to a kubelet, and then you have CSI controllers that provision resources and volumes. This is what makes Kubernetes great, and we want to achieve the same level of isolation, but at one level higher of abstraction: we want to apply control loops to tenants, across multiple regions, multiple clouds, and multiple clusters.

The stack we're using for this is fairly straightforward. We use Clerk for authentication and tenant management, and Git for a GitOps-grade repository. There are a couple of settings that you do still need to apply to a Git repo: enforcing no deletion of history, archiving it, applying approval rules, and ensuring that branches are protected. So there are a number of things you still need to do to make a Git repository a GitOps-grade repository, but they are fairly straightforward to implement. We use Flux as the GitOps controller that continuously reconciles packages. Our application is essentially a number of pods connected to a shared SQL database. Just outside of the stack, we use vCluster, which is a tool for isolating tenants: Mission Control does a lot of work with the Kubernetes API and uses a lot of CRDs, so we made the decision to deploy every tenant into a vCluster, and each tenant gets control over that Kubernetes API. And finally, we generate and store secrets using SOPS and key vaults; this gives us recoverability and restorability of environments and high levels of security.

Here's how this works. Again, we start with Clerk, which makes a webhook call to our tenant controller. This is a Go project that uses the go-git library to interface with the repositories. The first thing it does is generate the manifests that make up a tenant; in our case this is a namespace and a HelmRelease. It also generates a database password and encrypts it using SOPS. Together, these then go up to Git as a PR. You could also push directly to a branch, but a PR gives you a little bit more control in terms of the cross-cutting governance you can apply: you can, for example, apply rules so that deletions are only carried out after a pair of human eyes has looked at them, or you could implement checks that stop a deployment when a problem is detected, build pipelines, that type of thing. It gives you all of those great features that we were missing in the non-GitOps control plane; we get them for free with a GitOps repository. This interaction then produces a single isolated control loop: nice and easy to think about and manage.
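As a rough illustration of this flow, here is a heavily simplified sketch of what a tenant controller along these lines might do with go-git. The file layout, chart name, and repository URL are hypothetical, the SOPS encryption step is omitted, and opening the actual PR would go through your Git host's API, so this is not the real Flanksource tenant controller, just the shape of it.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"

	git "github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing"
	"github.com/go-git/go-git/v5/plumbing/object"
)

// tenantManifests renders the two resources that make up a tenant:
// a Namespace and a Flux HelmRelease (chart and repo names are hypothetical).
func tenantManifests(name string) string {
	return fmt.Sprintf(`apiVersion: v1
kind: Namespace
metadata:
  name: tenant-%[1]s
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: %[1]s
  namespace: tenant-%[1]s
spec:
  interval: 5m
  chart:
    spec:
      chart: tenant
      sourceRef:
        kind: HelmRepository
        name: charts
  values:
    tenant: %[1]s
`, name)
}

func provisionTenant(repoURL, tenant string) error {
	dir, err := os.MkdirTemp("", "tenant-repo")
	if err != nil {
		return err
	}
	repo, err := git.PlainClone(dir, false, &git.CloneOptions{URL: repoURL})
	if err != nil {
		return err
	}
	wt, err := repo.Worktree()
	if err != nil {
		return err
	}
	// Work on a branch so the change lands as a reviewable PR.
	branch := plumbing.NewBranchReferenceName("tenant/" + tenant)
	if err := wt.Checkout(&git.CheckoutOptions{Branch: branch, Create: true}); err != nil {
		return err
	}
	// The SOPS-encrypted database password would be written alongside
	// these manifests; encryption is omitted in this sketch.
	rel := filepath.Join("tenants", tenant, "tenant.yaml")
	if err := os.MkdirAll(filepath.Join(dir, "tenants", tenant), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, rel), []byte(tenantManifests(tenant)), 0o644); err != nil {
		return err
	}
	if _, err := wt.Add(rel); err != nil {
		return err
	}
	if _, err := wt.Commit("provision tenant "+tenant, &git.CommitOptions{
		Author: &object.Signature{Name: "tenant-controller", Email: "bot@example.com", When: time.Now()},
	}); err != nil {
		return err
	}
	return repo.Push(&git.PushOptions{})
}

func main() {
	if err := provisionTenant("https://example.com/org/tenants.git", "acme"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Everything past the push is Flux's job; the controller never talks to the workload cluster directly.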
From here, Flux picks up the manifests stored in the Git repo and decrypts any resources that it needs to. Note here that the tenant controller and the Flux controller, and the workloads in general, are physically isolated: the tenant controller is privileged, in that it can manage resources, databases, passwords, that type of thing, while the workloads are only able to consume things. Flux goes out and creates the new namespace, and inside it provisions the initial chart. This chart deploys a new vCluster; vCluster is essentially just k3s running inside of a pod, and it substitutes the API server that new pods run under with this new vCluster API server. It's important to note that we run these workload clusters with very limited privileges: they don't have access to other namespaces, they don't have access to elevated IAM service accounts; they're quite limited in what they can do. Then some of the secrets come in and get injected into the vCluster, and the Helm chart deploys into the vCluster. That creates a pod, but the pod doesn't actually run inside the vCluster: it's mirrored out to the parent cluster, and ingress and networking then happen at the parent cluster level, with the state being reflected back into the vCluster. You don't necessarily need to use vCluster; you could use a standalone Helm chart and just deploy pods. It really depends on the level of isolation between tenants that you're looking for.

If you're not running a pod-based architecture, or you're running a hybrid architecture where you have some pod-based resources and some cloud-based resources, there are a number of GitOps infrastructure controllers that you can use: Crossplane, which we use, AWS Controllers for Kubernetes (ACK), and GCP Config Connector are all good options for deploying cloud resources. I don't recommend deploying these resources from within the tenant cluster, from a security and privilege-escalation perspective; I prefer to keep that physically isolated. So what happens is that a separate Flux instance runs in a privileged cluster. This privileged cluster goes and creates resources from that same Helm chart, or maybe a different Helm chart, with the cloud resources. Crossplane picks up those resources and provisions the access (in our case, we're giving access to a database) and then assigns the IAM service accounts to the pods that are running in the workload cluster.

You can do escape hatches fairly easily here as well. Let's say you have an accountant, and the accountant wants to mark a tenant's billing as overdue and, as a result, pause the infrastructure and have it display warnings to the user, or decommission the infrastructure. They can go in and push changes directly to that Git repo, either as an individual making commits, or you could have a system making those commits via the tenant controller or a different controller. So you get a nice isolated control loop here again, and as a result you have lots of little isolated control loops that work independently of one another: easy to think about, easy to troubleshoot.
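As a sketch of how small such an escape hatch can be, here is a hypothetical Go snippet that an operator, or a billing system, could run to pause a tenant by committing a change. The values-file layout and the `paused` key are assumptions rather than Mission Control's actual schema, and it shells out to the git CLI in exactly the way a human with a checkout would.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// pauseTenant flips a hypothetical `paused` flag in the tenant's values
// file and commits it. Flux reconciles the change like any other; no
// management UI or bespoke API needs to be involved.
func pauseTenant(repoDir, tenant string) error {
	rel := filepath.Join("tenants", tenant, "values.yaml")
	// For brevity this overwrites the file; a real tool would patch the
	// existing values instead.
	if err := os.WriteFile(filepath.Join(repoDir, rel),
		[]byte("paused: true\nreason: billing-overdue\n"), 0o644); err != nil {
		return err
	}
	for _, args := range [][]string{
		{"add", rel},
		{"commit", "-m", "pause tenant " + tenant + ": billing overdue"},
		{"push"},
	} {
		cmd := exec.Command("git", args...)
		cmd.Dir = repoDir
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := pauseTenant(".", "acme"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The same pattern works for decommissioning: a deletion PR that only merges after review.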
So what are the benefits that we get out of this? Everything is auditable by default. The escape hatches are easy to implement: whatever the tenant controller does, you can also do yourself manually or via the API. We get that nice baked-in rollback and recovery from Git commits and tagging. The complexity is reduced significantly: whereas a non-GitOps control plane is highly coupled to its database and to the triggering of actions between services, in a GitOps-based control plane you have hand-offs between systems that are independent of one another and not even physically connected. And then you have a very low total cost of ownership, because you can implement this type of control plane with a lot of existing tools. We built our tenant controller in, I think, less than a thousand lines of code, so it's not a very difficult thing to do, and there are quite a lot of tools in this area. Those existing tools have a lot of mind share in the market, troubleshooting them in isolation is very easy, and as a result you have a system that is resilient, reliable, and cost effective.

Here are some links to some of the projects that we use: the Flanksource tenant controller, which is a little bit customized to our application and use case, but you could take it, fork it, and use it for your own purposes; Flux, the GitOps controller that we use; and vCluster, which we use for tenant isolation. Thank you very much everyone, I hope you enjoyed it.