Hi, folks, and welcome to Organizing Teams for GitOps and Cloud Native Deployments. Today I want to share with y'all some of the things we've learned helping teams adopt GitOps for cloud native. Some of it's based on our own research, and some of it's based on what we've observed working with teams directly. My name is Sundip, and I've been with Google Cloud for almost seven years. I've had several different roles and titles over that time, but ultimately they've all revolved around helping teams adopt and optimize for cloud in some form or fashion. You can always find me at Circus Monkey, that's C-R-C-S M-N-K-Y, on Twitter, if you've got questions about GitOps, DevOps, or anything else that comes to mind.

Now, there's quite a lot I want to cover with y'all today, so we're going to move through it pretty quickly. First off, I want to share some of what we've learned through our own DevOps research, and then I want to lay out some basics around the challenges and assumptions we'll use throughout the rest of the presentation. Then we're going to jump into different cloud native tenancy models and some of the associated workflows. Next I want to talk about versioning strategies as they relate to different deployment environments. After that, we'll cover how teams collaborate, particularly around upstream dependencies. And finally, I want to talk about guardrails and preventing declarative and imperative operations from breaking things in prod.

So let's kick things off with DevOps research. DORA, or DevOps Research and Assessment, is the largest and longest-running research program of its kind. DORA's goal is to provide an independent, tool-agnostic view into the practices and capabilities that drive software delivery performance. Rigorous statistical methods are used to present data-driven insights into the most effective and efficient ways to develop and deliver technology. Ultimately, our goal with DORA is to use what we've learned to help teams improve their own software delivery performance.

But how do we measure software delivery performance? In our research program we have found a valid and reliable way to do just that. There are two metrics representing speed and two metrics representing stability. For speed, we have deployment frequency, which is how often you deploy, and lead time for changes, which measures the time from a commit all the way to that commit being deployed into prod. On the stability front, we have change fail rate, which is essentially how many of the changes you ship have failures in them, and time to restore, which measures how long it takes you to remediate a problem in production.

Now, these four metrics can be applied to any kind of software delivery, whether it's web, mobile, firmware, what have you. And using these metrics we can bucket teams into specific categories: low, medium, high, and elite software delivery performers. But these are just trailing indicators of software delivery performance, and that's where the leading indicators come in. We don't have time to go through all of the analysis from DORA, but we know there are specific leading indicators that drive software delivery performance and have a positive impact. For the purposes of this slide I could only include a subset, and I tried to best capture where GitOps fits.
The capabilities listed here on the left all contribute to improvements in either culture or continuous delivery, and those two have the most positive impact on software delivery performance. They also have the most positive impact on the stuff that drags teams down: the toil of rework and deployment pain, which ultimately leads to burnout.

Now, something that can be surprising when we talk about all of this material from DORA and the DevOps research is the assumption that it's going to slow things down, right? That if we get better at the process and better at the approach, we will lose velocity. What we found in the research is actually the opposite: as teams have gotten better and more stable, they've been able to increase their velocity. It all comes back to a concept from lean manufacturing, the idea of working in small batches. Each small batch has only a small impact on the overall system, but you have a lot of them moving through the pipeline, and it's easier to roll back a small change. Ultimately that makes it easier to operate at high velocity and with a high degree of stability. DORA applies predictive analysis to identify the specific capabilities associated with high-performing teams, and that's how we see this reflected: stability goes up and velocity goes up as well.

That's just a little bit about the DevOps research, and it's important because one of the big takeaways for us from DORA is that it's not about the tools; it's about the process and the people involved. That's what really drives the stability and velocity improvements.

So now, before we get further in, I want to lay out some ground rules, starting with some of the challenges we're going to see and some baseline assumptions we'll work with throughout the rest of the talk. The overarching challenges for cloud native teams are relatively simple and straightforward: they usually amount to having a lot of individual teams pushing code to a lot of deployment environments, and those deployment environments are spread across multiple regions. It's a simplistic view that ultimately encompasses quite a bit of complexity, so let's try to break it down, starting with some foundations and assumptions.

Why do teams even want GitOps? It's because they want to get out of the imperative operations business. Those approaches are hard to scale, hard to fix, and hard to roll back, especially when there's a problem. So instead we adopt a GitOps approach, which gives us some very specific properties. First, it's declarative: a system managed by GitOps must have its desired state expressed in a declarative fashion. That also makes our resulting infrastructure and applications versioned and immutable, because the desired state is stored in a way that enforces immutability and versioning, and retains a history of what happened up to that point. And with GitOps we want software and updates to be pulled automatically: the cluster pulls changes in from the repos as they're committed, and those changes are continuously reconciled to protect against drift. A minimal sketch of what that looks like in practice follows below.
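As an illustration, here's a minimal sketch of what a declaratively managed, automatically reconciled application could look like with Argo CD — the repo URL, paths, and namespaces are hypothetical, and Flux or Config Sync have equivalent constructs:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-a-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/team-a-config   # hypothetical config repo
    path: deploy/
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # continuously reconcile away drift on the cluster
```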
So if there's an imperative operation against a particular cluster, against a resource that's coming from a Git repo, we know that imperative change will get overwritten on the next reconciliation loop. Those are the principles we get by adopting a GitOps-style approach.

Now, with GitOps out of the way, I want to lay out some of the other assumptions I'm making up front. For our notion of "many teams," we're going to categorize them into some pretty coarse buckets, so forgive me: application teams, operations teams, and platform teams. For infrastructure, we'll be assuming Kubernetes as your cloud native deployment target, plus some GitOps tooling. We don't need to be specific about which GitOps tooling, whether it's Argo CD, Flux, or Config Sync; just know that most of what we're going to talk about involves one of these popular GitOps tools. For deployment regions, we base that on where our clusters are physically located, whether that's a private data center, a colo facility, or a cloud region. And finally, for deployment environments, we have the standard prod, staging, QA, and dev.

Before we move on, I want to talk quickly about teams and responsibilities, because I know those were coarse buckets, and I want to make sure we're clear on who is responsible for what in this kind of setup. Starting with application developers: they're of course responsible for all things related to the application — building, packaging, testing, and so on. Then we have app operators: they're responsible for deployment manifests and for making sure the app or service is up and running. And finally we have the platform admins. They cover the infrastructure bits — not necessarily the compute layer of Kubernetes itself, but the level just above it: things like RBAC, quotas, resource limits, all the initial infrastructure that has to get laid down on Kubernetes before application teams can run and scale. Again, these are coarse and imperfect categorizations, but in my experience most organizations can model their way into something that approximates this division of responsibility.

All right, so now let's go through a couple of different tenancy models and some associated workflows. Tenancy in Kubernetes comes down to ownership and access. If your team has the run of the entire cluster, it's probably a single-tenant setup. Having access to the whole cluster doesn't necessarily mean you'll be given cluster-admin privileges; it just means you can deploy essentially willy-nilly across the whole thing, and the reason you probably have a single-tenant setup is that you have some particular scaling or hardware need, like high-performance storage or attached GPUs. If your team gets a namespace as its only playground, then you're probably living in a multi-tenant world, and this tends to be the case for the long tail of application teams deploying to Kubernetes; a rough sketch of what that namespace boundary can look like follows below.
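As a sketch of what "a namespace as your only playground" can look like in practice — team names, namespaces, and group names here are hypothetical — the platform team might hand the app team a namespace plus a Role and RoleBinding scoped to it:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-team-edit
  namespace: team-a          # the tenant's namespace
rules:
  - apiGroups: ["", "apps"]
    resources: ["deployments", "services", "configmaps", "secrets", "pods"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-team-edit
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-operators   # hypothetical group for the app/ops team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-team-edit
  apiGroup: rbac.authorization.k8s.io
```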
Now, regardless of which approach, single-tenant or multi-tenant, the platform team still has a role to play, so let's explore where they fit into this equation as well. Every organization employs different collaborative approaches, and cloud native deployments are no different: ops teams may have shared repos with platform teams, or they may have distinct repos. In the case on the left, the GitOps setup is simple, but the organizational process may be more challenging, because there's more coordination involved when two teams share the same repo. On the right, you've flipped that problem on its head: you've made the GitOps setup more complex with distinct repos, but the organizational setup is easier. These teams don't really have to communicate; they just push their own objects to their own repos, and those repos get pulled into the clusters. Ultimately, as long as teams aren't stepping on each other, this setup is all good.

Multi-tenant approaches look very similar; the only difference is the scale and complexity of the repo management. In a shared-repo approach, that may have to be accomplished via things like PR reviews on protected branches. Or you can have distinct repos, where platform and ops teams are completely separated. That simplifies most of the day-to-day Git management, but it makes the GitOps configuration much more complicated. The GitOps tools out there today have different ways of supporting this, like Argo CD's app-of-apps pattern; with Config Sync, there are root repo and namespace repo options as well. Ultimately you're putting the complexity back onto the GitOps tooling and simplifying the work on the organization.

So let's take a look at an example workflow where application, operations, and platform teams all have separate repos, but they're effectively able to collaborate without stepping on each other. It starts with the dev team writing application code and building artifacts, and those artifacts get stored in some sort of artifact repository. Then the ops team starts with a base config and builds their specific manifests, referencing the artifacts the application team built. Those manifests are then hydrated into actual config and stored in a deployment environment repo, usually specified by a branch, and that config is continuously delivered to Kubernetes. So there's a distinct step between the templated configuration and the hydrated configuration; a rough sketch of that hydration step follows below. Meanwhile, the platform team writes the infrastructure and policy manifests and pushes those directly to Kubernetes, or through some tooling; those are intended to prevent imperative or runtime issues like quota or resource limit violations. So that's one example workflow. It's not a one-size-fits-all — you'll hear me say that many times — but it's one example of how these three teams can coordinate and collaborate together.
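To make that hydration step concrete, here's a minimal sketch using a Kustomize overlay — the directory layout, image names, and registry paths are hypothetical, and other templating tools work just as well:

```yaml
# staging/kustomization.yaml -- overlays a shared base/ directory of manifests
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: team-a-staging
resources:
  - ../base
images:
  - name: example-app        # image name referenced in the base Deployment
    newName: us-docker.pkg.dev/example-project/containers/example-app
    newTag: "1.4.2"          # artifact version built by the application team
```

Running `kustomize build staging/` produces the fully hydrated config, which is what gets committed to the deployment environment repo or branch and then pulled into the cluster.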
Now, if we build upon that example, we need to talk through some additional considerations. For starters, what if config and infrastructure — that is, the ops and platform teams — shared a repo between the two teams? How do those teams need to work together? Is the repo owned by the platform team or by the ops team? Are there permissions or protected branches to worry about? If it's owned by platform, can ops just push whatever commits they want? Or, in the shared model, is there a PR-based approval process for a particular branch, like prod, or does ops have total control there as well?

Then there are cluster objects. What if ops teams and platform teams actually need to collaborate on things like quotas or namespace configurations? In a shared repo, PRs could be used for that sort of work. With distinct repos, there are other approaches: for example, the ops team can make PRs against the platform team's repo for things like quotas, namespaces, and resource limits, while the platform team uses policies as a means of ensuring ops doesn't deploy something weird or break things in prod. The collaboration looks very different — one is a PR-based approach, one is a policy-based approach — and it really comes back to how your teams view the division of responsibility and how they want to divide that work. There's also the option of an approval process at continuous delivery time, before objects get pushed to Kubernetes. Maybe that's done by the platform team, and maybe it only applies to prod and not the other deployment environments like dev, QA, and staging, because we want to stay out of people's way as much as possible and let them work quickly.

These are all the sorts of things that need to be understood, and again, it's not one-size-fits-all. Every organization is different in its own ways, and everyone views division of responsibility and ownership differently. So instead of trying to figure this out with tools — which is not going to work — don't let the tools drive the process. This should be decided and documented by the platform and ops teams together. They should be collaborating on what the process is going to look like before they even get to the tools, because the tools can be made to do whatever you need and can follow whatever process you've got. But if there's no clear indication of ownership, permission, or responsibility, you're left wondering how you're going to fix and understand all of this.

Versioning is the next topic. Versioning is relatively straightforward, but there are a couple of considerations to remember. This is not hard-and-fast guidance that's 100% correct — by no means — but for many organizations and their teams, a branch per non-production environment tends to work well and provides a pretty clear process and lineage for GitOps deployments. That means for dev, QA, and staging, we use branches within our Git repos to define what gets delivered to Kubernetes. The example YAML below is from an Argo CD Application manifest where we're specifying the target revision to come from the staging branch of the repo; of course the other GitOps tools out there, like Flux and Config Sync, support similar approaches — I just wanted to show one example.
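Here's a minimal sketch of what that staging example might look like — the repo URL, paths, and names are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-a-app-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/team-a-config   # hypothetical repo
    path: manifests/
    targetRevision: staging        # track the head of the staging branch
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a-staging
```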
Now, for releasing to prod, we move away from the head of any particular branch to something a little bit safer. The safest approach, always, is to use a commit hash. Hashes are immutable and give us a very specific commit to pin against, but as teams and applications grow, this can be challenging to maintain, especially as teams have different release velocities and the number of applications explodes over time. So instead of trying to update a whole bunch of GitOps controller CRs with different commit hashes all the time, we can take a slightly different approach and use tags. Tags are the next best option after a commit hash; their downside is that they're mutable, so we're back to good process and hygiene around Git being really important to keep tags from becoming a problem and from getting abused.

Regardless of whether you deploy to prod via commit hash or tag, you want to employ some good basic principles and practices. First and foremost, the CRs that specify repo hashes or tags should themselves be deployed in a declarative manner, not via imperative approaches like CLIs. Then you'll want to build out a distinct delivery process, outside of your application pipelines, to deliver those updated CRs that specify new branch names, commit hashes, or tags. That delivery process should match what your organization wants — whether that's a blue-green deployment that switches 100% of the traffic over, or a canary-style process where small percentages of traffic are shifted to newer versions of the application. It really comes back to what your teams want as the outcome. And finally, you want a documented approval process, whether via human or automation, to orchestrate a safe rollout. Everyone should be able to say how a deployment is going to get orchestrated. If it's automation, that means the automation checks health and readiness before progressing further into the deployment. Or is there a human who makes that decision and says, "We're going to deploy 20%, I'm going to check the numbers, then deploy up to 50%," and so on? Either way, it should be written down and transparent to every application team so they know how the deployment process to prod works.

Now, another way that dev, ops, and platform teams can collaborate is via an upstream dependency process. This is often done using things like Helm charts. In this section I want to quickly mention another approach — we're not going to spend a ton of time on it because we're flying through all of this, but I want to make sure y'all are aware of the other options out there, especially ones that match the GitOps model more closely. That approach is called kpt. We don't have time for a full kpt tutorial or walkthrough, but I'd recommend y'all take a look at kpt.dev. I like to think of kpt as another way to use package management semantics, but with bundles of Kubernetes config — that's it. A rough sketch of what a kpt package reference looks like follows below.
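As a rough sketch under those assumptions, a kpt package's Kptfile records where the bundle came from and which version it's pinned to — the repo and package names here are hypothetical:

```yaml
apiVersion: kpt.dev/v1
kind: Kptfile
metadata:
  name: redis
upstream:
  type: git
  git:
    repo: https://github.com/example-org/platform-packages   # hypothetical platform repo
    directory: /redis
    ref: v1.2.0                    # the platform team's approved release
  updateStrategy: resource-merge   # merge upstream changes into local edits on update
```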
One example that often comes to mind when we talk about upstream dependencies is the idea of approved software packages that can be used by application teams. Think of things like Redis or MongoDB: maybe the platform team or the security team has approved Redis, but with very specific configuration details, so they don't want anyone just grabbing the Redis image and deploying it on their own. They want a carefully controlled Redis artifact and Redis configuration to be what gets deployed. So how do those platform teams share that with their application teams or their ops teams? This package management approach is pretty helpful because, one, teams can pull in that bundled Redis config as a package, and two, it provides the opportunity to update it: as the platform team updates the configuration or revs the version of the Redis deployment, the application and ops teams can pick up that update using regular old package management semantics. That's why I like this approach, and it's worth looking into.

Now, the last topic I want to talk about, as it relates to teams, GitOps, and cloud native, is guardrails and guarding against danger. I'll be using some key terms in this section, so I want to quickly define them up front. First is policies. Policies are rules that tell us how we can configure our resources — pretty straightforward. When you're using Kubernetes, policies can specify things like what labels are allowed on a Pod, or requiring images to have specific tags, that sort of thing. Policy management is the mechanism that helps us with the ins and outs of a policy: think of it as the framework and runtime that helps us manage policies, pull in external data, and handle packaging and testing. And the last part is policy enforcement, which refers to the actions that will be taken and the scope of those actions. In the context of Kubernetes, the actions are things like allowing or denying admission to the cluster, and the scope covers the types of objects — do I want to focus this policy on Pods, Services, Secrets, ConfigMaps, and so on — and which namespaces I should enforce that policy on.

Policies are packaged as a set of templates and constraints. The reason we do that is that the template gives us a rule, and the constraint lets us enforce that rule in different ways or in different scenarios, which gives us the ability to really target and narrow the scope of how we want enforcement to happen. The policy management aspect comes from Open Policy Agent (OPA), a broad and popular framework for managing policy, and the policy enforcement aspect comes from a subproject of OPA called Gatekeeper. Gatekeeper essentially packages up OPA and delivers it as a custom Kubernetes admission controller, so it's there to allow or deny admission to the cluster based on whether you violate a policy. Gatekeeper sits inside Kubernetes as an admission controller: for any incoming request — whether it comes from kubectl, a GitOps controller, or another API client submitting new objects to the Kubernetes API — the Kubernetes API checks with Gatekeeper and says, "This object wants to enter the cluster; what do we do with it?" Gatekeeper evaluates the request and returns a yes or a no, and it's just that simple. A rough sketch of what a template and constraint look like follows below.
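As a sketch, here's the required-labels example from the Gatekeeper documentation, lightly adapted — the constraint name, namespace, and label are illustrative:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: pods-must-have-team
spec:
  enforcementAction: deny          # or "dryrun" to audit without blocking
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["team-a"]         # scope enforcement to specific namespaces
  parameters:
    labels: ["team"]
```

The template carries the rule; the constraint says where and how strictly to apply it, which is exactly that targeting and narrowing of scope I mentioned.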
When enforcement happens, Gatekeeper reviews the incoming object, compares it against all the policies that are in place, checks the namespace scope, the object type scope, and the policy rule itself, and whether that policy is there just for auditing purposes or to deny entry altogether. Then it makes the decision and hands it back to the Kubernetes API, and the Kubernetes API rejects or allows admission. But crucially, the way this works out of the box today, it only happens at deploy time; it's not happening at any other point in the process. As an example, if an ops team built an object that violated a policy, and that object made its way all the way to prod, the only time they'd be notified is when the object was deployed to prod, not when they built it. That means there's a gap, a span of time where enforcement wasn't happening, and the team that was working on the change is probably out of the loop on what it was even about. Or the platform team has to go figure out which team submitted the config object that violates the policy and trace back the lineage, which can be done, but it takes work, and again it slows down the release process.

One of the things we talk about a lot in DORA is the idea of shifting left on security, and when we extend that idea to policy enforcement, we want to shift policy enforcement to the left: we want it to happen much earlier in the development and debugging process. That means when commits are pushed, we can have enforcement happen right there. As you push a commit, a test gets kicked off, and that test comes back and says, "Hey, this is actually going to violate a production policy; you have to go fix it. I can't pass the build — whether it's your application or a deployment object for Kubernetes — until you fix this, because it's going to violate a policy." You can have the same approach work on PR review as well: when a PR comes in, it's automatically tested and flagged if the object or application is going to violate a production policy. You do that by having the infrastructure or platform team write those policies and make them available for all teams to see and pull in, so every time a team does a commit or a PR they get the latest and greatest policy library, run their change against it, and find out right then and there whether they're going to violate a policy, well before they get to prod.

If you go back to the example workflow I had earlier, there are three main spots where we want to enforce these guardrails. I just mentioned one, around commits and PRs: that's the continuous integration phase. We should also have another policy evaluation at delivery time, just in case, to catch any last-minute things that bypassed CI or came in through a different path. And finally, we want to stick with the standard OPA Gatekeeper approach, running at the Kubernetes cluster, basically as a bouncer at the front door, so anybody who would violate policy through an imperative operation or some other API client also gets blocked right at the Kubernetes door. A rough sketch of the CI piece follows below.
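As a rough sketch of that CI check — assuming a GitHub Actions-style pipeline, Gatekeeper's gator CLI already installed on the runner, and hypothetical repo and path names:

```yaml
# .github/workflows/policy-check.yaml (illustrative)
name: policy-check
on: [push, pull_request]
jobs:
  gatekeeper:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch the platform team's policy library
        run: git clone https://github.com/example-org/policy-library.git   # hypothetical repo
      - name: Evaluate manifests against the policy library
        run: |
          # gator evaluates resources against the same templates and constraints
          # that Gatekeeper enforces in the cluster; assumes gator is on the PATH
          gator test --filename=manifests/ --filename=policy-library/
```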
Now, we've covered a lot of ground, but there's one thing I want you to take away from all of this, the most important takeaway through the whole thing: this is not a one-size-fits-all approach. Doing GitOps for cloud native with teams is a human and a process problem. There is no one approach that works for every single team, and no single approach scales to every organization. Instead, you want to take a deeply collaborative approach and work with your teams early and often on documenting process and understanding. That documentation could cover things like each team's role and responsibility, and be specific, down to things like "this team only writes Pod and Service manifests, or Deployment manifests," and "this other team is only responsible for ConfigMaps or Secrets." What we want is a clear idea, for everybody on these teams and everybody in the organization, of who to go talk to when I need to figure out which team is responsible for which aspect of our application deployment. And once you have that process understood, then you can design the tools to fit the approach you're trying to use to deploy applications.

Thank you all so much for your time today. I hope this was informative and beneficial. As always, feel free to hit me up on Twitter if you need to — that's @crcsmnky, Circus Monkey, on Twitter. Thanks.