Hi everyone, thanks for joining us. I'm William Denniss, a product manager on Google Kubernetes Engine, and I'm joined today by Alexis. Hi, I'm Alexis Richardson (is this too loud?). I'm CEO of Weaveworks and I'm also the chair of the TOC for the CNCF, which is the home of Kubernetes and other projects. I've been working on Kubernetes conformance, if you saw that yesterday in the keynote. Yes, we love Kubernetes conformance.

Okay. So we're going to talk today about how to go faster using modern practices, which are of course variations on older practices, and we call this GitOps to get across the concept of driving your operations from Git. I'm going to give you an example to start. This is a company called Qordoba which uses our products. They have a team of, I think, 15 or 20 people, they're based in San Francisco, and they have a website that does internationalization and localization (thank you), and they run all this on Kubernetes Engine and Container Builder with Weave Cloud. This is a chart they asked us to share showing their productivity, measured in Jira issues and other metrics they use. The lower line is where they were before they started using Weave and GitOps, which we're talking about today, and as you can see their productivity essentially doubled. They went all in not long after that, and it steepened after this. This is because they managed to go from deploying once or twice a week, driven by Jenkins CI, to automated deployments on Google Cloud using Weave's deployment automation, which we'll talk about today. That enabled them to reduce the time it took to respond to customers, by essentially going back to them twice as fast, and to reduce the time it took to fix bugs.

If you think about it, when you start to deploy more than once a day you get a feeling of empowerment and control. You're no longer doing things manually, watching things break and having to fix them; you're moving with confidence, and you stop worrying. I asked them, you know, what's the real benefit here? Developers stop having to think about this stuff. It frees up their minds to concentrate on the business problems they want to spend real developer time on, and theirs is based on things like machine learning. Machine learning developers want to machine-learn.
They don't want to deploy. So I think these numbers are pretty compelling in terms of the change, and it turns out that although we gave them a tool, it's a pattern, and you can do this in lots of different ways. That's really what we're going to be talking about in this talk, and if you use this pattern right, you too can go twice as fast by doing many more deployments a day.

So this pattern is called GitOps, and it builds on DevOps practices. At the core of it is the idea that Git is the single source of truth for your whole system. This is a really profound statement. It hasn't been easy to do this until fairly recently, because we've lacked enough declarative infrastructure to describe our whole system and then store that description in Git. But taking it to its extreme: once you've got the whole system described in Git, you can make all your changes to your running system through Git, which essentially means you no longer use things like kubectl; you start making pull requests instead. That has the additional benefit that any developer who understands how to do a pull request can also make operational changes, which lowers the cost of entry, which is amazing. Plus, Git is an amazing tool for working in teams: you get an audit trail, you get comments, you can see why people did things, and you can go back and forward in time. You shouldn't be rolling your own stuff here; you should be using a tool that works, and we've found that it does. We use it ourselves inside Weaveworks to run our own systems, and we can do things like blow up our entire system and recover in minutes just by using this technique. We'll talk about that.

But just to remind you, a lot of what we're showing you is an evolution of what people have already been doing for nearly ten years. Here's a chart I found on the internet, from a woman called Helen who runs a consultancy specializing in DevOps, really showing the evolution in stages, a tiny bit each year. I think GitOps and Kubernetes coming together represent the latest iteration of this, and hopefully we'll see more in the future with things like declarative apps and so on; I thought Brendan's talk on Metaparticle would fit quite nicely into this world too.

I mentioned declarative infrastructure, but just to be clear for those of you who are not sure what this is, here's a YAML, one of my favorite things in the world of Kubernetes. It's almost like you're outside the club until you figure out what the YAML is. This is a config description of the Kubernetes app, and the key point is that it's a set of statements that you can verify, and that you can use to reproduce a system. It's different from a set of instructions.
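For concreteness, here's a minimal sketch of the kind of declarative config being described (the names and image are illustrative, not the actual slide from the talk):

    cat <<EOF > deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hello
    spec:
      replicas: 3                # desired state: a statement you can verify, not a step to run
      selector:
        matchLabels:
          app: hello
      template:
        metadata:
          labels:
            app: hello
        spec:
          containers:
          - name: hello
            image: gcr.io/example/hello:1.0.2   # pinned tag, so Git fully describes what runs
    EOF

Given only files like this, the whole workload can be recreated from scratch, which is exactly what separates a description from a recorded sequence of commands.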
So we use declarative infrastructure up the wazoo: Kubernetes, Docker, Terraform, Ansible. Our entire system in Weave Cloud, which is a SaaS product, stores its monitoring rules, config, dashboards and code all together in Git, with a full audit trail around it that developers can work with. Then we can observe our running system and compare it with what's written down in Git, and we have diff tools which alert us when differences occur that are not picked up by the Kubernetes orchestrator. We have an ansiblediff, a terradiff and a kubediff, and all these things fire alerts and tell us when something is happening, and then we can converge from there.

So the original DevOps concept was: version-control your config. We're extending that by version-controlling everything, and Kubernetes lets us do that because it's such a rich system with so much that you can declare. Just to recap: Git is the source of truth for the whole system. You can observe the running system, which is potentially different from the desired state. If it is different, you need to be told, because that might be something you need to fix. When you do want to force convergence, you make a pull request; sometimes Kubernetes will pick it up for you in an orchestrated way, but sometimes it won't. We respond to anything from small changes to a full-on crisis this way, and it means that anyone can come into our team and be productive very fast if they know Git. I believe that "a developer" today pretty much means somebody who can merge a pull request in Git.

We're going to hand over to William in a minute; we're going to talk next about the three pillars of GitOps, try to go a little bit deeper into the fundamentals, and then come back up again at the end. The pillars are pipelines, observability, and the counterpart of observability, control. Observability and controllability go together: as you see things, you can fix them. On pipelines, this is an area where there's some confusion. The key point about automating your deployments is that everything has to be joined up, and that usually means build, deployment and release automation, with Git as the desired source of truth. That is what a joined-up pipeline is. William, do you want to take over?

Sure. So let's talk about how we actually join up these components into a pipeline. Deployments are controlled using the operator pattern, which you've probably seen a lot in Kubernetes. The operator is what takes the desired state that you declare and give to Kubernetes and turns it into state that it can then observe. One common example is the Deployment object in Kubernetes: you declare "I have a container and I want a hundred replicas of it", and that's the declared state. The operator is then responsible for driving the observed state in the cluster towards the state you gave it. So if there was only one replica and you said it should be a hundred, it's responsible for making that true; equally, if you're sound asleep and a bunch of nodes disappear and the replica count drops, the observed state is now not equal to your declared state, and the operator drives it back up (there's a small demonstration of this below). With GitOps, we've applied the same pattern used throughout Kubernetes to drive your entire config from Git to be the observed state in the cluster. So basically, to make this work, everything has to be in Git, and all the config should be treated just like code. Anything that is not recorded as a change in Git is effectively harmful to the system, because the operator is driving the observed state towards what you've described; if people are tweaking things in ways you haven't actually put in Git and described, it's not going to work.
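A quick way to watch the operator pattern at work, assuming the hello Deployment sketched earlier has been applied:

    kubectl apply -f deployment.yaml     # declare the desired state: 3 replicas
    kubectl delete pod -l app=hello      # perturb the observed state
    kubectl get pods -l app=hello -w     # watch the controller drive it back to 3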
What happens, though, if you're not starting from a clean slate? You already have a running cluster, and it's working just fine. To get to this happy path you need to have all your config under Git management, but you don't have that, so what do you do? Well, fortunately kubectl has an --export flag on kubectl get, and you can go through your resources one by one and extract them out of the cluster. Be careful with --export, though, because it intentionally drops some fields from the declarative config, fields it thinks are not required, things like IP addresses assigned by the cluster. Sometimes it gets it wrong: sometimes it drops too much, sometimes it misses something. So as you do this export from the cluster, make sure you're reviewing the content.

What about secrets, though? We're saying that everything should be in Git, every single change should be stored in Git, but of course secrets are not very good to keep in Git, right? They'd be visible to people they shouldn't be visible to. Well, now we have a pretty cool solution to this. It's called sealed-secrets; you can search for it on GitHub, it's open source. What it does is provision a private key into the cluster, and then provide a public key that developers can use to encrypt their secrets. So they encrypt those secrets and then store them in Git just like any other configuration. It then actually uses the same operator pattern: it has its own operator that watches these encrypted secrets and decrypts them with its private key. What that means is that if you want to take this cluster and recreate it, like we were talking about before with disaster recovery (and disaster recovery is actually a good litmus test of whether you've declared everything correctly, because if it doesn't come back up, you probably haven't declared it well), you'd bootstrap the new cluster with that one private key, and then all the other secrets come in and can be managed through the GitOps workflow. Really cool project.
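As a hedged sketch of that flow (the kubeseal CLI ships with the sealed-secrets project; exact flags vary a little between versions):

    # Encrypt a secret with the cluster's public key, and commit only the sealed form
    kubectl create secret generic db-pass --from-literal=password=hunter2 \
      --dry-run -o yaml \
      | kubeseal --format yaml > sealed-db-pass.yaml
    git add sealed-db-pass.yaml
    git commit -m "Add sealed database password"
    # In the cluster, the sealed-secrets operator decrypts this with its private
    # key and creates the real Secret; the plaintext never touches Git.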
All right, so what does the GitOps repository actually look like? Here I'm referring to the config repository. This is a separate repository from your code, and the reason we keep it separate is that you likely have continuous integration triggering on the code, and you don't want changes to the config triggering an image build, because that gets a bit weird. We tend to recommend one repository per logical application or service, and what I mean by that is anything that is tightly coupled together: it could be a bunch of Deployments, a bunch of Services, any other Kubernetes resources that are part of one output, typically of one team, with their own backwards and forwards compatibility guarantees or whatever.

Then what you can do is use a separate branch for each environment. So let's say you have three: staging, production and test. You probably have more, but the idea is that you have one branch per environment, and these branches map to a Kubernetes namespace or a completely standalone Kubernetes cluster. The reason we use a namespace or a cluster is so that you can reuse the same objects from staging to production. If you have a Deployment called foo, you don't want foo-staging and foo-prod; it's much simpler, and to be honest safer, to just call it foo and deploy it into a separate namespace or cluster. A common pattern we see is someone having a production cluster and then maybe a test cluster with a bunch of different environments in it.

So then the process is: any changes you want to make, you first make in Git. A very common change would be bumping the version of an image, right, because you've just released some code. But there are lots of other changes that should go through this as well; in fact, all changes should go this way. Things like health checks and changes to the replica counts should all go through this Git process, where you do a pull request, submit it to, say, the staging or a feature branch, get it peer reviewed and tested, and then when you're ready to roll it out to production, you basically just merge that change from the staging branch to the production branch. The benefit of doing it this way is that even though you're getting code review on this final merge, you've already tested that exact configuration in staging. Let's say you fat-fingered and mistyped the number of replicas: maybe you missed a zero and you're about to take down the system, or maybe you added a zero and you're about to run up a really big bill. Or the health check one, which is a really good one: if your health check is buggy, everything gets declared unhealthy, and what happens if everything is declared unhealthy? The whole system goes down. That's why you want to deploy these changes first to staging; then rolling them out should just be a merge, and at that point you know the changes are fairly safe.
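In practice that promotion step can be as small as this (branch and remote names follow the example above):

    git checkout production
    git merge staging              # promote the exact config already proven in staging
    git push origin production     # the production operator sees the change and converges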
Of course, there is one catch here, and that is what happens if you have staging-only changes. At the very beginning, as you're setting this up, or perhaps throughout the lifecycle, you may need to make changes to staging alone, and that case is a little bit tricky. I don't think there's a really good analogy here, but you just make the changes in the staging branch and then use the git merge -s ours strategy, which effectively records that you have merged the change without actually applying it. It just means that the next time someone does a merge, it's not going to pull in that change that was meant to be staging-only. Finally, definitely use features like protected branches to ensure that those code reviews are enforced on the production branch.

All right, so this is what the pipeline looks like, and I kind of wish I had a laser pointer here. Effectively the top pipeline is your typical CI pipeline: the developer commits a change, it goes into the Git repository for the code, Container Builder kicks in, builds the image, and uploads it to a container registry. Then we have an extra step at the end. Alexis talks a lot about this: a lot of people at this point just jam in their continuous deployment, and it's not so great doing it that way. What we do with GitOps instead is, rather than deploying that change directly, we edit the staging configuration and change the image to be the one that was just created. So you have an automatic push to staging, effectively; that's the config update at the end there (sketched below). Once that change has been committed to Git, the deploy operator I mentioned earlier kicks in and notices: wait a minute, the observed state in the cluster is now different from the desired state in Git, and it deploys that staging change automatically. Typically prod isn't set up for the same kind of automatic push, so you have a human, user two at the bottom, doing that merge from staging to production. It goes through code review, and once it's accepted it goes into the production branch of the config repo, at which point the deployment operator for the production cluster kicks in and does the same rollout. And I'll hand this back to you.

Thank you. You forgot one thing: I believe the operator has a name. All right, it just so happens that Weaveworks has an open source deployment operator called Weave Flux, which you can point at your cluster and which follows exactly that pattern. Yeah, and our cloud product builds on it; you can see the GUI at our booth, but the open source piece is Flux. Take a look, read about it, it's pretty cool. It's inspired by the same sort of motivations that Diane described yesterday when she talked about Spinnaker, but it's Kubernetes-native, which, as I think William has made very clear, is quite important.
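The "config update" box at the end of the CI pipeline can be sketched roughly like this (Flux automates the same idea; the repo URL, file name and NEW_TAG variable are hypothetical):

    git clone -b staging https://example.com/org/app-config.git
    cd app-config
    # Point the Deployment at the image CI just pushed
    sed -i "s|image: gcr.io/example/hello:.*|image: gcr.io/example/hello:${NEW_TAG}|" deployment.yaml
    git commit -am "Deploy hello:${NEW_TAG} to staging"
    git push origin staging    # the deploy operator notices Git changed and rolls it out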
So I'll say a few words about the other two pillars, and then we'll wrap up and have some questions. Observability, which has been all over Twitter a lot recently: in its simplest form it's the property of a system that it is observable, but it's also kind of shorthand for getting your monitoring, logging, tracing and visualization right, making sure that if things go wrong you can dig into your system and understand what's going on. If your system is not observable, you will not be able to find out what is wrong when something does go wrong, and that is a big problem; the more complex the system, the more of a problem it will be for you. Because we're talking about deployment, though, what is observability-driven deployment? There's a gentleman called John Arundel in the UK who came up with a really neat description of this, which boils down to: don't accept the pull request in that last stage if you don't have a way to check that what happened is the right thing, and, if it's gone wrong, to take action to remedy it.

So observability is essentially tied to GitOps, because it's the property of a system that lets you check that your pull requests are happening correctly or incorrectly at a service level, meaning anything from user experience (which is the most important thing, if you have users) down to deep diagnostics. You've got to integrate your GitOps pipeline with your tools for observing things, which means if you do a service push, you want to see the error rate on your service as it's happening, for example. And the more you adopt policies like canaries and staged deployments, the more important it becomes to see each stage's impact on the system early, so that you can make the right decisions. This is how you gain confidence in your system, and ultimately how you get to go faster and faster, to many deployments a day. It's also the ability to think holistically, like a doctor: is my system happy, is it about to become unhappy, is something looking or smelling wrong? You really want to stop problems before they become real issues, and then, of course, fix them if they do.

We're starting to see companies doing this. This is from Lyft; Matt Klein, the author of Envoy, has been showing this chart. It's a concept of a dashboard that unites elements of observability together: metrics like error rates or request rates, latency bars, together with events like "a deployment has just occurred" or "the canary went from 50% to 70%", which means somebody can focus on the service's health as the deployment is occurring. As this is a short talk we won't go too deep, but there are blogs about this that you can read.

Control is the counterpart of observability, and it basically goes back to that principle of GitOps: do things through Git to make sure you've got a consistent and correct running system relative to your desired state, meaning if you change your desired state in Git then of course your system must be updated as well, rather than messing around with kubectl if you can avoid it. And the more things we figure out how to describe using those good old YAMLs the better: things like security policy, application policy, monitoring, other properties. Good old orchestration is a model of control: what Kubernetes is actually doing for you, and other orchestrators like Swarm too, is correcting the running state relative to the desired state. And diffing, which we've mentioned a few times: these are somewhat crude but important tools for alerting you when something is actually different between the observed state and the desired state, so that you can fix it (a crude sketch of such a check follows below). Convergence is the way you do that, so this slide is a picture of it: desired state, observed state, desired state, observed state. Control means convergence. And here's a picture of that diff tool firing: something is wrong in Kubernetes. Some examples: I mentioned earlier that we have a habit of blowing up our entire system, a total wipeout, and we can recover from Git. There are also lots of smaller things, like changing a property of a service, or changing a policy on a service.
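A crude stand-in for what those diff tools do, using the built-in kubectl diff (a newer addition than the tools named in the talk; the repo URL is illustrative):

    git clone -b production https://example.com/org/app-config.git
    cd app-config
    kubectl diff -f . > drift.txt
    rc=$?
    if [ "$rc" -eq 1 ]; then       # kubectl diff exits 1 when live state differs from Git
      echo "ALERT: cluster has drifted from Git"
      cat drift.txt
    fi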
Having mentioned services a few times, I wanted to emphasize that this implies a certain lifecycle. If you want to know what you're doing, this is a picture of it: you start in Git, you go to Kubernetes, and then you use your monitoring and tracing tools to see what's going on and construct the observed state, and then you have a control loop to finish it off, which may update Git. So there you go: that is what operations is in the Kubernetes world and the GitOps world. And for those of you who are fond of acronyms like I am, there's a famous acronym called OODA, which came up in the 60s: observe, orient, decide, act. It's been used to describe many things; it was famously used by Coda Hale about seven or eight years ago to describe how Yammer do their operations. Inspired by that, I've called this the ROODA loop, because it's release-driven: release, observe, orient, decide and act. That's your lifecycle. And at the risk of being cheesy, as an ex-mathematician, here's a fundamental theorem of GitOps, the core of it: only what can be described and observed can be automated and controlled, and thereby accelerated. That is what Kubernetes really incarnates for you, and all of the other tools we'll see in cloud native should do this. So, William, back to you.

Right, so just to summarize the three core principles of GitOps that you can hopefully take away (and if you're on Twitter, by the way, we've got the #GitOps hashtag, so a bunch of this content is already up there). First, use declarative config to define your application and its services. Second, all changes go through a Git review process; no one should be using kubectl directly. That means even if you have a production emergency, don't reach for kubectl; use git push, or git push --force if you must, so that everything still goes through Git. And finally, use an operator in the cluster to drive the observed cluster state to the desired state, as declared by that same configuration in Git.

Last but not least, GitOps is for developers. I believe, and maybe you believe it too, that more and more people will have the job title of developer in the future, relative to operations. That doesn't mean operations is going to go away; it just means that the way we deal with systems is going to look more like how developers think and do things, as we get closer and closer to the dream presented, for example, by Brendan yesterday: something so simple that a young person could come in and build something sophisticated in very little time. GitOps is just taking that to its logical conclusion for operations, using cloud native tools.

Cool, okay. So do you want to ask us some questions? We have a mic in the middle there if you have any; we also have a couple of pre-seeded questions we can chat about.
Hey, awesome, I really like this whole thing, but I think the devil is in the details. This really makes sense for Kubernetes deployments, but when we talk about other things, like dashboards, Consul configs, all these different things: do you end up writing your own operators that can read what's happening, diff it, and then apply it to production? That's the first question. And the second question is: there's got to be some stuff in your stack that you're not doing this with. What is it, and why?

Sure. So I can describe a real system, which is Weave Cloud, which is a relatively complete interpretation of the GitOps idea. We have the three diff tools I mentioned: kubediff, ansiblediff and terradiff. Ansiblediff is just a wrapper around Ansible's own diff capability. Terradiff is something which tells you when Terraform is out of sync, and since we provision on Amazon using Terraform, we've found that to be very effective (a rough equivalent is sketched below). And kubediff is the one I showed you an example of. The one we don't have yet is a description of the app itself, and I think that's kind of next. There is a SIG for this, the App Def SIG, and we would very much like people who care about declarative definitions of applications to be part of that. So please: it's more than just the Deployment, it's kind of everything that sits on top of Kubernetes. Yeah, and I think for me the one thing we hadn't really used it with was secrets, actually; we found out yesterday that Bitnami had that really cool sealed-secrets tool, so it's good that there's a solution now. Thank you.
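Terradiff itself is a Weaveworks-internal wrapper, but the same "is Terraform in sync?" check can be sketched with plain Terraform (the exit-code contract of -detailed-exitcode is real; the alerting glue is a stub):

    terraform plan -detailed-exitcode > /dev/null
    case $? in
      0) echo "infrastructure matches the declared state" ;;
      2) echo "ALERT: Terraform reports drift from the declared state" ;;
      *) echo "terraform plan itself failed" ;;
    esac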
I have a question about the multi-branch strategy, and things actually diverging. In the Terraform world there's a wrapper tool called Terragrunt that basically delegates to Terraform, and you just define the variables, in other words just the different configurations for the different environments. So how do you manage it here so that you don't have all these different YAML files in complete chaos, versus a single source of truth for "I want to deploy this, with just these changes in each environment"?

So I think I probably should have said on that slide that this was just one example of how I've done it. I guess you could store the environment-specific values as config, and then you don't change the YAML at all; maybe a ConfigMap is what you'd use for things like the image and the replicas. Yeah, I got you. Okay. Yes, I think we're not super opinionated on the exact implementation; it's the pattern that's the best practice.

Two quick questions. What are the conditions that cause a diff in the environment, the triggers that cause a diff you wouldn't expect? And the second thing: do you use releases at all in your Git workflow, or is it just specific branches? I don't think we're using Git releases. I'm trying to think of some good examples of unexpected diffs; I mean, there are things that Kubernetes is not aware of, like machines going away, which might be picked up sooner with kubediff. We have a range of different alerts; see me later and I'll put you in touch with the engineers, and I'll give you all of them if you like. Yeah, so basically it's events that are out of your control. Okay, right. So I guess one example could be someone applying a manual change; what do you do then, right? With the true operator pattern you would just blow away that change immediately, but you might instead have a situation where the deployment operator observes the fact that someone modified the cluster manually and puts an alert through on Slack or something. Yeah, that's the kind of classic. I've mentioned several times that it's great when new people join the team because it's easy and simple, but new people can also join the team and do something unintended, and that's the kind of thing that gets picked up by these diff tools quite often. Or somebody goes out to lunch, someone else comes in, and doesn't realize that something has changed. But from my perspective, I would basically say whatever is in Git should be deployed, and if someone tweaks it manually, which they really shouldn't be doing, it would just get blown away. Well, that's what happens: if you use Weave Flux, it will do that to you.

And actually, one thing to add about this GitOps flow is that it's really nice how every change is an atomic commit, and every change can be rolled back straight away. Until we have this App Def thing, an application might consist of many different configuration files, and even though some of the Kubernetes objects, like Deployment, have their own history, that history isn't related to everything else. So while you can roll back a Deployment, can you roll back the associated change that you made to the Service? The answer is no. But with GitOps you can roll back a commit: you can literally do a git revert, and it will undo all of the changes together (see the sketch below). That's one thing I meant to mention when I said at the beginning that making Git the source of truth for your desired state is profound: it's because of things like that. It's kind of like comparing the old CVS approach to Git. In CVS you have version history, but on a per-file basis; with Git you get it across the entire repo, you can see who did what and who was around, and you can use Git rules for security as well. Right. And Qordoba, who we mentioned at the beginning, have discovered that their compliance work is now done by looking at the audit trail of the changes they made to the system, which is all recorded in Git, which is great; it helps them with compliance.
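The whole-application rollback being described is plain Git, for example (assuming the bad release was the latest commit on the production branch):

    git checkout production
    git revert --no-edit HEAD      # one commit undoes the Deployment *and* the Service change together
    git push origin production     # the operator converges the cluster back to the previous state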
Yeah, I wish there was a better solution for for the cluster upgrade but if you go to The website if you do a google search for git ops You'll find some blog posts which refer to An explanation of how we've worked we've cloud solved that specific problem in detail Okay, but it's it's it's not something I would necessarily recommend everybody adopts I mean, it's it's a system that we built for ourselves if you want something that's more like product Um that everyone can use and works the same way etc The good news is that things like kubernetes and and the tools around it like cops are starting to bring these capabilities in So you can have idempotent updates and things like that Cool. Thank you. Quick second question. Um, do you have any checks anywhere in this stack for you know people or new staff accidentally putting like latest instead of a specific image for a kubernetes deployment or something because Obviously that isn't potentially going to stay the same depending on the maintainer and you don't have any control over that Very good point. Um I I guess that should be You could maybe have that as a as a part of the pull request if you had like a bot or something reviewing Um, kind of the other one is signatures So an obvious extension of the diagram that william showed would be a step to check that an image is signed by consulting a service like Is it called grapias? Or notary from docker, which is in the cncf now makes sense Thanks. Yeah, that's actually an excellent point Like anything that that is like a work around this get thing is is dangerous whether it's the latest tag or Or whether your docker image is like pulling down like I heard someone was like pulling down a warfire right like As a screening so anything like that just breaks them all like us Hey, so you talked about metrics and you talked about pull requests. So Have you seen like pull it? tying in metrics and validating the pull requests or do they just happen completely unrelated to each other Yeah, we've got some examples on our booth actually that you can go and have a look at Um, we don't study metrics of the development process though Cloudbeats have some nice examples of that at their booth Okay, cheers. Great. Well, I think we're pretty much on time One last question. Um, yeah, so what's the reasoning behind having to get? 140 app and 140 config So again, this is kind of just our opinion But because the continuous integration typically gets triggered off a commit What you don't want is to have like a code change which triggers ci Which then updates this config of the staging which might like trigger another ci Or you have to have to make sure that wouldn't happen and you can kind of get like a weird loop Or you have like someone just changing config that then rebuilds the the image unnecessarily So I think you can get it to work that way But yeah, maybe there are some reasons To do it. Yeah, it's not a it's not a hard and fast rule I guess Okay, so the yaml files just live in the config. 
Hey, so you talked about metrics and you talked about pull requests. Have you seen anyone tying in metrics and validating the pull requests against them, or do they just happen completely unrelated to each other? Yeah, we've got some examples on our booth, actually, that you can go and have a look at. We don't study metrics of the development process itself, though; CloudBees have some nice examples of that at their booth. Okay, cheers.

Great. Well, I think we're pretty much on time, so one last question. Yeah, so what's the reasoning behind having two Git repositories, one for the app and one for the config? So again, this is kind of just our opinion, but because continuous integration typically gets triggered off a commit, what you don't want is a code change which triggers CI, which then updates the config for staging, which might trigger another CI run; you'd have to make sure that doesn't happen, or you can get a weird loop. Or you have someone just changing config, and that rebuilds the image unnecessarily. So I think you can get it to work that way, but maybe there are reasons to do it; it's not a hard and fast rule, I guess. Okay, so the YAML files just live in the config repo. So if I want to change the replicas, I cannot do it in the app repo; it would be in the config repository? Correct, with the way we've set it up; but of course if you have a better approach, by all means. I guess in my head I could never get around the fact that I don't want a replica change to actually trigger an image rebuild of the code. I mean, I guess you can do that and then just throw away the image, since the image isn't actually changing because the code inside it didn't change, but that's why we set it up this way. And I think we're out of time. Thank you very much, everybody. Yeah, thank you.