Thanks, Daniel. Well, as Daniel said, my name is Mitch Connors. I've been a software engineer at Google on the Istio project for three and a half years. Hello, everyone. I'm Stefan Prodan. I'm a principal engineer at Weaveworks, and I've been a Flux and Flagger maintainer for many years.

All right, well, let's get to it. Everyone here should probably already know that Istio makes your service mesh more secure. It's the leading reason for adoption among our users. They want the security. They want zero trust built on mTLS. They want the ability to set authN and authZ policies and to have their certificates automatically rotated. And Istio makes all of that really easy and seamless. But there's a little bit of a catch: 88% of Istio installations are running known CVEs. So if you're using Istio to make your service mesh more secure, and the Istio you're running has its own CVEs, some of them rated 9.8 on the CVSS scale, that's problematic. In fact, your service mesh could be less secure because Istio is there, rather than more secure.

So why is this? Why 88%? It's not that Istio isn't patching its vulnerabilities. We have patch releases that come out at least once a month, and each of them covers a bunch of known CVEs. But they're still not being picked up in the community very quickly. So today we're going to talk briefly about why our users aren't upgrading. I'm going to rely on Stefan a lot to tell us how GitOps can help with this problem. We'll have a handful of demos sprinkled throughout, and some takeaways.

This is sort of a timeline, a history of what we thought the problem with upgrades was in the Istio project, and as you can see, it has changed a lot over time. We first noticed that no one was upgrading Istio in Q2 of 2020. And we thought, well, maybe they don't know. Maybe users don't realize that there are CVEs in their Istio installation. So we ran a survey, and we did find a few users who were unaware, and we built some tooling and documentation to help boost that awareness. But we found a lot more users who were aware that their service mesh was vulnerable and just didn't feel like they had time to get to it. Istio is not the only thing anyone is running in production. They've got their own apps, they've got telemetry systems, maybe some eBPF systems that need to stay up to date. Staying on top of all of it is so difficult; it's like a treadmill you can never keep up with.

So we decided: let's make Istio upgrades easier. This was part of the 2021 roadmap, and you see it on the 2022 roadmap for the project as well: Istio upgrades should be easy. We launched surveys that went along with every release asking users how easy it was, and we saw, consistently, from one release to the next, Istio getting easier to upgrade. Nothing huge, but consistent improvement from one release to the next. So a year later, we ran our production survey again and asked: OK, how much of Istio is up to date in production? And 88% of Istio installations still had known CVEs. So, OK, difficulty was not the problem. What else? Well, as of last fall, you had to upgrade Istio four times a year, and that's a lot. So maybe we should move to something else. Kubernetes now has a three-times-a-year release process, and we looked at that a little bit. The sweet spot we landed on was an upgrade once every six months, while keeping our release process the same. So we still release four times a year.
But you can now upgrade directly from, say, 1.10 to 1.12, or 1.11 to 1.13, without doing an intermediate upgrade along the way. And we extended our support window so that 1.10 and 1.12 overlap by six weeks, giving you that time to upgrade. So now we've cut your work in half; it's really easy to upgrade Istio. And so we waited six months, and then we surveyed production, and we found out that 88% of all Istio installations still have known CVEs.

So it's not that it's too difficult. It's not that our customers don't know they should. It's not that it's too frequent. We need a new hypothesis. This one's really controversial: our 2022 hypothesis for why Istio upgrades aren't happening is that humans are bad at repetitive, monotonous labor. I know, I know, that's super controversial in a group of programmers; we all love our monotonous, repetitive tasks. But that's what we're going after this year. That's how we hope to move the needle down from 88%. Stefan is going to share how we're going to remove that monotonous, repetitive labor from Istio installations with some products from Weaveworks. Well, they're not products of Weaveworks; they are CNCF projects.

OK, so I'm guessing everybody here knows what GitOps is. You've already seen this slide at this event, ten times per day probably. The idea behind GitOps is that you let your cluster figure out its state from an external source. With Flux, that can be a Git repository or an S3 bucket; we have many ways to define your cluster state outside of the cluster. So instead of you having to run commands to upgrade your infrastructure or your applications, you define how those should be automatically upgraded in your Git repo, for example with Kubernetes custom resources, and something running on the cluster sees that, says "OK, I need to upgrade myself from this source," and does it automatically.

I'm going to talk about the Flux projects. In the Flux project we have two main pieces. One is Flux itself, which is a collection of Kubernetes controllers; we have 12 custom resource definitions. What Flux aims to provide is a LEGO system where you can build your own continuous delivery solution that fits your use case. Flux is not opinionated in any way. We don't offer you a UI. We offer you something that allows you to build your platform.

Besides Flux, we have a second project called Flagger, and Flagger's role is to decouple the deployment from the release process. Flux does the deployment automatically based on what's in Git, but that doesn't mean that everything in Git should be automatically exposed to all your users. Flagger's role is to detect: oh, you want to deploy a new version of your application; let me test the new version on some of your users. So with Flagger, you can say: when there is a new version, move 1% of the traffic to the new version. See if that works. Measure SLOs, like error rates, latency, and other custom metrics. If that works, add more traffic, more traffic, and only if everything goes OK does Flagger fully roll out the new version to all your users. That's called progressive delivery. (A minimal sketch of such a canary policy is included just below.)

So this is what Flux looks like and what the continuous delivery cycle is. Flux's controllers are event-driven, so every time you push something to your repo, Flux will be notified and it will reconcile the new state. But you can also ping it from CI. It also monitors your container images.
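To make the Flagger description concrete, here is a rough sketch of a Flagger Canary resource for a hypothetical backend Deployment, using the Istio provider and Flagger's built-in success-rate and latency checks. The names, ports, weights, and thresholds are illustrative assumptions, not values taken from the demo repository.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: backend              # hypothetical workload name
  namespace: apps
spec:
  provider: istio            # Flagger shifts traffic via Istio routing
  targetRef:                 # the Deployment Flagger watches for new versions
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  service:
    port: 9898               # container port exposed through the mesh
  analysis:
    interval: 1m             # how often Flagger evaluates metrics and steps traffic
    threshold: 5             # failed checks tolerated before rolling back
    maxWeight: 50            # stop stepping once the canary receives 50% of traffic
    stepWeight: 5            # add 5% of traffic per successful step
    metrics:
      - name: request-success-rate   # built-in check backed by Prometheus
        thresholdRange:
          min: 99                    # roll back if success rate drops below 99%
        interval: 1m
      - name: request-duration       # built-in latency check, in milliseconds
        thresholdRange:
          max: 500
        interval: 1m
```

Given a resource along these lines, Flagger generates the primary and canary workloads and the mesh routing on its own, which is why the talk describes it as working directly on Deployments.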
Flux can also upgrade your container images inside the cluster. You can say, for example: for Envoy, every time there is a patch release, I don't want to go into Git and bump that patch version myself. You can tell Flux: scan the Envoy container registry, and if there is a new patch release, commit that to Git and upgrade Envoy all over the cluster for me. And Flux can run on a management cluster from which you target all your environments, like your production cluster or staging, or you can deploy a Flux installation per cluster and make each cluster independent. It also works very well with Cluster API.

OK, Flagger, I mentioned it. Besides integrating with Istio for routing traffic between versions, it works with a bunch of other service meshes and also ingress controllers. So if you are not into the service mesh hype and you just use an ingress controller, you can still do A/B testing and progressive delivery for your front-end applications. And of course, Flagger detects whether everything is working primarily with Prometheus, which comes with the Istio installation, with Linkerd, and everybody else. But it can also reach out to things like CloudWatch and, what's it called, the Google product where you have metrics. Oh, Stackdriver. No, it's not Stackdriver anymore, they rebranded. Yeah, the new Stackdriver. Also New Relic and other solutions.

So, OK: Flux can take care of shipping patches and automates the upgrade procedure of your apps and infrastructure, but who upgrades Flux? We release Flux every two weeks, so it's a lot of work to keep Flux up to date. We have a solution for this: Flux is able to upgrade itself. The way we do that is with a GitHub Action, or something in your CI, that monitors the Flux GitHub releases. When that action detects that there is a new patch release for Flux, it opens a pull request for you that bumps the Flux version directly in a branch. Then Flux says: oh, my own definitions in Git have changed, so now it's time to upgrade myself. It pulls the new version, starts a rolling deployment, and that's how you can keep Flux up to date, fully automated. And unlike Istio, you can jump from any Flux version to any other Flux version. We had Flux v2 users who installed it six months ago, ran the upgrade last week, and it worked, even though we released, I don't know, 60 versions between then and now. So yeah, Flux is all about backwards compatibility and being able to upgrade itself, because we think continuous upgrades are the way to go, and Kubernetes is great for that.

Okay, so I'm going to start a demo. This will be a three-part demo. What I'm going to show you now is how we can install Flux on the cluster and tell Flux to install Istio, install the applications, install the Istio gateway, and set up ingress, all of that with a single command. So I have here the flux bootstrap command. We can bootstrap Flux on clusters using this generic command, which works with any kind of Git server. You pass it your SSH key that has access to a repo, and what the Flux CLI will do is clone the repository, write its own definitions into the repo, commit those definitions upstream, and then install itself on the cluster. We also have flavors of bootstrap for GitHub, GitLab, and Bitbucket, where you can say flux bootstrap github and it creates the repository for you, sets up the deploy keys for you, and so on.
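For reference, here is a hedged sketch of the two bootstrap flavors just described. The Git URL, key path, owner, repository, and cluster path are all placeholders, not the values used in the demo.

```sh
# Generic bootstrap against any Git server over SSH (placeholder URL and key).
# The key needs read/write access, since the CLI commits the Flux manifests.
flux bootstrap git \
  --url=ssh://git@git.example.com/fleet/gitops-istio \
  --branch=main \
  --path=clusters/my-cluster \
  --private-key-file=./identity

# GitHub flavor: creates the repository and the deploy key for you (placeholder owner/repo).
flux bootstrap github \
  --owner=my-org \
  --repository=gitops-istio \
  --branch=main \
  --path=clusters/my-cluster
```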
But today, we are using the generic command that works with any kind of Git server. Okay, so I'm pressing enter; hopefully it will work. What's happening now? It clones the repository locally and pushes all its definitions to Git over SSH. Come on, come on, we are really in a hurry, Flux. And yeah, after it talks to the Git repo and places its configuration there, the next thing it does is configure itself for a particular path in the Git repo. So you can use a single Git repository to manage your whole infrastructure fleet: you can have a production cluster and a staging cluster in the same repository and migrate workloads from staging to production using that repo, or you can use dedicated repositories for each cluster or each environment you have.

Okay, after it has installed everything, now Flux is verifying itself, checking that all the controllers are running. We have four controllers which are part of the base Flux installation. One is the source-controller, which deals with all the Git operations; another one is the kustomize-controller, which can apply plain Kubernetes resources or Kustomize overlays; and finally the helm-controller, which we are using now for Istio. Flux is installing Istio on the cluster using the Istio Helm charts.

And if we look at it, we can run flux -n istio-system get helmreleases, and we should see here what Flux did. First, we have a thing in Flux where we can define dependencies between our infrastructure items. In the case of Istio, this is critical, because Istio has several charts: one contains the CRDs, another one contains pilot, another one contains the gateway. I cannot install the gateway until I have the CRDs, I cannot install pilot until then, and so on. So in the repo we can define this declaratively and say: install all things in a particular order, and also do the upgrades in that order. So when there is a new Istio release, we first upgrade the CRDs, then pilot, then the gateway, and so on. (A trimmed sketch of these HelmRelease dependencies is shown at the end of this demo.) And you see here, for example, the message that the dependency istiod is not ready. So what Flux does is look at the health check of each deployment and block until all Istio components are ready, and it deploys them in that particular order. So now we should see that everything is done. Yes, we have Istio installed on the cluster, all our applications are there, the gateway is set up. Okay, one of three demos done. Let's go. Fast, fast, fast.

All right, so that was a great introduction to how GitOps helps you manage the resources in a cluster, and we saw a little bit of what's going on with Istio, but let's dig a little deeper into what's necessary to automate upgrades for Istio with GitOps. And by the way, these principles should apply fairly well to just about any service mesh project. I work on Istio, so that's where I built this, but you should be able to take the principles and carry them over. But first, why did we choose to work with Flagger and Flux? The only progressive delivery API I could find that worked directly on Deployments, so that I didn't need to change anything about how the customer was defining their workloads, was Flagger. I've found other tools that will allow you to do progressive delivery with a pipeline where you define your own steps, but I didn't really want to do that on behalf of the user. Also, Stefan already did most of the work.
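To make the ordering described in that demo concrete, here is a trimmed sketch of what the HelmRelease dependency chain might look like. The chart names follow the upstream Istio Helm charts, but the repository URL, intervals, and the 1.13.x version range are illustrative assumptions rather than values copied from the demo repo.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: istio
  namespace: istio-system
spec:
  interval: 1h
  url: https://istio-release.storage.googleapis.com/charts  # upstream Istio charts
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: istio-base            # installs the Istio CRDs
  namespace: istio-system
spec:
  interval: 30m
  chart:
    spec:
      chart: base
      version: "1.13.x"       # semver range: patch releases can apply automatically
      sourceRef:
        kind: HelmRepository
        name: istio
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: istiod                # the control plane; waits for the CRDs
  namespace: istio-system
spec:
  dependsOn:
    - name: istio-base
  interval: 30m
  chart:
    spec:
      chart: istiod
      version: "1.13.x"
      sourceRef:
        kind: HelmRepository
        name: istio
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: istio-gateway         # the ingress gateway; waits for istiod
  namespace: istio-system
spec:
  dependsOn:
    - name: istiod
  interval: 30m
  chart:
    spec:
      chart: gateway
      version: "1.13.x"
      sourceRef:
        kind: HelmRepository
        name: istio
```

Because istiod depends on istio-base and the gateway depends on istiod, Flux installs and upgrades them in that order and waits for each health check before moving on, which is the behavior shown in the demo output.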
The GitOps repo that we're demoing today, you saw the link back here: Stefan's gitops-istio repo has been around for some time. I just recently joined Stefan in making some tweaks to it, which we'll be showing off this afternoon. So how is this going to work for Istio? What we're going to see first is that Flux is managing the Istio control plane. So when you merge an update, which by the way can happen when someone makes a change to your Istio installation, or when a GitHub Action notices that there is a new version of Istio available, we'll get a pull request saying: hey, new version of Istio, go ahead and install it. And there are checks and tests that run on that pull request, just like you all should have in your CI systems. Once it gets approved, Flux will see it in the main branch and will reconcile it, along with Flagger, Prometheus, et cetera. We already saw that work, but in a minute we're going to see an upgrade.

Also, I should take a step back, since we've been talking about pilot, or the control plane, and terminology can get really overwhelming really quickly in service mesh. I am almost certainly going to slip up and say data plane, sidecar, proxy, or Envoy interchangeably. Those are all the same thing. It sounds like really overwhelming vocabulary, but we're really talking about one thing: a proxy that runs everywhere. Everywhere you have an application, you need a proxy right next to it. So it's in the pod, running alongside your application and managing things. We just like to give it four names, because Istio. So Istio is divided between a control plane and a data plane. We just saw the control plane get installed, and now what we're going to demo is a control plane upgrade.

Oh, before the demo, I wanted to talk about this. Along the way, Stefan and I found that it is really difficult to integrate istioctl with GitHub Actions, so we launched our own GitHub Action for it. If there's anything you'd like to do with istioctl in your CI system, it just downloads the binary into your path so that you can run it. I strongly recommend using this for things like istioctl analyze, which we're not showing off today but is in the repo; you can see it at the URL. We mentioned before that we have an analyze step that checks and makes sure your config is good. (A rough sketch of such a workflow step follows at the end of this segment.)

But let's get to the good part again. Okay. Oh, actually, no, I was already where I wanted to be. That's the problem. So this is where we're running our demo from. Again, I've forked this repo from Stefan's, and that's actually your first step in the process: if you'd like to do the same thing, fork Stefan's repo. Don't fork mine, for various reasons. You can see that we've got our Istio installation, we've got our apps installed here, we've got cluster definitions. And interestingly, if we look at pull requests, we have a pull request that was created just two hours ago. It says it was by me; it's actually using my credentials, but it was created by a GitHub Action that downloaded istioctl, saw that there was a new version, and updated our control plane specs. So let's take a look at this pull request. The only thing that's changing here is the version of the control plane, and you can see that from the diff. We can also see that there are two automated checks that have run for this, and one that seems to still be running after... oh, it's only been six minutes. Oh, because of the changes pushed by the bootstrap.
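Since the exact action used in the repo isn't named here, the following is only a hedged sketch of what a workflow step along those lines might look like: it pulls istioctl onto the runner's PATH with the official download script and lints the manifests with istioctl analyze. The trigger, paths, and pinned version are assumptions for illustration.

```yaml
name: istio-analyze                  # illustrative workflow name
on:
  pull_request:
    paths:
      - "istio/**"                   # placeholder path for the Istio manifests
jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Download istioctl
        run: |
          # Official Istio download script; the pinned version is illustrative.
          curl -sL https://istio.io/downloadIstio | ISTIO_VERSION=1.13.4 sh -
          echo "$PWD/istio-1.13.4/bin" >> "$GITHUB_PATH"
      - name: Analyze Istio config
        run: |
          # Offline analysis of the manifests in the repo (no cluster needed).
          istioctl analyze --use-kube=false --recursive istio/
```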
Okay, I'm not going to wait for the end-to-end test to pass because I know we've got a lot to get through, but normally, obviously, you'd want all of your checks passing. We can see that all of the analyzers passed, so our config looks good. I think we can feel good about merging this pull request, so I'm going to go ahead and do that.

So what we've been doing is running Kubernetes kind inside the GitHub Actions runner, and in there we are slimming down Istio, bringing the requests, the limits, everything down, so we can run everything in there and make sure that the new version of the control plane works. And we do that with Flux: Flux is deployed inside kind, pulls the branch from the pull request, and runs an end-to-end test suite for all the Istio components. Before everything goes out to our staging cluster, we do some real testing, and only then to the production cluster.

So we're going to revisit the cluster in a minute to show you that it actually worked, but it does take a few minutes, so I'm going to move on, and at the beginning of the next demo we'll circle back to that. What went well here? Our control plane is now kept up to date, as long as we're willing to approve pull requests on about a monthly basis. That's not too bad. We got everything tested before we ever pushed to production, so there's a pretty high degree of certainty that this configuration ought to work really well. And we got analyzers to run, so if we're changing our Istio config, we know before we break production that we had a typo in our virtual service. Which, by the way, if you saw in the pull request, there was one called bad sidecar: I very specifically tried to merge some bad config, and it's blocked by the analyzer.

A couple of disadvantages, though. What version of the proxy, the data plane, is going to be running in my cluster? Well, the answer is: it depends. Yes, yes, both, maybe. Or one, or the other; we don't know. The version of your proxy in Istio is, by default, determined when the pod starts up. And that's a little bit problematic, because it means that all the way through your test infrastructure, your unit tests, your integration tests, your staging environment, you don't actually know which proxy you're going to be running with. And it's a fairly important component; it's something you want to test against. So that's not great. Also, this doesn't yet use revision-based upgrades. That is a best practice across the Istio project that I just didn't have time to bake into this repo; look for that coming in the next couple of months.

So let's talk a little bit more about this problem with proxies. In the software development life cycle, the proxy is defined at the very end. It isn't even necessarily the result of a Kubernetes deployment. You could have a deployment that's running an old proxy; you upgrade the control plane; you scale that deployment up; the new pods will get the new proxy, and the old pods will still have the old proxy. It's very unpredictable, and as you can see, that reduces determinacy in the whole process. It's very uncontrolled. And the way to fix that is to run global restarts on your cluster, which I don't really want to recommend that you do in production. So let's talk about how we can make this better. What we want to do is shift the declaration of that proxy, the specification of the proxy, left in the software development life cycle, to the point of GitOps.
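As a hedged aside, one way to see that indeterminacy in a live cluster is simply to ask Istio which proxy versions are actually serving traffic; after a control plane upgrade without restarts, the data plane is often a mix of versions. The version numbers in the comment are just an example.

```sh
# Summarize control plane and data plane versions; with a mixed fleet this can show
# something like: data plane version: 1.13.3 (4 proxies), 1.13.4 (2 proxies)
istioctl version

# Per-pod view: the VERSION column reveals which pods are still running the old sidecar.
istioctl proxy-status
```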
GitOps needs to happen at the test environment, so that from test to release to deploy to operate, you have a consistent piece of software that you're operating across all those environments. Your test results are meaningful. And then, of course, your rollbacks are just a git revert: if something goes wrong, you know how to get back. It's not quite so simple if you're using sidecar injection in Istio. How do you define a sidecar in Git? Well, it turns out there's been a command to help you do this for some time in the Istio community, called kube-inject. It will take a YAML that could be a Deployment or a Job or a ReplicaSet or a StatefulSet, and it will output a modified version of that resource that includes a sidecar. Pretty simple. And so we've baked that into a GitHub Action.

So what we've already seen is: a new version of Istio was available, so we upgraded our control plane YAML in Git. That created a pull request; we had kind tests and our analyze tests; we approved the pull request; and it got committed and deployed to Kubernetes. But that commit now, hopefully we're going to see this in just a second, has triggered a second GitHub Action, which saw: oh, your control plane version no longer matches your data plane version, let's go ahead and fix that. And so we are going to see a new pull request that updates our workload YAMLs. Actually, it doesn't even update the workload YAML itself; it updates a file next to the workload YAML, which is going to go through exactly the same test and release process. Let's take a look.

Moment of truth here. Pull requests. All right, update Istio sidecar to 1.13.4. Again, our end-to-end test hasn't passed. It's extremely thorough: it actually does a rollout that goes through the entire canary process, and you don't want canaries to run terribly quickly. But for that reason, I'm not going to stand around and wait for the end-to-end test to pass today. I want to upgrade my data plane, so let's go ahead and confirm that. Now I'm going to double back and show you that our control plane actually did upgrade. What was the command there? istiod. I didn't show you the before, which was a mistake. The before... oh, actually we did. Here, we were at 1.13.3 on istio-base before, and now we're at 1.13.4. Do a flux get. Run the... am I in the wrong cluster? No, run the command again. Let's see what Flux does.

Okay, so it has only upgraded the CRDs so far, and yeah, we can do a --wait... oops, can't type it... wait... no, --watch, sorry. I wrote the command and I don't remember the flag name. Okay, so now what Flux does is slowly move through all the releases. It says: okay, I've upgraded the CRDs, and it waits for all the CRDs to be reconciled by the cluster API. Now it says: okay, I've upgraded the gateway, which is the custom resource of the gateway; that worked; and now what is it doing? Installation in progress, still the gateway. So you see the revision, how it moves from three to four. And for all these upgrades that Flux is doing, once it has upgraded something, or started to upgrade something, it can let you know through Slack channels, Microsoft Teams, and so on. So you can actually see those if you have notifications set up.

Okay, so, 1.13.4, yay. And the good news about a nail-biter like that is it has given me enough time that I think I should be able to check our data plane and see the same thing updated. We should be at 1.13.4. So we've gone ahead and done a live upgrade of Istio, separated between the control plane and the data plane.
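For reference, the kube-inject step described earlier boils down to a single command. This is only a minimal sketch with placeholder paths; in the demo repo it is wrapped in a GitHub Action rather than run by hand.

```sh
# Render the workload with the sidecar baked in, pinned to the Istio version that
# this istioctl binary corresponds to. The output file is what gets committed to
# Git and applied by Flux. Without cluster access, the injection template and mesh
# config can be passed explicitly (for example via --injectConfigFile and --meshConfigFile).
istioctl kube-inject \
  -f apps/backend/deployment.yaml \
  -o apps/backend/deployment-injected.yaml
```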
Also, one added benefit of having the proxy defined actually in the deployment spec is that Flagger will do progressive delivery across your deployment. So if I check Flagger's canaries, and I think it hasn't been too long, they should still be in progress: on our backend service, we currently have 5% of our traffic going to the new version of the proxy and 95% of our traffic going to the old version. And Flagger is going to periodically check with Prometheus and ask: is the service still healthy? Shift a little bit more. Is the service still healthy? Shift a little bit more. And it's going to be really boring, which is one of our favorite things to be in Istio.

So let's take a look at our advantages and disadvantages now. Our proxy is now maintained up to date within a semver range. We can automatically apply updates if we want to: patch updates can roll out without a pull request, or they can be gated behind a pull request. It just depends on your operational posture around security and safety, on how paranoid you want to be. I want to be pretty paranoid, personally. Canaries are controlling the rollout, and we have automated rollback: if something is wrong with that proxy, the 1.13.4 proxy will stop executing in the cluster, the canary will fail back to 1.13.3, and our cluster will stay up even if our Istio upgrade was not great.

We still haven't fixed that revision-based problem. But once we do, this is what the new workflow should look like. When a new version of the control plane becomes available, we don't replace the control plane YAML; instead, we create a new revision. This is what you all should be doing if you're upgrading Istio; it's much more secure. It means that they'll run side by side: two control planes, two different versions. We'll go through the same pull request process, and then we will move tags. Tags are like symlinks for your revisions. So you might have a first wave, a second wave, a third wave; the first wave is going to get updates first, the second wave second. What we're going to do is find the first-wave tag and say: first wave used to point at 1.13.3, now first wave points at 1.13.4. And all of those proxies, as soon as that PR finishes, will get moved over and will get canaried just as in the past. (A rough sketch of the revision and tag commands is included after this section.) So we think that this will be a good way to keep Istio up to date in production. It's still very much a work in progress, and there will be continued updates to it over the coming months, but you're welcome to fork the repo and kick the tires on it.

If there is one thing that you leave this room with today, I would like it to be this, and I said Istio here, but I think we can just replace that with service mesh: automate your service mesh upgrades. Please, please, please. Or consider paying a vendor to do it. If this is overwhelming to you, Google and many other vendors in this space will take care of your upgrades on your behalf, and that is a totally fine way of doing things. If you're not willing to do either of those things, you need to budget engineering time to upgrade Istio. Don't take an open source project, make it core to the application that you're running, and then forget about it with no operations budget for it. Maybe that's the 88% of Istio users. Yes, yes. If you're upgrading it, you are now in the top 12%, so congratulations. We want to widen that quite a bit. With that being said, here are some links where you can find more information, and I think we actually still have a few minutes for questions; didn't expect that.
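As a reference sketch of the revision-and-tag workflow just described, the commands below show the general shape. The revision name, tag name, namespace, and versions are illustrative assumptions, and in the GitOps setup the Helm install would be a second HelmRelease in Git rather than a manual command.

```sh
# Install the new control plane revision alongside the existing one
# (assumes the Istio Helm repo has been added under the alias "istio").
helm install istiod-1-13-4 istio/istiod -n istio-system --set revision=1-13-4

# Namespaces opt into an upgrade wave via a stable tag instead of a concrete revision.
kubectl label namespace apps istio.io/rev=first-wave --overwrite

# Re-point the "first-wave" tag at the new revision; pods pick up the new proxy as
# they are rolled, and Flagger can canary them as before.
istioctl tag set first-wave --revision 1-13-4 --overwrite
```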
Really nice talk. We've been upgrading Istio religiously with the Istio operator. How would the Istio operator work together with Flux and Flagger? Thank you. You're asking whether they work together with Flux and Flagger? I mean, if I've been maintaining it with the Istio operator, how would I use Flux and Flagger? I can answer that. So the first version of my repository was using the Istio operator, and that works great, because Flagger and Flux can work with any kind of custom resource, right? It was the same experience. But, as the Istio team will tell you, all the operator does inside is essentially the same thing: it runs a Helm upgrade and so on. And Flux has a dedicated helm-controller which does all things Helm way better than wrapping it up and hiding it. So you have the Helm experience, you can debug it with Helm and so on. It adds more observability to the whole process. Like, you run flux get helmreleases and you can actually see: oh, it's stuck at the CRDs, or there is a problem with the gateway. The operator will just hide all of that under a single upgrade operation. That being said, if you're happy with the operator, stick with it; and if you look at the repo from about three weeks ago, that is what we were doing. If you're not currently upgrading and you're looking for a way to do that, probably don't look at the operator. We would recommend that new users go to the Helm charts. We're still going to keep maintaining and fixing bugs in the operator, but the future direction is definitely Helm.

Any other questions? All right, so we are at the top of our... okay, one last question. Hello, nice talk, thank you. I just have one question. Would you say that this GitOps approach is better than doing it with infrastructure as code? Because sometimes, for cluster-wide or infrastructure-wide components, we usually put these installations in Terraform. What would you say is the advantage of having it here in GitOps? Why would you use Terraform if you are deploying on Kubernetes? I'm asking you. There's a Kubernetes provider, and it's really good. I guess I'll probably have a little bit of a different opinion than Stefan does on this particular topic. For my part, I want to make sure that you're comfortable doing it in the way that you do your other upgrades. Istio should not be something that is unique and special in your environment. It should upgrade the same way that every other piece of infrastructure you have does. So if Terraform is your tool of choice for all of your infrastructure, there is Istio... is it modules? I'm not a Terraform person. There's Terraform stuff for Istio that you can use, and you should look at investing there. But if you're already using GitOps, or if you're not using anything, then I would encourage you to look at GitOps. So the main difference from my perspective is the fact that with Terraform you run it once, but Flux will keep monitoring what's happening. The difference between GitOps and just running something once after planning is that GitOps does reconciliation all the time. So if something breaks, or someone goes on the cluster and does a kubectl edit of some pilot deployment and that goes wrong, Flux will just undo it automatically, because Flux does not allow divergence, a drift, inside the cluster. I'm guessing you don't run terraform apply every couple of seconds.
Well, Flux can do that, because it's right there in the cluster and can monitor all the changes that were done outside of Git, and it will correct that change and let you know: hey, someone modified Istio, I'm putting it back the way it was described in the Git repo. You're welcome. That was it, right? All right, so we are at the end of our time slot. Thanks for joining today, and a great presentation and demo, Mitch and Stefan. Yeah, I hope you guys have a safe flight back and enjoy the rest of KubeCon.