I'm a Golang developer in the fall of 2019. You probably know of me. I was a Noogler at the time, and I had been working on the Istio project for about a year. There was a dependency library I needed to update for a fix I was pushing out, and the owner of the library was a Googler, so he was happy to give me ownership access to the repo, and I went ahead and pushed out version 1.0.4 of the pflag library. Yes, that pflag library. The one, let's see if we can zoom here, what have we got, with 213,183 dependents on GitHub.

So I pushed out my change, and as luck would have it, it contained a bug. But that's okay, because I followed best practices and marked my push as a pre-release on GitHub, that beautiful little checkbox you can tick when you want to tell the world a release isn't ready yet. Little did I know, however, that the Go module system that pulls in dependencies for builds knows nothing about GitHub's checkbox, and so on September 17th, 2019, every automated Go build on the internet pulled in my broken change and failed to build.

Now let's take a look at what projects I broke. This is the CNCF project velocity graph, so up and to the right is where we wanna be looking. We've got Kubernetes, yeah, they use pflag. OpenTelemetry, yep. Cilium, Argo, yes. Istio, Meshery, Knative, gRPC, NATS, yes, yes, yes. As a matter of fact, I think the only project up here that I'm certain does not use pflag is Envoy, and that's because it's written in C++, so pflag isn't really available to them. So these are the projects that I broke. And of course, I didn't know that I broke them, because, like a boss, I pushed my changes, published my release, and went home for the night.

You can imagine how I felt the next morning when I got into work to hundreds and hundreds of these: owners of the various projects I had broken, very kindly requesting that I improve my release processes and maybe fix my change. Well, I'm not one to be deterred. I immediately opened up VS Code, got to work, fixed the problem, and pushed it up. And I'm smart, I don't want a public record of my failures on the internet, so I pushed it out as v1.0.4 again. Just cover that right up; no one needs to know that I broke the internet. That caused this issue, which no one had ever seen before: a new checksum system had been rolled out to Golang two weeks prior, and I was the first person in the world to break it, which was pretty exciting. Everyone got this terrifying security error saying that someone was doing something fishy with their dependencies. Well, that someone was me.

What went wrong in this story? Was it the lack of automated testing, which would have given me confidence that I was not breaking the internet when I published my release? I mean, I would have loved that; that would be great. By the way, pflag's test coverage is still that bad, so if you're interested in getting started in open source and you like writing tests, you can help us out there. Was it the GitHub/semver mismatch, the fact that GitHub called it a pre-release and Golang didn't? That's a problem that still exists today. By the way, you new software engineers: don't trust the checkbox. But I really don't think that's ultimately the cause either. It was also pointed out that all of those projects I broke had not frozen their dependencies; they were pulling in fresh versions of their dependencies on every automated build. But that was the standard practice in the Golang ecosystem in 2019.
I can say that I am partly responsible for the fact that everyone has a package lock in their Go project these days. Not responsible in the way that I would like, but responsible nonetheless. I think what really happened, the true root cause we need to talk about, was that there was no automated, structured way for me to get my code safely from source to production, or release in this case. I should not have been going into the UI and checking boxes, saying, sure, ship it, I feel like naming it this today, and I feel like making it a pre-release today. This should have just been a pipeline that picked things up, pushed them out, and made them available. So today, in the next 25 minutes, my colleague Christian and I are going to show you how, if you're operating Istio, you can build a pipeline with ambient mode and Argo CD that will keep you from breaking yourself the way that I broke the internet. Let's get started.

My name is Mitch Connors. I'm a senior principal engineer at Aviatrix, where I'm also a product manager owning container networking and platform engineering. I've been on the Istio project for five years now, where I'm UX lead and on the TOC, and this year I'm serving as a CNCF ambassador, which has been a lot of fun. Christian, why don't you introduce yourself?

Yeah, so my name is Christian Hernandez. I am the head of community over at Akuity, and I'm an Argo project member. I'm also part of the marketing SIG at the Argo project, which takes care of things like putting on ArgoCon, which is going on on the other side of this building. I'm also a maintainer of OpenGitOps, and I'm a guitar player and a Guinness enthusiast. So if you guys are tired of talking about tech and you want to talk about this other stuff, we can have a hallway track for that as well.

So this isn't just a story. In this case, we actually built something you can use. You can go ahead and scan the QR code and use it as a reference architecture, and I believe you're trying to get it into Istio? Yes, this will actually point you to the Istio repository, thanks to a few of you who approved my pull request Friday night, way too late. This is the reference architecture we're gonna be showing off today. You can pull it down from Istio, make use of it yourselves, and follow along right now in the code we're gonna be using. I'll give you a warning: I went a little bit QR-code crazy. My kids taught me how to use them this week, so keep your cameras and phones ready.

All right, where are we going? Christian is gonna share with us a little about how GitOps, Argo CD, and platform engineering are all related concepts, and how they matter for the Istio project. Then we're gonna talk about what has changed in ambient mode about upgrading Istio: why is it easier to operate Istio in ambient mode than in sidecar mode? We're gonna talk about things you should consider when planning your Istio upgrades, and then we're gonna bring it all together with a series of three demos on a live site on the internet. So let's get started.

All right, so I'm gonna talk about the GitOps principles. Now, I could do a whole talk, a whole conference, on the GitOps principles and what they mean in depth, but I'm gonna go kind of high level here, and we can do a hallway track if you want me to go deeper on any one of these things.
But the story behind GitOps and the GitOps principles, if you wanna learn more, is at opengitops.dev. A few years ago, a lot of practitioners at the time, members from the Argo community, members from the Flux community, people from Red Hat, AWS, all these interested parties, got together and tried to define what GitOps is, right? At that time, Kubernetes was kind of just blowing up, and people were trying to operationalize it, so we came up with these principles of what it means to be GitOps.

The first principle is that it needs to be declarative: a system managed by GitOps needs to have its desired state expressed declaratively. Even though we had declarative infrastructure in Kubernetes, people were still doing things imperatively, and we said, no, you have to leverage the declarative nature of Kubernetes. Which brings us to the next principle: it needs to be versioned and immutable, meaning the desired state is stored in a way that enforces immutability and versioning and retains a complete history. This is where the Git in GitOps comes from, because that's what Git gives you, but other stores are fine to use as well, like S3; as long as it's versioned and immutable, you're following the principles.

Number three: pulled automatically. Here I do wanna go a little bit deeper, maybe an inch and a half. Pull versus push: you're probably thinking, does it matter whether I apply the manifests by pull or by push? That's not what we're talking about. We're talking about the declarations themselves, the manifests, i.e. the YAML, needing to be pulled into the system. The reason we're sticklers on that is to differentiate it from an event-based type of workflow, or from webhooks. Not to say that you're not gonna be using webhooks in your GitOps workflow, you absolutely are; it's just that using solely webhooks isn't GitOps, because it needs to be, number four, continuously reconciled: software agents, the two most popular being Argo CD and Flux, continuously observe the system and attempt to reconcile the running state with the desired state. So again, I can go on and on about GitOps: opengitops.dev, join a community meeting there, or pull me into the hallway track, I can do that as well.
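To ground those four principles in something concrete, here is a minimal Argo CD Application sketch. The repo URL, path, and names below are placeholders of ours, not anything from the talk, but it shows desired state that is declarative, versioned in Git, pulled by the controller, and continuously reconciled:

```yaml
# Minimal Argo CD Application sketch (repo URL and paths are
# placeholders). Desired state is declarative YAML, versioned in
# Git, pulled by the Argo CD controller, and continuously
# reconciled via automated sync.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bookinfo
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo.git  # placeholder
    targetRevision: main          # versioned, immutable history
    path: application             # folder holding the manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: bookinfo
  syncPolicy:
    automated:
      prune: true                 # remove resources deleted from Git
      selfHeal: true              # revert drift: continuous reconciliation
```

With selfHeal enabled, even an out-of-band kubectl edit gets reverted to what Git says, which is the reconciliation loop in action.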
So some of the roles we're gonna talk about in this talk: the platform engineer provides the actual infrastructure, provides the system, and their customer is the application developer. The internal engineer is their customer, and those engineers don't necessarily want to be geniuses in platform infrastructure. A lot of engineers don't care; they just wanna write their code, consume templates, and basically utilize the platform. And the platform engineer, really all they care about in their world is that they wanna patch vulnerabilities, they wanna keep the system up and running, they wanna keep the lights on, and they wanna upgrade without disturbing any of the engineers. That's their primary job, their primary goal. The app developer doesn't wanna learn service mesh. When service mesh first came out, when Istio first came out, it was this big thing, and everyone thought developers were gonna love it. It's actually more for the platform engineers, right?

So they don't wanna learn service mesh, but what they do wanna do is leverage it. They wanna consume all the things that service mesh provides; they don't necessarily wanna manage it. Which kind of leads me to the Argo project. I'm not going to go too deeply into this; just know that the Argo project is a suite of tools that operationalizes Kubernetes. There's Workflows, Events, CD, Rollouts. If you're in the Istio world and you're working with platform engineers, you'll hear a lot about Argo CD and Argo Rollouts; that's kind of where it plays in. That's what the Argo project aimed to do when it was first created: operationalize Kubernetes in a GitOps way. So Argo CD really caters to both platform engineers and developers. It has a feature-rich UI, the health monitoring and the multi-cluster, multi-tenant capabilities that platform engineers want, and the advanced deployment patterns, extensibility, and integrations that developers want. A single UI for both platform engineers and developers was the end goal for Argo CD, and it serves both teams. So I'll hand it back to Mitch to talk about ambient mesh.

Yeah, so you all heard from John this morning about ambient mesh and its architecture. I'm not gonna dive deep into that; if you missed it, check out the video, it's a great talk. I do wanna talk about what our objectives were in building ambient. They were threefold: we wanted to make onboarding easier, we wanted to make operations easier, and we wanted to make resource utilization on your cluster better. For this talk, we're only going to be talking about that middle goal, reducing operational friction, but that's not all there is to ambient; I wanna make sure that's clear before we get started.

We've had sidecar mode in Istio for quite some time now, a lot of us have gotten pretty accustomed to it, and a lot of what we're showing off today works in sidecar mode too. For instance, both modes support multiple versions of the control plane in the cluster. We call that a canary upgrade of your Istio version, and you can do that in sidecar mode. You control which data planes are talking to which version of the control plane with tags and revisions; we're gonna dig into those quite a bit in the next couple of demos. They are supported, sort of, in sidecar mode. They work in sidecar mode, but only at pod startup time. You start a pod, it gets the version its tag points to; you later change what that tag points to, and the pod doesn't care. So you may think that you've upgraded all of Istio in sidecar mode, you've applied all of your Helm charts and everything else, only to find out that the CVE you're urgently trying to patch is still actively being exploited, because none of your Envoy proxies have upgraded. You need to follow a final step, which is to restart all of those pods to allow them to be re-injected. That's completely non-declarative; we call that an imperative action, and it makes Istio in sidecar mode rather difficult to operate in a CI/CD or GitOps sort of scenario.
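For reference, here is roughly what that sidecar-mode canary looks like; the namespace and revision names are hypothetical. The catch described above is that this label is only consulted when a pod is injected, at startup:

```yaml
# Sidecar mode: pin a namespace's sidecar injection to a specific
# control-plane revision (names are hypothetical). The label is
# read only at injection time, i.e. pod startup.
apiVersion: v1
kind: Namespace
metadata:
  name: bookinfo
  labels:
    istio.io/rev: "1-18-5"
```

After repointing that label, something still has to restart the workloads, for example kubectl rollout restart deployment -n bookinfo, and that restart is exactly the imperative step in question.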
As a matter of fact, we heard about this earlier this morning in the first talk, from, I lost the name of their company... DevRev. They talked about how they built something that would automatically detect: oh, the sidecars are at this version, and I see a control plane at that version, so I'm gonna nuke the sidecars one at a time until they all get restarted. That's a great solution to the problem. That's a solution I've implemented in production in the past. I'd really love for nobody else to ever have to implement that solution again. What we need to do is implement declarative upgrades for the project, and I think ambient is going to give us that.

How does ambient give us that? Well, as we mentioned, we've divided our data plane into two components: our Layer 4 component, the ztunnel, and our Layer 7 component, which we call the waypoint. The waypoint is still implemented by Envoy. You all know and love Envoy already, I'm sure, so you're familiar with the fact that, oh, Envoy isn't written in Go, sorry, I've got that slide wrong, it's written in C++. It's extremely complex because it's so powerful. You can run WebAssembly in your Envoy. You can do a million different load-balancing profiles in your Envoy. You can do just about anything; I've seen people literally embed their app into Envoy, so it's not actually a proxy, it's just running their application server in process. That means the stability of Envoy is comparatively low. There are a lot of moving parts, a lot of features being added all the time, and if you've watched the CVEs out of the Envoy project, you know they tend to be related to Layer 7 functionality. Layer 4 has always been much simpler; there's just not a lot going on at Layer 4. So what the Istio project has done is move our Layer 4 processing into a purpose-built binary, written in Rust, that doesn't do much. It's pretty simple, and that means it's extremely stable.

For instance, look real quickly at the HTTP/2 Rapid Reset bug that you've all probably been dealing with in production for the last four months. In Envoy, we got that patched as the zero-day was announced and pushed out to the community; way to go, Istio release managers. In the ztunnel, it was patched, I think, in May, before anyone even knew it was a real vulnerability. It just looked like a bug in the code, and it got patched and went away. That's what you can expect from your ztunnel, and that's why the ztunnel runs as a single version per cluster: one version of your Layer 4 data plane, with one instance of it per node. It's that stable, it's that efficient. You can trust it. Envoys, you shouldn't trust. You're gonna need to run a few versions concurrently, and when you go to upgrade them, you're gonna need to take baby steps: upgrade this thing and see what happens, then upgrade that thing and see what happens. With the ztunnel, we expect to provide the kind of stability where you can just say, the cluster was on A, now it's on B, done, I can go home, like I did on Friday, and hopefully it goes better for you than it did for me.

There are a few other things we're operating here. Just like in sidecar mode, you can run many control planes, many istiods, per cluster, and you can have many tags and revisions referring to those control planes, but the CNI and ztunnel run only once per cluster. Let's take a look. If you've opened that GitHub link, you've probably seen a whole lot of YAML, so this is your roadmap to the YAML. If it's a little bit overwhelming, I apologize, but there are only a few things you really need to pay attention to. One: meta-application.yaml. This is our bootstrap file. It's the only thing you need to kubectl apply once you have an Argo instance up; everything else gets pulled in automatically through GitOps.
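That bootstrap is the classic app-of-apps pattern: the same Application resource as the earlier sketch, but pointed at a folder that itself contains more Application manifests. A minimal sketch, with the repo and path assumed rather than copied from the reference architecture:

```yaml
# App-of-apps bootstrap sketch: kubectl apply this one Application,
# and Argo CD pulls in every Application manifest in the folder it
# points at. Repo URL and path are assumptions, not the repo's
# exact contents.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: meta-application
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/istio/istio.git   # per the talk; exact location assumed
    targetRevision: master
    path: samples/ambient-argo                    # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```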
So there's your bootstrap, and we have two folders that we're bootstrapping, primarily. We have our application folder: this is just our sample Bookinfo app, and our app dev owns this space. He's running three different gateways. Two of them are waypoints, which we just talked about, for Layer 7; one of them is an ingress gateway that works pretty much the same way it does in sidecar mode. And then we've got our Istio folder, which is owned by our platform engineer, and it's a little bit more complicated. She's got to manage the CNI, the ztunnel, the set of control planes that we run, which is gonna be more than one, as well as all of our tags and revisions, and then some extra stuff in case you wanna show off your demo; I'm not gonna show off that bit today.

Let's talk a little bit about how tags and revisions work. We've been sort of flirting with this concept, and I know that not everyone has made use of them, even in sidecar mode. A revision is just a name for a control plane. When you install control plane 1.18.1, it will need a revision name, and that name should not change, ever, and you should not reuse that name, ever. That is that control plane's name; it is stable and immutable. You can actually use it in the istio.io/rev label on your objects, but that's not what we're gonna demo, because it refers to a specific version, and that's not actually how we want our apps to interact. Tags, however, are the opposite: they're completely mutable. They're symlinks to your revisions. In this case, we're using the stable tag, and when we wanna upgrade this gateway, we're gonna go update the stable tag and say, it was 1.17, now it's 1.18, and everything just upgrades. All right? Everybody with me there? Okay.

So this is our layout right now. All three gateways are either explicitly using the stable tag, or they're not using any tag, in which case they get default, which happens to point to the same place as stable: istiod 1.18.3. We also have a rapid tag available at 1.19.1, but nobody's making use of it. Now, these versions, you probably know, should not be in your production right now, because of that Rapid Reset bug. So let's get to our first demo: we wanna go ahead and fix that Rapid Reset problem. I'm sure you're familiar with this vulnerability.

Oh, and by the way, I told you this would be live on the internet. Here is the QR code link to take you to my little Locust site. This is a load generator that's actively running against our application, so if the red line climbs and the green line falls, you all will know that I have done something wrong, and you'll probably know it before I do. All right, here's Locust right now; we're looking pretty good. Notice I'm still on Pacific time, so the timestamps look a little off, but everything's green. And we're gonna go into Argo so I can show you the particular version of that gateway. We'll just pick one of them; we don't need to see all of them. Our Bookinfo gateway is running Istio proxy version, I know this is pretty small here, but if you can't see it in the back, 1.18.3. That's our vulnerable version.

Here's our pull request for patching that. Oh, sorry, we have a two-step process, in the person of our platform engineer. First, we need to deploy the patched control planes; then, we need to make use of them. You don't wanna do both at the same time, or proxies will spin up looking for a control plane that doesn't exist. So step one: put 1.18.5 and 1.19.3 out there. Let's go ahead and merge that.
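As a sketch of what step one might look like with Istio's Helm charts: installing an additional, revisioned istiod without touching any tags yet. The revision name below is ours, and while the istiod chart does expose revision and revision-tag values, treat the exact fields as assumptions rather than the repo's actual files:

```yaml
# Sketch: Helm values for one additional control plane.
# helm install istiod-1-18-5 istio/istiod -n istio-system -f values.yaml
# The revision name is permanent and immutable; tags stay untouched
# in this first PR, so no workloads route to the new istiod yet.
revision: "1-18-5"
revisionTags: []   # step two, a separate PR, moves the tags over
```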
Demo gods willing, here we go. Yep, this is live. Let's come back here, and we should see... oh man, it's already there. That's how fast this thing is. 1.18.5 and 1.19.3 are spinning up; they're gonna turn green in just a second. In the meantime, we'll go ahead and get to our second pull request, and let's take a look at what that changes. Now we're taking our tags file. Our default tag was pointing at 1.18.3, our stable tag was at 1.18.3, our rapid tag was at 1.19.1, and we're just updating those to the versions of the control plane that we just deployed. This little shim here at the bottom is going away; you don't need to worry about it after the 1.20 release. It's just a patch for a bug in 1.18 and 1.19.

All right, let's go check out that gateway again. Oh, look at that: it's already spinning up the new version, and now it's tearing down the old version. We can open up this new version and scroll down to our image tag, which is way down here... not that far down... 1.18.5. We've been patched. Check Locust to see how we're doing. Looks like we got a little bit of a throughput hiccup here. Remember that this is still an alpha product; ambient should not be in your production cluster. Please don't put it there; I don't want to get that ticket on GitHub. No failures, though. It's just a little slow. Yep, and that'll pick back up as those proxies warm up. And that's our first demo.

All right, so what we just did was take those three proxies and upgrade them. We spun up the new versions of the control plane, and then we changed what the tags were pointing to, and we did all of that in the person of our platform engineer. What was our app dev up to at the time? We don't really care. He was out to lunch; he was watching the latest movie. He's not involved in patching Istio. And this is critical: your platform engineers need to be empowered to unilaterally take action on security vulnerabilities in production. They cannot depend on your app devs to do that, and likewise, we can't really interrupt your app devs for that.

All right, we're gonna take a minute to talk briefly about upgrade planning. We think there are two good ways to plan your upgrade in ambient mode: channels and phases. In the channels model, you always have two minor versions of Istio running in your cluster. When 1.20 rolls out, it goes to rapid, and 1.19 becomes stable. Then, three months later, the next Istio release comes out: rapid goes to 1.21, and stable goes to 1.20. The advantage here is that your app devs can actually choose between rapid and stable and get different features. If they want the latest and greatest, they can ride on rapid and live dangerously. If they want something that just works without having to think about it, they can stay on stable, and that's what they'll get by default. Also, by the time 1.20 gets into stable, it's had a three-month bake time, so you're pretty confident in it at that point. On the flip side, it's pretty complicated. You've got a lot going on in your cluster, a lot of control planes spinning up, and it might be a little much to explain to people.

If you don't like that, you can do a phased approach. This has as many tags as you like; I'm only demoing two here. You get a new release, you start with rapid, then you go to regular, then you go to stable, probably about a day apart from one another, so that the steady state of your cluster is that all of those tags point to the same version of Istio, but you get an ordered upgrade: I'm gonna upgrade this set of proxies, then this set, then this set, and you have control. You can also do phased strategies within channels, but that's even more complicated, and I'm not gonna demo it today.
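To make the channels model concrete, here is a hypothetical tags file in the spirit of the reference architecture's Istio folder; the schema is purely illustrative, not the repo's exact format:

```yaml
# Hypothetical channels-model tag mapping (illustrative schema,
# not the repo's actual file). Each tag is a mutable pointer to
# an immutable control-plane revision.
tags:
  - name: default
    revision: "1-19-3"   # workloads with no tag land here
  - name: stable
    revision: "1-19-3"   # three months of bake time behind rapid
  - name: rapid
    revision: "1-20-0"   # latest minor, for teams living dangerously
# Next quarter, one small PR: rapid -> 1-21, stable -> 1-20.
```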
So this is our app. We've got those three gateways that we talked about. You've already seen Bookinfo today, so I'm not gonna go into it really deeply, but it's important to note that everything is either explicitly on the stable tag or missing a tag altogether, and therefore on stable by way of the default. This is the current state of our cluster, and we are now ready for our second demo. In this case, our app dev wants to get the latest, greatest features, like we talked about; they're fine with unstable, and they wanna move to rapid. Let's see what that looks like.

All right. Oh, we did get a handful of failed requests in our upgrade there. All the more reason not to push this to production. I'll still consider that four nines, though; it was just a blip. Oh yeah, we got it. Here's our entire pull request: it's two lines. We had no labels, and we're adding labels for rapid. That's it. Let's commit it.

While he's committing that, I'd like to point out that this is all done through Git, right? Upgrading through Git, you get that audit trail: who did what, when, at what time, and who approved it. So doing this declaratively has those other advantages as well.

And Christian isn't talking up Argo CD enough: the explanation he just gave took twice as long as the actual upgrade of our proxy. It's updated. We're now on rapid, and I'm not gonna scroll around here long enough to find the version, because we're running low on time. Locust: we look good. All right, now we've done this as our app dev, with our platform engineer, once again, completely uninvolved in the process. The app dev doesn't need to consult them. They can see what the rapid tag points to, go to istio.io, see what features are available, and say, that's what I wanna use. That's it.
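For a sense of scale, that whole pull request is a label. Here is a sketch of what it might look like on the reviews waypoint, assuming the Gateway shape ambient used around Istio 1.19; treat the resource details as assumptions:

```yaml
# Sketch: opting a waypoint into the rapid tag is just one label.
# Gateway shape per ambient circa Istio 1.19; details assumed.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: reviews-waypoint
  namespace: bookinfo
  labels:
    istio.io/rev: rapid   # previously unset, so it followed default
spec:
  gatewayClassName: istio-waypoint
  listeners:
    - name: mesh
      port: 15008
      protocol: HBONE
```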
All right, so here's what our tags look like now. We do have the reviews waypoint on the rapid tag, getting istiod 1.19.3, but now we've got our last demo, and I know we're running short on time. We wanna do a full minor version upgrade. This is the one that's tough. This is the one that everybody struggles with, so we wanna make sure we prove that this thing works. We're gonna do it in two pull requests. The first one is a multi-part pull request; it's the largest you're gonna see, and the second is a little bit smaller. Let's get started. Oh, sorry, I guess I'm gonna illustrate it first and then do the demo. So what we're gonna do is deploy the 1.20 beta control plane and move our tags up to istiod 1.19 and 1.20. That's both of those pull requests combined; we're gonna do it in two separate steps here.

All right, first step: this is our CNI, going from 1.18.3 to 1.19... oh, that last digit should be a five, but oh well. Then this little bug-fix shim moves to 1.19.3, and we add our 1.20 control plane. Pretty simple. Still our largest change so far, at, I think, four lines. We'll merge that, come back here, and watch, and we're gonna see 1.20 show up very soon... oh, it showed up before I got there. All right, 1.20 is spinning up. That's looking good. It's gonna become healthy in just a moment so we can move on to the next and final pull request of our demo, with 10 seconds left on the clock. Yes, yes. Doing good. Our default tag gets pointed to 1.19. Our stable tag gets pointed to 1.19. And our rapid tag gets pointed to 1.20. By the way, this is the beta. Again, don't do this at home, folks: it's a beta release of an alpha product, so that's even worse. Yeah, but we're doing it, and we're gonna see what happens.

Let's go check out our gateways. Ooh, I got here before they actually spun up, so you can see the new pods starting, those blue boxes. As they become available, the green boxes will turn blue and start tearing down, and we can check out our reviews gateway, which I think is the one that should be on 1.20. Let's have a look. All right, there it is: 1.20-beta.0. We have gone ahead and done our minor upgrade, with a little bit of a hit on latency; we might even see a handful of failed requests.

Oh, by the way, there's one extra demo somebody mentioned that I ought to do, and that's the oh-no-this-broke-production demo. Here's how we unbreak production: we click revert, create. I know that's probably too many clicks; we should talk to GitHub about GitOps needing fewer clicks for rolling back. Four whole clicks. All right, now we're gonna start immediately rolling back, and we come here... there we go. Now we're rolling production back off of those versions. Production should be fixed about 10 seconds after we noticed the issue.

So that's how you can upgrade Istio ambient using Argo. I hope you can see how this gives you a lot more power than what we had in sidecar mode, and I see that I'm not currently on my slide deck. You can check out the repo with the link on the left. We would love to hear feedback from you all on our talk at the link on the right, and I doubt we have time for questions. What do you think, Zach? Okay, we've got time for one or two. All right, well, in that case, Christian and I are gonna be around for the hallway track. Feel free to hit us up. Thank you all.