Well, super excited to be here, and thank you for joining us for our talk. This is Argo Rollouts from the trenches. My name is Katie Lampkin, and I am a PM at Intuit. I'm Harriet Lawrence, I'm a PM at Red Hat. And today, we are gonna talk about Intuit's journey migrating to Argo Rollouts, challenges that Red Hat customers have encountered when working with Argo Rollouts, and our combined takeaways and learnings that can help you be successful when migrating to Argo Rollouts. So let's get started, shall we? I wanna do a really quick recap of what Argo Rollouts is. It's a Kubernetes controller that provides advanced deployment capabilities, like canary deployments, blue-green deployments, and A/B experimentation. We also have complex traffic management, and we have analysis that allows for automated promotion and automated rollbacks. So at a high level, how does it work? What is the flow? As you see in the diagram here, you start your deployment of the application, and if you're utilizing the canary pattern, you can send it to just a subset of users. Now, your automated analysis is going to be executing in the background, and it's gonna be querying your data store. Your data store, for example, can be something like Prometheus. We send Prometheus a query and we say, hey, how are my metrics doing? How are the health metrics that drive the health of my application performing? Do I meet my thresholds? If I do, great, let's continue on with the deployment. It will check that at each step of the deployment, and as things continue to move along successfully, it will continue on with the deployment until it successfully completes. As soon as my metrics don't look good and they fall below my thresholds, we will redirect all traffic back to the stable deployment. Now, what do you achieve by using Argo Rollouts and these complex deployment patterns? Well, you go from slow and tedious rollbacks, often with a long mean time to recovery measured in hours, to a fully automated rollback in minutes. You go from manually monitoring your metrics and your health checks to having automation in place that is continuously monitoring that for you and evaluating the status of your new deployments. And overall, you are reducing your impact to customers when things go wrong. So let's talk about Intuit's migration. Where did we start? Intuit had services that were running on bare-bones EC2. We had some services that were deployment objects on the Intuit flavor of Kubernetes. We had some early adopters of Argo Rollouts that were utilizing blue-green deployments, and some were utilizing canary deployments, but not having a standard strategy made it complex for our engineers to manage. So we knew we had to have a target state. What is our target state? Our target state is for 100% of our services to be on Argo Rollouts utilizing canary analysis. You may wonder, why canary analysis? Well, for some of the reasons that I stated before: canary allows for reduced risk of exposing customers to change-induced incidents via progressive delivery. With canary's stepwise deployments, you can start by sending a small subset of your traffic to the canary deployment. You can analyze those metrics via Argo Rollouts, and if something goes wrong, you've only affected that small percentage of production traffic. Now, we understand that this goal of 100% is a stretch. There may be exceptions to this rule, but we got to go big, right?
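To make that flow a bit more concrete, here is a minimal sketch of what a canary Rollout with automated analysis can look like. The names, image, weights, and durations are hypothetical placeholders rather than anything from Intuit's actual setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service                  # hypothetical service name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:v2   # hypothetical image
  strategy:
    canary:
      steps:
        - setWeight: 25             # send 25% of traffic to the canary
        - analysis:                 # gate the next step on automated analysis
            templates:
              - templateName: success-rate   # hypothetical AnalysisTemplate, sketched later
        - setWeight: 50
        - pause: {duration: 10m}    # soak before going to 100%
        - setWeight: 100
```

If the analysis step fails, the controller aborts the update and shifts traffic back to the stable version, which is the automated rollback described above.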
So let's talk about our migration strategy. We started by updating what we call our day zero templates. Day zero refers to when an engineer is about to create a new service from the ground up: they go to our templates, essentially press a button, and their repos get set up for them. They have their application source code, and they have their Kubernetes manifests, which right out of the gate we updated to include canary analysis with a Rollout object. So, awesome, they get that out of the gate. Now, what about the services that aren't starting from day zero? They're starting from day one. They already exist and they're already doing something today. We noticed that most of our services were utilizing Kubernetes Deployments, so that's the next thing that we attacked. We created internally a CLI that we call the rollouts converter, which allows us to open PRs on repos that have adopted the Kubernetes Deployment object, and that PR contains the manifest changes needed to adopt Argo Rollouts with canary analysis. From that point on, what the developers of the service repos need to do is essentially test their branch and make sure the thresholds set up in the analysis templates align with the current metric patterns of their service. Once everything looks good, they can merge that branch into their main branch and then promote it through their environments as normal. Usually this process takes about two weeks to get into production, but the majority of that is just the normal release process. Probably about four hours is spent on actual development time, actual hands-on time, to make this happen. So there's very little overhead for the developer to manage this change. We then had those services that were early adopters of canary, which was great. We added another feature to that CLI so that when you run the command, it opens a PR on the GitHub repositories to add the appropriate analysis templates. They then go through that same process of looking at the PR, testing to make sure their thresholds look good, merging it in, and then sending it all the way through to production. So what's next? We have noticed a lot of great success with the services whose health checks are built on golden signals. We also noticed that we still encounter some change-induced incidents that are related to business-based metrics. So for the small set of incidents that are change-induced, we wanna start looking in the future at what other business-based metrics we can help teams drive so we can add those into our rollback strategy. You may have seen some other talks from the last ArgoCon and the previous KubeCon about how Intuit was looking into AIOps as well. We are incorporating a lot of our golden signals metrics into a unified anomaly score and utilizing that unified anomaly score to drive rollback decisions for our applications as well. Throughout this migration, we came across an interesting edge case, and I wanted to share that with you today. If you look at this diagram I have on screen, this is what a normal first step in a canary deployment looks like. Traffic is being routed to the canary; in this case, the first step in the canary deployment sends 25% of my traffic to the canary. We then have Rollouts continuously checking our metrics data store, which in this case is Prometheus. The thresholds are coming back good. We have those three nines. We're in the clear.
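As a rough illustration of that kind of check, this is roughly what an AnalysisTemplate backed by Prometheus could look like, with a "three nines" success condition. The Prometheus address, query, labels, and limits are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate                 # hypothetical template name
spec:
  metrics:
    - name: success-rate
      interval: 1m                   # re-query Prometheus every minute while the canary runs
      successCondition: result[0] >= 0.999   # the "three nines" threshold
      failureLimit: 3                # abort the rollout after three failed measurements
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090  # hypothetical address
          query: |
            sum(rate(http_requests_total{service="my-service", code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="my-service"}[5m]))
```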
We're gonna start moving forward with our canary deployment. But we had a couple of teams migrating from blue-green deployments to canary deployments that didn't necessarily wanna give up one of their favorite features of blue-green. Their favorite feature was the fact that you can deploy a service to production where it's not taking any production traffic, then route test traffic to it and test that service live in production without affecting any other production traffic. It's a pretty great feature, right? I can see why they wouldn't necessarily wanna give that up. So they wanted to look at how they could keep that feature but also make it work with canary analysis, because they saw the benefits of canary analysis: being able to send a subset of traffic, have only that subset of your customers affected, and be able to roll back. That's a huge benefit versus blue-green, where you have to cut over 100% of your traffic. So what did they do? Well, what I like to call it is they implemented an initial green step. The first step in their canary deployment boots up the canary but doesn't send any traffic to it. They would then set up their traffic controllers, things like the AWS ALB or the Istio service mesh, to allow them to send test traffic to that canary deployment, very similar to how they would send it over with a blue-green deployment. Then, once those tests had executed and they were fully satisfied that their tests had passed, they would say, okay, great, I'm going to move forward with the next step in my canary deployment. So here's a quick snippet of an example from one of the repositories that utilizes this internally. You can see the first step sets the canary scale at 90 replicas. Obviously this is a service that has a lot of traffic, hence why we're using 90 pods. And then we have a pause, and it's an indefinite pause. That pause doesn't run for a certain number of minutes; it's gonna pause until you tell it to move forward or the rollout gets aborted. So the tests will then execute via Jenkins, via their test execution platform. Once the tests come back successful, we're in the clear, woo-hoo. Then the test will tell the rollout, all right, you're good to go, you're good to move on to the next step in my deployment, which, as you see, sets the weight to 5%.
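Here is a rough reconstruction of the steps just described. It is sketched from the talk, not copied from the internal repository, so treat the values as illustrative; it also assumes a traffic router is configured so canary scale and traffic weight can differ:

```yaml
  strategy:
    canary:
      steps:
        - setCanaryScale:
            replicas: 90     # boot the canary at full test capacity, with no production traffic yet
        - pause: {}          # indefinite pause: wait here until the rollout is promoted or aborted
        - setWeight: 5       # once the test suite promotes the rollout, start with 5% of traffic
        # ... further setWeight and analysis steps would follow ...
```

The "tests tell the rollout to move on" part can be as simple as the test job running kubectl argo rollouts promote <rollout-name> once it is satisfied; how Intuit wires that up internally isn't shown here.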
All right, now let's talk about scaling up. When you're scaling up and you're using Rollouts, there are some challenges, and one of the big ones we encountered at Intuit was centrally managing our standards. When we started our journey, these Kubernetes deployment manifests lived in individual GitHub repos and were owned by each individual service developer who owned those applications. That meant that as our standards evolved and we got more complex and powerful with the health checks and the metrics driving our rollouts and our rollbacks, when we wanted to update those standards, we needed to go out and essentially open a PR on each individual service manifest repository. Now, this adds a bit of overhead and a bit of tech debt for service developers, because they need to go back, review the PR, make sure it looks good, and send it through their release process. And that's overhead we don't want to put on our developers. We want people to be able to opt into our standards, have the ability to opt out of our standards if they wanna build their own journey, or, if they want to adopt our standards but need to adjust the thresholds a little bit to match their service specifically, we wanna be able to allow for that as well. And now Harriet will take over the rest. Thanks, Katie. All right, so just like with any other tool, knowledge sharing with Rollouts starts to become a challenge once you move beyond a POC or a single team. So when you're adopting Rollouts, don't forget to allow time and training resources for teams to learn. I know it sounds really obvious, but it can slip through the cracks. It's especially important if you've got multiple teams managing their own rollouts. When you're using Rollouts in prod, it becomes a critical part of your infrastructure. And when you're introducing automation to progressive deployments, you're also reducing how much your team really needs to understand about how it works. So be careful that that understanding doesn't drop off entirely and you're left with a single person who understands the inner workings of your deployment process. The last one we've got here is access control. I'm gonna talk about this assuming that we're using role-based access control, specifically RBAC in terms of its bindings to users, not how it works internally for Rollouts. RBAC is far and away the most common type I see folks using with Argo, and it's definitely the easiest to get started with. Even if you've got Rollouts set up, it's working, it's even scaled up already, it's always a good time to go and review your access control setup. Give it a checkup, make sure it's still suitable for your teams. There are definitely some wrong ways to do access control that can get you in a pickle. So please don't go and give everyone in your organization wildcard admin access. Even if you are a really tiny startup, get yourself off to a solid start with some defined roles. Your HR person, or your new engineer on their first day, does not need permission to trigger a rollout promotion in prod. So once you start scaling out past the first group of people who are using Rollouts, you'll be glad that you've got some roles in place. Rollouts also introduces a few more elements to your access control equation. Even if you've already got your CI and CD access control sorted out, be that with other Argo projects or with other tools, Argo Rollouts can potentially bring in a traffic manager, an analysis provider, and a new UI to manage. And if you're already working at scale, you've probably already got something like Prometheus for your data, and you'll need to assess the roles you already have for those tools and figure out if they can be reused or if you need new ones for new types of access. When I'm working with our customers at Red Hat, both those who are already using Rollouts and those who are figuring out how to integrate it with their current deployment strategy, access control is the topic that comes up most frequently. How is everyone else doing it? How should we be doing it? Unfortunately, there's no one-size-fits-all answer to how it should be done, but there are a bunch of things to take into consideration. So when you're looking at your access control, it needs to be based on: How is your organization structured? What does each team do, and who is on each team? Are you in an industry that is really strictly regulated?
What does your threat model look like? And if you don't have a threat model yet, please do go and look into making one. And finally, for the threats you have identified, what is your risk tolerance for each one? Once you've thought through these, you'll have an understanding of your situation as a whole, and you can answer Rollouts-specific questions like: Which teams, or who on a specific team, should be able to add a rollout for an application? Can this be left up to your development team? If so, do they already have access to the GitOps repo, or will they need additional access? Is your platform team abstracting away application creation with an internal developer platform? Do you have multiple service accounts running behind the scenes that will need updated permissions? Who should be able to promote a rollout, and are these the same people who want to observe a rollout? You could lock down promotions in production and perhaps allow any non-prod environments open access to do promotions through canaries. That's a really popular choice for teams who are deploying to dev really frequently. Keeping permissions open in non-prod gives folks freedom to experiment and test while ensuring nothing gets accidentally rolled out into prod, or deployed faster than you wanted via an accidental promotion. This is a really useful control to have in place if you're working in, say, finance or healthcare, any industry that has really tight regulatory requirements and strict auditing. Obviously your auditor is going to be thrilled with you if you've locked down absolutely everything so no one can make any changes without multiple steps of authentication, but that's not really practical. It's all about striking the right balance for you and your teams between convenience and security. So who can choose the metrics that each team uses? Should they be centralized, like Intuit is doing? Perhaps the context and knowledge to choose the right metrics is centralized, or perhaps it is spread out over multiple teams. If you do want your developers to be choosing their own metrics, do they need to collaborate with your DevOps team before the metrics are relied on for production deployments? Or do you need to require a review from an SRE for anything that touches deployments, not just adding metrics? Do you need to perhaps delegate metrics selection to a team member with SRE experience on each developer team? They could also have the context to know which metrics are important for their service or application. Once you've chosen your metrics and the thresholds that are acceptable for your service, you can define an analysis template, which is then executed as an analysis run. So who will need to see the analysis runs? And if you get a failure, who needs to know about it? Understanding the results of an analysis run isn't as easy as it could be, and perhaps that understanding is only in the domain of your DevOps team or an embedded SRE team member at the moment. So the folks who need to know that their rollout failed may not actually need or want access to the analysis run output, or even to the rollout UI.
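As one possible shape for the "observe but don't promote" split mentioned earlier, here is a sketch of two namespaced Roles. The role names and the prod namespace are hypothetical, and the bindings to actual users or groups are left out:

```yaml
# Read-only access: enough to watch rollouts and analysis runs in prod, but not to promote or abort
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rollout-viewer          # hypothetical
  namespace: prod               # hypothetical
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["rollouts", "rollouts/status", "analysisruns", "analysistemplates", "experiments"]
    verbs: ["get", "list", "watch"]
---
# Promotion access: promote and abort work by patching the Rollout, so this role carries write verbs
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rollout-promoter        # hypothetical
  namespace: prod
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["rollouts", "rollouts/status"]
    verbs: ["get", "list", "watch", "patch", "update"]
```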
So how do we apply this information to our teams and across our organizations? When you're introducing progressive deployments or migrating from Deployment CRs to Rollout CRs, this one's really an evergreen recommendation: don't boil the ocean and try to do it all in one go. I am yet to come across a customer who genuinely cannot do an incremental rollout of a new technology. There are plenty of teams who start off thinking that it's the only way to do it or that it's the best option for their situation, but there are always smaller steps to break it down and iterate on. There's that classic MVP example: if you're trying to build a car, don't build the wheels, then the chassis, then the exterior; iterate and have something useful at each step, like a skateboard, then a bike, then a motorbike, and finally a car. So how much do you actually need to scale to suit your team? Pre-optimizing beyond where you would ever need to scale is really a waste of everyone's time. Go and assess your product and its goals. What resources do you already have available, either in the cloud or on-prem? What skills does your team have, and what do they want to learn? How much money do you want to spend on infrastructure or managed services? And just like with introducing Rollouts, scale up incrementally. Don't try to jump all the way to the end. And finally, it's always a good time to do an audit of the access control needs for your team and your wider org. Even if you've already got a really solid access control system in place, when was it designed, and has anything changed since then? So go and look at your current state, diff it with your desired state, and then you can progressively roll out any changes. Thank you. Do we have time for questions? Yeah, I think we've got a couple of spare minutes, if anyone's got any questions. Hi. Is it possible to deploy Argo Rollouts through OpenShift GitOps? Soon. Thanks. What options do you see for doing traffic control to the canary at the point when you run your tests? Is Istio always needed, or are there native methods to do that? So my understanding is you don't need traffic management in place. Zach, do you want to help me out? So I would need a service mesh to be able to distribute the traffic; it's not something that Rollouts can facilitate? Well, it depends on how fine-grained you want to distribute your traffic, right? You can implement basic canary, and basic canary will have some sort of method based off the number of pods you have, to essentially say, hey, if I have four pods... Yeah, that's kind of just faking it, right? Exactly. So if you want to be as precise as possible, then you want to integrate with a traffic manager and set the weights through your traffic manager. Cool. The goal would be just to direct some traffic directly at it, at the canary. Pretty much like you described in the production example, but just for a staging system to validate, basically a QA environment. Gotcha, I would still recommend going with the traffic manager. Thanks. Anyone else?
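For reference, hooking a canary up to a traffic manager so that setWeight shifts real request percentages, rather than approximating with pod counts, looks roughly like this with Istio. The Service and VirtualService names are hypothetical:

```yaml
  strategy:
    canary:
      canaryService: my-service-canary     # hypothetical Service the mesh routes canary traffic to
      stableService: my-service-stable     # hypothetical Service for the stable pods
      trafficRouting:
        istio:
          virtualService:
            name: my-service-vsvc          # hypothetical VirtualService managed alongside the Rollout
            routes:
              - primary                    # the route whose weights the controller adjusts
      steps:
        - setWeight: 5                     # the mesh now sends 5% of requests to the canary
        - pause: {}
```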
Is there a way, or a method coming in the future, to natively use Argo Rollouts with services that do not expose an HTTP API? It's metric-based, of course, but pretty much every example and every implementation I've seen assumes a microservice with a web server built in. We also have services that do not have any endpoints, essentially blind services. Yeah, absolutely. This is something that the maintainers at Intuit are looking into. Unfortunately, it's a hard problem to solve, but it's something that we are hopefully looking to prioritize come FY24. Specifically, what we'll be looking at is workers, so applications that pull from queues, streams, and so on, which is one of the next most common use cases for a long-running service on Kubernetes. You can also become a contributor and help out. Yeah. Okay, thank you very much. If you have any other questions, please catch the ladies off stage, and give them a round of applause. Thank you.