All right. Bonjour, Argonauts. What's going on? We're here to talk about scaling cloud-native CI/CD pipelines with Jenkins and Argo. Last year at ArgoCon, Bertrand and I presented on how to migrate from Jenkins to Argo for CI, and on some of the pros and cons of using Jenkins and Argo on Kubernetes. Today we're going to go a level deeper and talk a bit more about scaling both Jenkins- and Argo-driven CI/CD pipelines. My name's Kaelin. I'm co-founder and CEO at PipeKit, where we're maintainers on the Argo Workflows project; I've been a contributor on the project since 2021. What we do is provide a control plane for Argo Workflows for teams that want to offer self-serve workflows, and we also provide enterprise support. You'll find us in the community Slack and on GitHub, so don't be shy, feel free to say hi. Hi, everyone. My name's Bertrand. I'm a staff software engineer at Intuit. I've been working on cloud-native and distributed systems for about 20 years. I joined Intuit about a year and a half ago as a member of the CI/CD team, and my mission was to evaluate alternatives to Jenkins, such as GitHub Actions or Argo Workflows, for our next-generation platform. The goals for this talk: first, we're going to go over the challenges of running Jenkins on top of Kubernetes. Then we'll see an example of an end-to-end CI/CD pipeline, first using a hybrid strategy between Jenkins and Argo CD, followed by an Argo-only solution. And finally, we'll give you some tips and tricks on how to run those CI/CD pipelines and how to migrate from Jenkins to Argo Workflows. First, a few words of introduction about Intuit. Intuit is a financial software company. You may have heard of, or even used, some of our products, such as TurboTax, QuickBooks, and Credit Karma; maybe not so much in Europe, but in the US, probably. Our core mission is to empower our customers to make the best financial decisions using our AI-driven platform.
A quick look at the CI/CD landscape at Intuit. We are serving about 6,000 developers, running 100,000 builds daily. In order to support this, we are running a Kubernetes cluster with about 150 nodes and 200 Jenkins controllers, and depending on the load at any given point in time, somewhere between 1,000 and 1,500 Jenkins build agents. We've been very successful with Jenkins so far. However, it has its challenges, so let's go over some of them. The first one is the main feedback we get from our customers: the difficulty of navigating the UI, especially when your build is failing. Then, we're running the open source version of Jenkins, which doesn't come with disaster recovery or high availability features. Even though we have implemented our own custom solution, it can sometimes take up to an hour to fail over one of the bigger Jenkins controllers, which exceeds our SLAs. Then there's the challenge of running Jenkins at scale, especially the lack of a unifying control plane, which makes deploying a new version of the plugins, or rolling out a new version of Jenkins itself, very tedious. And then there's the fact that Jenkins is not cloud native. We're locked into an execution model where you have one build agent in a pod, with all of its containers running for the full duration of the build; even if it's doing nothing, it's wasting your cluster resources. Now let's have a look at a typical Jenkins CI/CD pipeline at Intuit. We won't go over the CI part, as we already covered that in our previous talk; we'll focus on the CD part. The CI part is, roughly: first you build a Docker image and publish it to some Docker registry. The CD part is typically rendering some Kubernetes manifests and then triggering a deployment from them. Let me show you what that looks like in more detail. At Intuit, for each source repository,
we have an associated deployment repository. The deployment repository is structured as follows: the main branch holds all the templated Kubernetes resources, and we use Kustomize to customize those resources. Then for each environment, say dev, QA, and prod, we have a corresponding branch which holds the rendered manifests for that environment. The way we use this is: when we render the manifests, we usually update the Kubernetes resources with a new image tag, or whatever we need for that environment, and then we commit this to the target branch for that environment. This is where GitOps shines. We're able to track all the changes, and we know exactly what deployment has been triggered, and when. Finally, to trigger the deployment, for each environment we have a corresponding Argo CD application, and we use the Argo CD CLI to synchronize that application, which reads from that branch and proceeds with the deployment of your application. For the actual deployment of the application, we're using Argo Rollouts. This has been a key game changer for our deployment strategy, and I want to highlight two key aspects of Argo Rollouts today: progressive delivery, and automated analysis and rollback. On progressive delivery: Argo Rollouts supports multiple deployment strategies, including blue-green, canary, and rolling updates. We're using canary deployments at Intuit for all our services. The idea, for those who are not familiar with it, is that you can gradually instantiate the new version of your service and route some traffic over to it, and if everything is successful, you can instruct Argo Rollouts to move forward.
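As a sketch of the per-environment Argo CD setup just described, an application pointing at an environment branch might look like this. All the names, the repository URL, and the namespace here are illustrative placeholders, not Intuit's actual configuration:

```yaml
# Hypothetical per-environment Argo CD Application. The CD pipeline commits
# rendered manifests to the "qa" branch, then synchronizes this application.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-qa
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/my-service-deploy.git
    targetRevision: qa            # environment branch holding rendered manifests
    path: .
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-qa
```

The pipeline's final CD step would then run something like `argocd app sync my-service-qa` to pick up the freshly committed manifests.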
The way it works, and this is the second point, is that it has an automated analysis and rollback feature. You provide Argo Rollouts with some metrics; it supports a large number of metrics providers, and in this case we're using Prometheus with an HTTP success rate, but it could be something else. If everything is okay with the deployment, Argo Rollouts moves on, deploying more and more pods for your service and routing more and more traffic to them, until eventually you have the service fully deployed. However, if there is any issue with the new version of the service, it can automatically roll back to the previous stable version without any human interaction, which is great. This is what we use for all our services at Intuit, and it has been a game changer: we don't leave anything to luck, and we have full control over, and confidence in, our deployments. The last thing I'd like to talk about is that some colleagues at Intuit are working on the open source Numaproj project, which is an ML pipeline engine, if you will, and they have prototyped an integration with Argo Rollouts. In the previous example we were using the HTTP success rate, which is a very simple example, but you probably have complex applications with dozens or hundreds of metrics, right? So how do you make sure you can compute a health score for your service that is actually meaningful when you deploy your application? This is exactly what they've done with Numaproj, with Numaflow and Numalogic. They use an ML model over all those metrics, like HTTP success rate, error rate, latency, throughput, you name it, which could be application-level metrics as well, and out of those metrics they compute a score, if you will, thanks to the model.
And that is what they used as the metric for the Argo Rollouts deployments. This is something we call AIOps at Intuit, and we are in the process of deploying that strategy for all our services, so it's in progress. I strongly encourage you to have a look at the talk they gave at QCon 2022, where they go over this. And I think I'm done, so I'm going to hand it over now to Kaelin. Thank you. Yeah, I'll jump in now to more of the PipeKit use cases and some of the other scalability considerations we can share with you. At PipeKit, we migrated our CI off Jenkins to Argo Workflows a while ago, so now we're going to take a look at an example pipeline and a demo, for you to understand how we use Argo Workflows, Argo CD, and Argo Rollouts in our end-to-end CI/CD pipeline. This is a simple example where we have a Git checkout, then we build a container, publish it, and then deploy with Argo CD, doing a canary deployment with Argo Rollouts. This example is built to be run anywhere you'd like; you can run it locally, and you can check it out here on GitHub, it's our free Argo CI/CD pipeline example. I'll jump into a quick demo of how that works now. What we have here is: we're going to submit an Argo workflow that triggers a deployment incorporating Argo Rollouts, and what this is going to do is spin up Argo Workflows here. So here's our workflow, and it runs through the Git checkout, builds the container, and deploys that on Argo CD with the new image. That will then replace the existing app we had deployed already. We can see now we have Argo CD running and kicking off the initial deployment. And this here is our deploy manifest. It's written as a workflow template that we have, and one of the steps here is an Argo rollout. So we can see it deploying our resources here, and this is where we defined our canary deployment settings.
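For reference, a canary rollout with automated Prometheus analysis, in the spirit of the demo and the Intuit setup described earlier, might be sketched like this. The app name, image, weights, and query are illustrative assumptions, not the actual demo manifests:

```yaml
# Sketch: five-stage canary, pausing 10s between stages as in the demo,
# with a background Prometheus analysis that can abort and roll back.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: registry.example.com/demo-app:v2
  strategy:
    canary:
      analysis:                    # runs in the background during the rollout
        templates:
        - templateName: http-success-rate
      steps:
      - setWeight: 20
      - pause: {duration: 10s}
      - setWeight: 40
      - pause: {duration: 10s}
      - setWeight: 60
      - pause: {duration: 10s}
      - setWeight: 80
      - pause: {duration: 10s}
      - setWeight: 100
---
# If the success rate drops below the threshold too often, the rollout
# is aborted and traffic returns to the previous stable version.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-success-rate
spec:
  metrics:
  - name: success-rate
    interval: 30s
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="demo-app",code!~"5.."}[2m]))
          / sum(rate(http_requests_total{service="demo-app"}[2m]))
```

In production you would typically use longer pauses between stages; the 10-second cadence only mirrors the demo.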
So we have five stages to that, and we're just rolling it out every 10 seconds if our metrics are looking good. Again, this is just an example; we might actually run this slower in production, but for the purposes of the demo we ran it every 10 seconds. So we can swap back, and as the pipeline progresses here (sorry, things are a bit slow), we'll wait and see Argo Rollouts come online and push the canaries live. So now it's deploying. This is all in an Argo workflow, and when we hop back here, we see our canaries starting to come online and gradually roll out. This shows you, I think, a quick end-to-end example of how you can use a workflow to trigger Argo CD and then eventually a rollout, over any span of time you set, if you want to use a canary deployment. Now we'll hop into some considerations for scaling up your CI/CD pipeline with Argo. The first is why you might consider using Argo Workflows for your CI, and the big reason is the one-step, one-pod principle. That enables you to autoscale faster and run parallelism by default, and the big advantage we saw is that we can then scale CI to the limit of our Kubernetes cluster, rather than the rough limit we were hitting with Jenkins of around 5,000 jobs per Jenkins controller. So it has enabled a lot more scalability for ourselves and for our customers, and we can run CI jobs faster as a result. Since things are dynamically provisioned, if there's a human in the loop in that CI pipeline, we don't have to hold resources until they approve something to continue, or while we're waiting for a long step to run. We can just scale up step by step, which has been a big advantage in cost savings; it actually reduced our CI costs by 90%, because we stopped holding resources like we were with Jenkins and just autoscale them on Kubernetes with Argo.
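The one-step, one-pod model and parallelism by default look roughly like this in practice. This is a minimal sketch; the image, commands, and task names are placeholders, not PipeKit's actual pipeline:

```yaml
# Each DAG task runs in its own pod; tasks with no declared dependency
# start immediately and in parallel, so the cluster only holds resources
# for the steps that are actually running.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ci-example-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: lint                 # no dependencies: starts right away
        template: run
        arguments:
          parameters:
          - {name: cmd, value: "make lint"}
      - name: unit-tests           # no dependencies: runs in parallel with lint
        template: run
        arguments:
          parameters:
          - {name: cmd, value: "make test"}
      - name: build                # only starts once both tasks succeed
        dependencies: [lint, unit-tests]
        template: run
        arguments:
          parameters:
          - {name: cmd, value: "make build"}
  - name: run
    inputs:
      parameters:
      - name: cmd
    container:
      image: golang:1.22
      command: [sh, -c]
      args: ["{{inputs.parameters.cmd}}"]
```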
The other big benefit is that since every step is in its own pod, you can take advantage of spot nodes, and that was a big factor in our cost savings. Argo has a very handy retry framework available as a native feature, so you can quickly retry any step in your CI pipeline if that spot node gets killed by your cloud provider. Another thing I'll mention is that parallelism by default is a really nice native feature in Argo Workflows that in Jenkins we had to define ourselves. Using DAG templates on the Argo side, we're able to automatically run any step that doesn't have a declared dependency. That's very handy for spinning up parallel tasks, running them as quickly as you can, and then continuing on with the pipeline. Another thing, as we saw, is that it's pretty seamless to integrate with Argo CD and Rollouts. You can create workflow templates that different Argo workflow pipelines can call depending on parameters, so it really gives you the ability to mix and match your pipelines for the use case you want, without having to write everything from scratch each time. And the last thing we find a benefit in is that if customers are a Python shop, there is a way to write your workflows in Python as well. You're not stuck using a CI-specific language like Groovy: we're using YAML, which is what we use for all of our other infrastructure, or you can use Python through the Hera SDK. Next we'll dive into some scaling strategies and tips. These fall into two categories: the first is work avoidance, and the second is actually testing your pipeline before promoting it to production. On the work avoidance side, we have a couple of tips. The first is quite obvious: cache your images, that's a good thing to do.
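Putting the reusable-template and spot-node ideas from above together, a pipeline might reference shared WorkflowTemplates and retry any step whose spot node is reclaimed. The template names, taint key, image tag scheme, and retry limits here are assumptions for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-service-ci-
spec:
  entrypoint: main
  tolerations:                     # assumes spot nodes carry a taint like this
  - key: node.example.com/spot
    operator: Exists
    effect: NoSchedule
  retryStrategy:                   # applies to templates in this workflow:
    limit: "3"                     # re-run steps that error, e.g. when the
    retryPolicy: OnError           # cloud provider reclaims a spot node
  templates:
  - name: main
    steps:
    - - name: checkout
        templateRef:
          name: git-tools          # shared WorkflowTemplate from the library
          template: checkout
    - - name: build
        templateRef:
          name: container-tools
          template: buildkit-build
        arguments:
          parameters:
          - name: image
            value: registry.example.com/my-service:{{workflow.uid}}
```

The point of the `templateRef` indirection is that the checkout and build logic lives in one place and every team's pipeline just parameterizes it.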
We unfortunately run into a lot of teams that delay that step, it's just forever on the to-do list, and they wonder why their CI pipelines still take a long time after migrating; this is really what we find has the biggest impact on pipeline speed after they've gotten onto Argo. We use Spegel as a tool, so you're welcome to check that out, but we just recommend caching with whatever tool works best with your cloud provider. The second tip, as far as avoiding work, is to use the native memoization features that Argo has. If you're going to rerun a pipeline, you want to be able to just reuse any steps that already ran and didn't change, and that's what memoization does for you. And what's great is that a colleague of mine, Alan, a maintainer on the Argo Workflows project, did a talk on this with another maintainer from Intuit, Julie Vogelman, where they talked in detail about how you can use memoization for your workflows. It works great for any data that's less than a megabyte, so it can help you speed up pipelines, and there are additional strategies to use if a step's output is larger than that. So I recommend checking out that talk from the last ArgoCon. The last work avoidance tip we have: if you have CI and you're running it on commits, which is what we do, every commit is going to trigger a job. Think about how you kill duplicative jobs. If CI is still running for, say, a five-to-seven-minute job, and you forgot something and push another commit, you might not want to keep running that old CI job. There is a way to use ConfigMaps to track runs and enforce logic on which runs of your workflows are kept live. We use latest-job-wins logic, so the workflow run for the latest commit takes precedence, but you can also flip that around so that only the first job takes precedence. Taking a look at this workflow here, this is just a quick example of how you might do that.
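A rough sketch of that guard, written as a template fragment for the workflow's templates list, could look like the following. The lock naming, image, and parameter are illustrative, and the workflow's service account would need RBAC permissions on ConfigMaps and Workflows:

```yaml
# Illustrative "latest commit wins" guard, run as the first step of CI.
# If a lock ConfigMap already names a live workflow for this branch,
# that older run is stopped before this one proceeds.
  - name: dedupe-guard
    inputs:
      parameters:
      - name: lock-key             # e.g. a hash of repo + branch
    container:
      image: bitnami/kubectl:latest
      command: [sh, -c]
      args:
      - |
        LOCK="ci-lock-{{inputs.parameters.lock-key}}"
        PREV=$(kubectl get configmap "$LOCK" \
                 -o jsonpath='{.data.workflow}' 2>/dev/null || true)
        if [ -n "$PREV" ] && [ "$PREV" != "{{workflow.name}}" ]; then
          # a run for an older commit is still live: stop it
          kubectl delete workflow "$PREV" --ignore-not-found
        fi
        # record this run as the current owner of the lock
        kubectl create configmap "$LOCK" \
          --from-literal=workflow="{{workflow.name}}" \
          --dry-run=client -o yaml | kubectl apply -f -
```

Flipping to first-job-wins is just inverting the check: if the lock exists, exit and let the original run keep going instead of deleting it.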
It's basically done by adding a simple template early on in your job that creates a ConfigMap, naming that ConfigMap after your workflow with a parameter. Then you have some quick logic that says: if that ConfigMap exists, kill the prior job with kubectl; if not, create the ConfigMap, so that if the job runs again, we know about it and don't have to rerun that job and use those resources again. Like I said, you can flip that on its head and flip the if-logic around if you want to keep a job live instead. We have some customers that are deploying, say, a streaming app, and you might want to keep that streaming job live in case someone pushes another job to Argo, so it's possible to switch that logic around if you want. So that's it on the work avoidance side. Moving into testing, we have a few basic lint tips for you. The handy thing with Argo is that you can use any YAML linter. We recommend yamllint and Prettier, and then using MegaLinter to incorporate everything, all your languages, into one linter that runs on every PR; that works great for us. The other thing not to forget is the Argo CLI's lint command, so make sure to run that. We run it early on in every pipeline to make sure we pick up any syntax errors that aren't covered by a basic YAML linter. And finally, there is the concept of taking a look at each step in your pipeline, and if you have a certain set of predictable outputs, considering testing for those; that's a concept we call asserting test cases for each step in your pipeline. It's something we covered in an ArgoCon talk before by my colleague JP as well, so check that out: CI/CD for Data Pipelines. The concept is: if you're expecting a certain file output or a certain data-type output in a pipeline, test for it. For a CI use case, we're often just testing for the file outputs and paths we're expecting.
If for some reason those get messed up in dev, we want to know about that before we push our CI pipeline to production. So we include a workflow-of-workflows pattern to test those workflow templates and ensure that if they were changed in that PR, they're not changing the expected output. As far as migrating goes, if you're considering moving from another CI tool to Argo Workflows to build your full CI/CD pipeline with the Argo projects, we have a few tips. First, you don't have to migrate all at once, so run workflows alongside Jenkins. That's what we did at first too, and you can trigger Argo workflows with Jenkins if you want. We used it for new projects and new repos; we see customers take a new team that's more cloud native and start testing out Argo Workflows with that team as their CI tool, without having to move the entire company's CI over. We started with simple tasks. Testing is great because you can easily parallelize it on Kubernetes, versus using Jenkins previously, so that's something we often recommend people start using Argo for. The third thing we recommend is using workflow templates, so that as you're building, you're not having to redo all that work over and over again for each CI pipeline. Start thinking of your templates as a library that you'll eventually give to developers to self-serve from; that will make your life a lot easier down the adoption journey. And lastly, in the process of migrating and scaling up your CI, there are tools that can help speed things up as you move to Kubernetes and use Argo for CI pipelines. BuildKit is one that we really recommend for builds, and it works great with Argo. And then Spegel is the other one I mentioned earlier, for image caching. And finally, a little bit about PipeKit.
So we're, as I mentioned, maintainers on the Argo Workflows project, and also on the Helm project, and we're here to help anyone in the community with questions about getting onto Argo and migrating to Argo Workflows from a tool like Jenkins or Airflow; we help you save engineering time and cloud spend in the process. We do that either through our control plane for Argo Workflows, if you want a self-serve experience, or we provide a team that can be in your Slack and answer questions anytime you need it. As far as next steps, again, there's this free repo and resource we can give you, so definitely check that out on GitHub. You can pull it down if the internet permits; I had a lot of trouble earlier today with my demo, trying to do that live. So maybe pull it down when everybody else is not trying to, back at the hotel or something, and you can play around with it and deploy it locally or on a dev cluster you might have. You can also come and meet the maintainers of Argo CD and Argo Rollouts; Intuit is a contributor, and I think Zach and Michael are leads on those projects, respectively. So come meet us at the Argo booth and we'll chat. We'll be there as well from the Workflows and Helm side, so hope to see you there tomorrow at the Exhibitor Expo downstairs. And if you want to chat with us further about your use cases, we're happy to meet up this week. Thanks a lot. Thanks a lot.