Hi everyone. I'm Andy. I'm a software engineer on the team, and I'm very excited to be here, especially since it's my first ever KubeCon. Together we are going to take you through our journey of improving the lifecycle management and provisioning of add-ons across thousands of EKS clusters in Hyperforce using Argo Workflows.

A little about Salesforce: we are proud to be the number one CRM company, and we are now the world's largest enterprise apps company. In addition to being a category leader, we are also a leader in philanthropy, innovation, and culture. We believe that business is the greatest platform for change, and one of the ways we use our platform for change is through our 1-1-1 philanthropy model, where we pledge 1% of our time, equity, and product to giving back to our communities.

A little about Hyperforce: you'll hear this word a lot in the rest of the slides. Hyperforce is our trusted cloud platform. It is our infrastructure built on public cloud, and it is the foundation of everything we do at Salesforce. Hyperforce is secure by default, and it provides comprehensive privacy standards that give you control and transparency over your customers' data.

A little introduction to our team: we are the Hyperforce Kubernetes platform team. We provide a multi-substrate, completely managed Kubernetes platform integrated with the rest of the Salesforce infrastructure. Our value propositions: we provide 24/7 support for our clusters and integrations, and we manage all of the Salesforce infrastructure integrations into Kubernetes, including certifying new releases, continuous synthetic monitoring, and safe rollout. Maintaining trust is our number one value at Salesforce, and we help maintain security and trust by running the most up-to-date, secure, and patched version of Kubernetes in Hyperforce, including the integration versions. By solving problems centrally in a platform, we avoid dedicated cluster management headcount, which, based on an internal survey, is around one and a half engineers per service team on average. Our mission is to allow service owners, which are other teams at Salesforce, to focus on the unique value of their services without having to focus on infrastructure. This increases developer agility while decreasing operational cost and complexity. As a result, we need an efficient, scalable, and resilient system for managing our clusters and cluster components to achieve our mission.

So what are add-ons? Add-ons, or Kubernetes integrations in our internal terminology, are the glue that makes these integrations with the rest of Salesforce possible. They are Salesforce-specific services that run in every Hyperforce cluster. Some examples are mutating webhooks that inject sidecars for certificate management and secret rotation, daemonset pods for logging and metrics, and controllers for patching nodes. We have around 19 total integrations, and the list is growing. We chose the name add-ons for this talk because it is a more well-known term, but this should not be confused with EKS add-ons as shown on the right. In the rest of the presentation, we are going to continue to use the word add-ons to mean Kubernetes integrations as defined on this slide.

As an industry, we all want clusters as cattle, but how many of us have been able to realize that dream?
Clusters as cattle make sense for certain scenarios, but there are gaps in the tooling in our CNCF landscape that lead many large organizations to run clusters as pets: not smaller clusters, not bigger, but average sized, and thousands of them. In order to manage this large number of pets, we need to improve the lifecycle management of these clusters and of the add-ons deployed on them at scale. That is the growth phase our team is in to support the next phase of our business goals. We are going to focus on add-ons today, which is what this talk is about.

Add-ons need to be updated by their service owners and rolled out to thousands of clusters. How do you do that efficiently? By making the add-on owners happy: by improving their developer experience, by giving them a faster inner loop and faster, more reliable deployment pipelines. In our industry, we talk about service ownership and production workloads, but we rarely talk about the CD pipelines that deliver these workloads to production. We need to start treating CD pipelines as production workloads themselves, and the CD pipelines that deliver add-ons to our clusters are no different. We need to measure their SLIs and optimize and simplify them so that they meet their SLOs. We also need to make measuring and optimizing the SLIs of these add-on CD pipelines easier by taking advantage of container-native workflow systems like Argo Workflows or Tekton, which are very well integrated with the rest of the cloud native ecosystem; they are natively built to run on Kubernetes and can take advantage of all of its features. Lastly, as the number of clusters keeps increasing with business demand, a central CD solution is not easy to scale, so we need to offload the last-mile delivery of these per-cluster add-ons to a model that scales horizontally with business demand. We will be looking at our current solution for deploying add-ons from these angles and then describe the solution we are embarking upon.

This is our current system for managing clusters as pets. If you took a 100,000-foot view of our deployment pipelines, it would look something like this: infrastructure, in terms of cluster and add-on versions, is defined in Git repos as shown on the left, and Spinnaker pipelines run across thousands of EKS clusters to deploy these add-ons in a staggered and health-mediated way as shown on the right. In practice, this has many dimensions and looks something like this. This is a visualization of our Spinnaker pipelines used to deploy an EKS cluster and the add-ons on top of it for one environment. Each rectangle represents a Spinnaker pipeline and each line represents a call to another Spinnaker pipeline. The nesting is deep. If you were to take this picture and multiply it by the number of regions, the number of instances of those regions, and the number of instances of each business unit in those regions, the picture easily becomes unfathomable. The Spinnaker UI starts to show signs of its limits here. We will be focusing on the subset of these pipelines that are used to deploy add-ons.

Now let us look at our current service level indicators for these add-on pipelines. The add-on owner's inner loop velocity can range anywhere from 5 minutes to 30 minutes depending on the complexity of the change and the performance of the CI/CD pipelines. The number of Spinnaker pipelines used to deploy these 19 add-ons to a single cluster is around 90 today.
If you were to run these pipelines across all EKS clusters, it would take roughly 300,000 executions to deploy them across all environments. That also gives you a hint of the number of clusters. It takes an hour to finish deploying all add-ons to a single cluster and up to three weeks to deploy them across all EKS clusters in the best case. This has worked so far, but as we continue to add more clusters, we really need to improve these SLIs.

Let's summarize what our SLIs and our customers are telling us about the problems we need to address in deploying these add-ons. The developer experience around managing add-ons is lacking in various areas: no inner loop, hard-to-learn and hard-to-maintain templating, a complex UI, and slow, complex pipelines. Debugging is harder due to the sheer number of pipelines and the nesting we saw in the previous slides. We need to make our CD pipelines faster and the UI more responsive. We also need to be able to process more workflows concurrently to handle more clusters. The extensibility of Spinnaker is limited, especially when we are using it to deploy to a single cluster. It is hard to make these pipelines resilient in the face of failures, especially due to the lack of any kind of native retry capability. Additionally, there is not enough visibility for our add-on owners to debug their failures and optimize their execution times quickly. If we are going to scale this centrally managed Kubernetes platform and get closer to our vision of efficiently operating clusters as pets, while continuing to maintain trust for our customers, we really need to do better in all of the categories on this slide. So what is the solution to all of these problems? I am going to hand it over to Andy to walk us through the solution.

Thanks, Mayank. Our solution is to offload the last-mile delivery of our add-on provisioning process to Argo Workflows. Argo Workflows is a container-native, CNCF incubating workflow engine that you may have heard of at some of the other talks here at KubeCon. As a reminder, in Argo Workflows each execution instance of a workload is defined in a CRD instance called a Workflow, and you usually define the reusable templates, or stages as I like to say, for use in your workflow in other CRD instances called WorkflowTemplates. So essentially, our Spinnaker pipelines for provisioning each add-on become Argo workflows. As you can see, right now Spinnaker centrally handles the add-on provisioning for each of our add-ons in each of the environments for the Hyperforce-managed clusters on the bottom. It promotes a change through the dev environment, then through test, and then through the higher environments. What we are doing now is adding Argo Workflows to the mix. We still use Spinnaker to trigger the provisioning process, because it is integrated with our internal infrastructure registry, but Spinnaker hands off the workload to an Argo Workflows controller running in a control plane cluster per environment. This has several implications. The first is that we can scale across our environments more effectively. We even have the option to shard across multiple control plane clusters if we need to handle more pods for all the workflows in an environment; in Argo Workflows, stages are run under the hood as pods.
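To make those two CRDs concrete, here is a minimal sketch under the assumption of hypothetical names: the provision-addon template, the placeholder image, and the addon-name and target-cluster parameters are illustrative, not the actual entries in our library.

```yaml
# A reusable stage library entry installed in the control plane cluster
# (all names here are illustrative, not our actual templates).
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: provision-addon
spec:
  entrypoint: provision
  arguments:
    parameters:
      - name: addon-name
      - name: target-cluster
  templates:
    - name: provision
      container:
        image: alpine:3.19    # placeholder; a real stage would run a Helm/Terraform tooling image
        command: [sh, -c]
        args: ["echo provisioning {{workflow.parameters.addon-name}} into {{workflow.parameters.target-cluster}}"]
---
# One execution instance: a Workflow that references the template and says
# which add-on to provision and which cluster to provision it in.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: provision-addon-
spec:
  workflowTemplateRef:
    name: provision-addon
  arguments:
    parameters:
      - name: addon-name
        value: cert-injector      # hypothetical add-on
      - name: target-cluster
        value: dev-cluster-001    # hypothetical cluster id
```

Submitting the Workflow, whether from Spinnaker or simply with kubectl create -f, is what makes the controller schedule the stage pods.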
The next implication is that we can make use of a lot of the features of Argo Workflows to extend or improve the provisioning processes and resolve the problems that Mayank mentioned on the earlier slide. It also helps improve our developers' inner loop experience, as Mayank mentioned, because they can now use familiar Kubernetes tooling to update their provisioning processes, like using kubectl to test and apply changes.

So let's take a closer look at this slide, which shows the setup in the dev environment, although it also applies more broadly to our other environments. We have an Argo Workflows controller running in an EKS cluster, or in shards if we need them, and it is tied to an Amazon RDS or Aurora database and an S3 bucket. This is because Argo Workflows stores the status of each workflow execution in the status field of the CRD instance, so when you have a lot of stages, clusters, and add-ons, those objects can get rather large and threaten to exceed the etcd object size limit in the cluster. We therefore use Argo's built-in functionality to offload workflow statuses and archive workflows to the SQL database. The S3 bucket is just there for any artifacts that we may need to store in the process of running a workflow. We have a library of workflow templates installed in the EKS cluster. Each of these defines the process for provisioning one particular add-on, or defines helper stages, for example for running Helm and Terraform, that can be referenced from those add-on provisioning templates. To actually trigger provisioning for a particular cluster, as mentioned before, it is still Spinnaker that triggers it, but Spinnaker passes on parameters saying which add-on to provision and which cluster to provision it in, and Argo constructs a workflow from the templates and runs Terraform and Helm to provision the resources for that add-on in that cluster.

On the next slide, we see that the use of workflow templates also helps us with versioning our add-on provisioning processes and releases. Because workflow templates are YAML objects in the cluster like any other Kubernetes object, you can store multiple versions of them and differentiate them by name; for example, you can suffix them with release-1 or release-3. When you submit a provisioning request through Spinnaker, you can choose which release is being referenced. So if we usually use release-3 but we find out there is a bug, we can roll back to release-2 very easily by changing the reference. It also allows our add-on owners to more easily make changes to their provisioning process as defined in the workflow templates. We have a Git repository where the workflow templates are stored, and add-on owners can check out that Git repository, make changes to the definition, which is just a YAML object, kubectl apply the change under a different name, say "test", to differentiate it from the releases, and then create a workflow that references that test template with their changes and execute it to make sure it passes. Once they are done, they can push their changes back to the Git repository and submit a PR to merge them in. This is an improvement over our previous process, because previously add-on owners would need to either update a pipeline template repository for Spinnaker pipelines and then wait for the changes to be generated before testing, or update the pipeline in the UI, copy their changes over, and make sure nothing goes wrong in the process.
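As a concrete illustration of that status offloading and artifact storage, here is a hedged sketch of the relevant pieces of the Argo Workflows controller ConfigMap; the database endpoint, secret names, and bucket are placeholder assumptions rather than our real configuration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Offload large node statuses and archive finished workflows to SQL so that
  # big Workflow objects do not push etcd toward its object size limits.
  persistence: |
    nodeStatusOffLoad: true
    archive: true
    postgresql:
      host: argo-archive.example.us-west-2.rds.amazonaws.com   # hypothetical RDS/Aurora endpoint
      port: 5432
      database: argo
      tableName: argo_workflows
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
  # Artifacts produced by workflow steps are written to S3.
  artifactRepository: |
    s3:
      bucket: addon-workflow-artifacts   # hypothetical bucket
      region: us-west-2
      endpoint: s3.amazonaws.com
```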
So with the new process, add-on owners can just directly edit the templates in the Git repository and keep doing kubectl apply and create. And don't worry, we have controls in place to make sure that this kind of testing happens only in the dev environment against test clusters; it does not go past that, and we do not open any security issues as a result.

As for why we chose Argo Workflows specifically, there are a number of features, some of which I'll go over on this slide. The first is that there is overall feature parity with our current implementation of the Spinnaker pipelines, which is important for allowing us to migrate over more smoothly and easily. Argo Workflows, for example, has support for different stage execution orderings based on directed acyclic graphs or step lists. It has support for conditionals and expressions based on your input and output parameters. It supports limiting parallelism, in case two runs of Terraform would otherwise execute at the same time and conflict with each other. And it is in fact more extensible than our pipelines: you can do things like define a script that you want to run in a container directly in the workflow, which helps with any refactoring we need for better reliability and performance.

Of course, Argo Workflows is also Kubernetes-native, which helps with the versioning and the testing mentioned earlier. It means more familiar Kubernetes tooling for teams like ours and for add-on owners who already know Kubernetes, but it also means that Argo Workflows can natively support things like emitting Prometheus metrics, so we can get better visibility into workflow execution times and workflow error counts. It can set pod disruption budgets on particular stages to help guard against evictions. And it can do the next important feature for us: retry failed steps in case of transient errors, of which our Spinnaker pipelines do see a few. You can extend this even further: Argo has native support for retrying failed steps based on the exit code of the previous attempt, which means we can improve our reliability with these extended features. Argo Workflows also has memoization and caching support, which we are working on implementing right now to help improve our execution time, which, as you saw, could use some improvement.

Argo Workflows also gives us immutability and simplified nesting of all the nodes in a workflow, which, as you saw, was an issue with our Spinnaker pipelines. When you submit a workflow to the cluster, Argo resolves the templates stored inside the cluster and stores them in the status field of your workflow, so they are immutable: what you expect to run when you submit the workflow is what actually runs, as opposed to the pipelines, where a change made in the middle might be reflected in the current execution. And the simplified nesting is great for visibility, because it collapses all of those child pipelines into one easily viewable object. Combined with the UI, this lets us see all of the stages and any failures for a particular add-on in one place. The UI also has other features we want to use, like filtering executions by labels, so you can see all of the executions for one add-on across all clusters in an environment, or all add-ons for one cluster, or any combination, which improves our visibility when we are doing things like releasing a new version of an add-on across our clusters.
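To give a rough picture of what those retry, memoization, and metrics features can look like on a single stage, here is a hedged sketch of one helper template; the template name, placeholder image and command, cache ConfigMap, metric, and retry policy are illustrative assumptions, not our production library.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: run-helm-upgrade          # hypothetical helper template from the library
spec:
  templates:
    - name: helm-upgrade
      inputs:
        parameters:
          - name: addon-name
          - name: target-cluster
      # Retry transient failures with backoff; the expression restricts retries
      # to a specific exit code from the previous attempt (illustrative policy).
      retryStrategy:
        limit: 3
        retryPolicy: OnFailure
        backoff:
          duration: "30s"
          factor: 2
        expression: "asInt(lastRetry.exitCode) == 1"
      # Skip re-running a stage whose result is already cached for this
      # add-on/cluster combination (illustrative cache key and ConfigMap).
      memoize:
        key: "{{inputs.parameters.addon-name}}-{{inputs.parameters.target-cluster}}"
        maxAge: "24h"
        cache:
          configMap:
            name: addon-provisioning-cache
      # Emit a Prometheus counter whenever this stage fails, feeding the
      # per-stage dashboards mentioned later in the talk.
      metrics:
        prometheus:
          - name: addon_stage_failures_total
            help: "Failed runs of the helm-upgrade stage"
            when: "{{status}} == Failed"
            labels:
              - key: addon
                value: "{{inputs.parameters.addon-name}}"
            counter:
              value: "1"
      container:
        image: alpine:3.19
        command: [sh, -c]
        # Placeholder command; the real stage runs helm upgrade against the target cluster.
        args: ["echo upgrading {{inputs.parameters.addon-name}} on {{inputs.parameters.target-cluster}} > /tmp/result.txt"]
      outputs:
        parameters:
          - name: result
            valueFrom:
              path: /tmp/result.txt
```

The retryStrategy expression is what enables retrying only on specific exit codes, and the memoize key is what lets an already-evaluated add-on and cluster combination be skipped.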
So we are still in the process of converting all of our add-ons and releasing this change across all of our environments, but these are the preliminary results we have so far. Four of our 19 add-on provisioning pipelines have been converted to workflows, and that has simplified 35 Spinnaker pipeline executions, the parents and the children, into four Argo workflow executions, one for each add-on, which helps a lot with performance and with visibility: there is no need to dig through child pipelines to see the entire add-on provisioning process. There is about a 20% improvement in add-on provisioning time so far; our cluster logging and metrics shipping add-on has gone down from about 20 minutes to 16 minutes to provision on a new cluster. And there is about a 50% improvement in the inner loop velocity for making, testing, and pushing a change in the dev environment, thanks to the kubectl-and-YAML editing flow I mentioned earlier compared with the old pipeline templating flow. We are still converting the rest of our add-ons and promoting across all of our environments, and we will keep measuring and looking for areas to improve as we do that. But now I want to hand it back to Mayank to describe some of our overall learnings from this process.

Thanks, Andy, for zooming into our Argo Workflows based add-on pipelines and sharing some early results that definitely look promising. As promised in our abstract, I wanted to summarize the steps we are taking to enable offloading of add-on provisioning for each EKS cluster to Argo. You can consider this a recipe for enabling similar scenarios in your organization. We first enabled provisioning of the Argo Workflows controller in each environment using Spinnaker. We also enabled archival and offloading of the Argo Workflow CRD status to Aurora DB, for scaling purposes and for the etcd limits Andy mentioned earlier. In order for our control plane clusters to be able to run Terraform and Helm against the various other clusters in each environment, we had to enable trust and permissions for the control plane cluster, and we also had to get security approval and fix any bugs that were found in that process. We have not enabled sharding yet, because our back-of-the-envelope calculation shows that a single control plane cluster can handle all the workflows for a given environment, but we are making sure the architecture has provisions to easily enable it as soon as we need it. In order to make add-on owners' lives easier when writing and maintaining their pipelines, our platform engineering team wrote a set of reusable workflow templates for the most common scenarios we identified and made them available to our add-on owners as a library. At this point, we were ready to start asking our add-on owners to begin converting their Spinnaker pipelines to workflow templates. In order to provide an example and a conversion guide, we even took the work for one add-on owner on ourselves, finished the conversion, and provided that as an example for other add-on owners to follow. One key thing that required our Spinnaker team's help was to integrate Argo workflow processing into Spinnaker. Spinnaker already knows how to deploy Helm charts, and our Argo workflow templates are just another Helm chart, but Spinnaker also needs to know when the Argo workflow processing is complete and whether it failed or succeeded, so that it can decide whether to move on to the next Spinnaker stage or halt. That work is complete and is what we use now.
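To give a sense of what that completion signal looks like, here is a trimmed, hypothetical view of a finished workflow object; the workflow name is made up, but the completed label and the status phase are the standard fields an external caller such as Spinnaker can poll.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: provision-cluster-logging-abc12      # illustrative name
  labels:
    workflows.argoproj.io/completed: "true"  # set by the controller once the workflow finishes
    workflows.argoproj.io/phase: Succeeded   # Succeeded, Failed, or Error
status:
  phase: Succeeded                           # mirrors the label; the field callers typically read
```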
We are also working on creating Grafana dashboards for per-stage and per-workflow metrics, so that we can give our add-on owners an easy way to develop, iterate, and optimize their workflows. The last two items are very much work in the early stages of the project, but we believe there is an immense opportunity to optimize and scale each workflow, since each workflow stage is essentially just a Kubernetes pod. For example, you can give more resources to a pipeline or a workflow executing in Argo, execute it with higher priority, skip executions that have already been evaluated, or, if you have identified certain special pipelines that need special processing, isolate the execution of those workflows to dedicated node pools or instance types (a small sketch of these knobs appears just below, before the Q&A). The last one, the native scaling through resources and priority, we have already touched on. There are many more steps we have left out, but hopefully this gives you a fair idea of what it took us to enable this for a scale like ours, and we are barely getting started.

I also wanted to give a summary of the key takeaways for all of us as a community, so that we can learn from each other and grow stronger together. The first is to focus on your organization's developer experience: measure and improve your service owners' experience in maintaining their services, and make it really easy to deploy them to production in a fast and reliable way. At the same time, use tooling to make it really hard for these service owners to make mistakes. Our service owners in this case are the add-on owners, and that is what we are doing for them. The second is to treat the CD pipelines that deliver your services to different environments as production services themselves. Measure their SLIs, publish their SLOs, and alert on them when they degrade or go down; they are equally important assets of your organization. As you saw in our previous slides, we focused on the CD pipelines that deploy add-ons, and we are actively working on improving their SLIs. Third, adopting container-native workflow engines like Argo and Tekton into your existing developer toolchains will make it easier and faster to deploy your services with confidence. You can think of these container-native workflow engines as mini developer workflow platforms that can unlock a lot of productivity for your teams. As Andy showed in the previous slides, we are already seeing better execution times, a faster inner loop, and a lot of additional benefits that we could not even imagine reaping from our old system. And the last one: service owners have widely adopted the Kubernetes resource model and various other cloud-native open source tools for managing their workloads in production on Kubernetes. Their existing development tools, like kubectl, OPA, and Helm charts, are now being leveraged for various newer scenarios, like managing their CD pipelines, managing different tasks exposed by various Kubernetes operators, and doing inner-loop development. Essentially what you are doing is meeting these developers where they already are, instead of giving them new tools and forcing them to move to proprietary solutions. Familiar, industry-standard tooling is easier to learn, maintain, and hire for, and most importantly it also improves the ecosystem. We believe that the Kubernetes resource model is the language that will bind the next generation of industry-standard tooling. With that, I wanted to say thank you for coming to our talk and giving us a chance to share our journey towards a container-native CD solution using Argo Workflows.
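As referenced above, here is a hedged sketch of the kind of resource, priority, and node-isolation knobs a special pipeline could set; the node pool label, toleration, PriorityClass name, and placeholder command are illustrative assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: provision-special-addon-
spec:
  # Workflow-level priority: when many workflows are queued, the controller
  # reconciles higher-priority ones first.
  priority: 10
  entrypoint: provision
  templates:
    - name: provision
      # Pin this special pipeline's pods to a dedicated node pool / instance
      # type; the label, toleration, and PriorityClass names are illustrative.
      nodeSelector:
        node-pool: addon-provisioning
      tolerations:
        - key: dedicated
          operator: Equal
          value: addon-provisioning
          effect: NoSchedule
      priorityClassName: addon-pipeline-high
      container:
        image: alpine:3.19
        command: [sh, -c]
        # Placeholder command; the real stage would run Terraform or Helm here.
        args: ["echo provisioning with extra resources"]
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "4"
            memory: 8Gi
```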
Hopefully you got something useful from our talk. We will have a QR code on the next slide where you can provide us with feedback, so please go ahead and do that. We will be taking questions now, and we hope that you have a great rest of your day and a great weekend. Thanks, everyone.

I wonder if anybody has a question?

Internal developer platform. Thank you. Two questions: is there any reason you are not using Argo CD? And for the apps that you are deploying into the clusters, are you doing validation that they are actually healthy?

I don't know if I heard the question properly, but was your first part why we are not using Argo CD?

Yeah. Or are you just using a standard container that just runs Helm directly?

Okay, so a lot of this is related to how our current infrastructure is built. It has a lot of Spinnaker that is already very deeply integrated into the rest of the infrastructure, and a simpler answer might be that our environments do not have access to Git, and there is an intermediate step that happens, so it is not easy to integrate Argo CD. Argo Workflows just fit the bill in terms of being a generic pipeline engine that can model whatever we were doing in Spinnaker. Does that help?

How long does it take you to migrate from some of the older workflows, that horrendous graph that we saw earlier, into Argo workflows, roughly per add-on? And how long does it take for you and the add-on teams to get ready for production?

For the online audience, the question is how long it takes for us to migrate the add-ons from Spinnaker to Argo Workflows and reach production. We are very early in that stage; we have not made it to production yet, we are in the dev environments. As we said, we are in the process, we are providing tools and conversion guides, and as Andy showed, we have converted four add-ons from Spinnaker pipelines. Andy, what would you say, after everything has been identified, the amount of time to just convert would probably not be more than maybe one or two sprints?

So, one to two weeks, but I speak as someone who has worked with this a lot. I think a lot of it would probably just be getting a bit more used to the syntax, versus Spinnaker pipelines that you have been working with for however many months or years and are more familiar with. A lot of it does translate over pretty smoothly, though, so I wouldn't say it is that much effort to get used to the syntax.

I think the last part is that, yes, we also need to integrate this with Spinnaker and transition it while the ship is running; we need to migrate these add-ons from Spinnaker to Argo while the whole ship is moving, so that is going to take a few releases, I believe.

We are out of time, so feel free to follow up with the speakers on any other questions you might have. Thank you everybody for coming, and thank you for presenting.

Thanks. Yeah, we will be hanging out around the room if anybody has more questions.