Hello, everyone. Welcome to ArgoCon and KubeCon. Happy to see you all here. It's time to start our presentation. As Nick said, we're going to talk about the lessons MasterClass learned from implementing preview apps using Argo CD.

Let's do a quick introduction. My name is Alexander Matyushentsev. I don't actually work at MasterClass; I work at a company called Akuity, which was founded by the Argo creators, and I worked with Paul to implement this solution. I'll let him introduce himself.

I'm Paul Phipps. I'm on the platform engineering team at MasterClass, and I'm excited to share some of the lessons we learned implementing this project with Akuity.

All right. As you can guess, Paul did all the hard work: he figured out how to use Argo CD to implement preview apps at MasterClass. My job is to give you the basics, the things we need to understand before we dive into the implementation details. The reason I have to give this context is that Argo CD itself does not have a feature called preview apps. But it is very flexible, and it consists of lots of building blocks that you can use in your organization to implement the preview apps use case. So I want to talk about those building blocks, then about the use case itself, and then Paul will take over and cover the implementation details.

The first building block we need to understand is the Argo CD Application, and I want to highlight a few important things about it. First of all, an Application is a custom resource stored in Kubernetes itself, usually in the Argo CD namespace. It's a configuration object that tells Argo CD where to get manifests and where to deploy them. The YAML snippet on the screen is an Argo CD Application. It has very few fields you must provide: a name, plus two parts, a source and a destination. The source is just a folder in a Git repository; the destination is a namespace in a cluster. So it's very simple. If you have ever gone through the Argo CD getting started guide, you know it suggests using the user interface to create an application: literally click the new application button, fill in those details, and the application is there. You can deploy manifests, deploy Kubernetes resources.

I describe this to highlight that even though it's simple, we do not recommend doing it this way in production. The reason is that even in the simplest possible scenario, you are going to have more than one Argo CD application, and the spec might be a bit more complex than just a source and a destination. On this slide I have the simplest real-life scenario I could come up with. Imagine you represent a small team that manages a single microservice, and you want to deploy it to production using Argo CD. Most likely you're going to have three or four Argo CD applications, one for each environment: in this case QA, staging, and production, and maybe even a couple of production clusters. So that's already four applications instead of one, and each spec is a bit more complex than just a source and a destination.
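To make that concrete, here is a minimal sketch of what one of those per-environment Applications could look like. The names, repo URL, and paths are hypothetical illustrations, not from the talk:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-qa    # hypothetical: one Application per environment
  namespace: argocd      # Applications usually live in the Argo CD namespace
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/gitops.git  # hypothetical repo
    targetRevision: main
    path: my-service/envs/qa       # the folder holding this environment's manifests
  destination:
    server: https://kubernetes.default.svc  # the target cluster
    namespace: my-service-qa                # the target namespace
```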
If you're still not convinced, here are more use cases. The first is environments, which we just talked about. The second is cluster add-ons, a very popular scenario for platform administrators: people use Argo CD a lot to deliver core components into all the clusters across the organization, and in this case you may need to create hundreds of applications. They all look almost the same, but there are a lot of them. The third is preview environments, the use case that matters most for this talk. Let's say you want to see your application in action before the code is merged into the source repo. Argo CD can help you there as well: it can orchestrate provisioning your application with an image built from code that is not yet merged, and Paul will describe that in detail very soon.

And if I still have not convinced you, here's a bit more detail about those three top use cases. Environments are usually more complex than they sound, and it depends on the config management tool you're using. For example, if you're using Helm, you need to explain to Argo CD exactly how manifests should be rendered, and at a minimum you need to provide an environment-specific values file. Cluster add-ons are a whole big story that deserves its own talk, but basically you usually want to create a set of applications for each cluster, where the set is inferred from a Git repo based on the list of files or directories in that repo. And preview environments are highly opinionated: there you need to get the list of pull requests from your Git provider.

So hopefully I did a good job of convincing you that a solution is required to automate application management. Here's one solution that has existed in Argo CD from day one, from the moment of its creation: it's called app of apps. I'll briefly describe it. App of apps just refers to a pattern; it's not a feature of Argo CD. If you remember, on the first slide I mentioned that an Argo CD Application is a custom resource, which means Argo CD itself is a perfect tool to deliver that custom resource into the cluster. This pattern was born in the community: our users figured out they could create a Git repository, store YAML files with Application manifests in that repository, and then use Argo CD to deliver those Applications into the Argo CD namespace. The next step was realizing it's possible to use templating and save a bit of manual work. For example, you can use Helm to templatize your Application specs and just get the list of applications from the values.yaml file. It works perfectly fine; it's been used in production by many organizations for years. But there are caveats; it's still not perfect.

Now we're getting close to the final solution. The caveats are, basically, that you still have to maintain the source of truth manually. You need to maintain the list of your applications and their properties in a values file, and if you want to make it dynamic, that's just not possible, because the data you need to derive the list of applications is stored somewhere else. If you're managing environments, the source of truth for your environments is the list of Kustomize bases in your Git repository. Let's say you have a single app with QA, prod, and staging environments; most likely you have a separate folder with Kustomize settings for each in your Git repo. That alone is enough to derive the list of applications, but you're still forced to update the values file accordingly to let the app of apps know about it.
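As a rough sketch of that pattern, an app-of-apps Helm chart might look something like this, with the values file serving as the manually maintained list. The file layout, repo URL, and entries are hypothetical:

```yaml
# values.yaml -- the manually maintained source of truth
applications:
  - name: my-service-qa
    path: my-service/envs/qa
    namespace: my-service-qa
  - name: my-service-prod
    path: my-service/envs/prod
    namespace: my-service-prod

# templates/applications.yaml -- renders one Application per entry above
{{- range .Values.applications }}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: {{ .name }}
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/gitops.git
    targetRevision: main
    path: {{ .path }}
  destination:
    server: https://kubernetes.default.svc
    namespace: {{ .namespace }}
{{- end }}
```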
Add-ons and preview environments are even more complex cases, because you literally need to get data from different sources. You need the list of pull requests and the metadata of each pull request, or the list of all the clusters connected to Argo CD, and there is no way to package all of those capabilities into a generic config management tool. That's why we came up with what we think is a better solution: ApplicationSet. I want to describe what an ApplicationSet is, and then Paul will explain how to use it to organize preview app generation.

An ApplicationSet is a custom resource whose sole purpose is to automate the management of Argo CD Applications. First, it overlaps a bit with Helm in that it provides templating: you can define a template of an Application spec and reuse it with data pulled from somewhere else. The next, and most important, feature is the capability to pull that data. To make it easier to understand, here is an ApplicationSet that produces a single Application for each and every cluster configured in Argo CD. You can't see my mouse, but the property called generators is the source of the dynamic data. In the generators definition you can plug in various sources; in this case it's just the clusters. So it's a fairly simple ApplicationSet: it iterates over all the clusters in Argo CD and produces a set of Applications, and every time there is a new cluster, it runs again and produces a new set of Applications.

Okay, that's all the theoretical knowledge we need. Now it's time for Paul to describe the implementation.

Thank you, Alex. That was a great foundation, but let's first talk about why preview apps. There can be significant differences between our local dev environments and production environments, and our goal was to mimic production as closely as possible in order to reduce risk. We use our preview apps for both feature validation and performance testing to accomplish that. Another issue is that feedback collection can be difficult without preview apps: there can be hardware, network, and security issues with sharing your local dev environment remotely. Our preview apps were meant to allow easy remote collaboration between team members. And finally, the local development experience is becoming increasingly difficult for us to support as we add new applications and services. While we haven't completely broken down our monolith, we built our preview apps to be able to scale with the new applications and services we're building.

Let's start with how we're using app sets. Having an application monorepo made it an easy decision to use a single app set for all of our applications. If you have multiple app repos, you could implement multiple pull request generators to accomplish this, or you could use multiple app sets. But it can be difficult to message back into the pull requests from the app set, because the context of which generator triggered the preview app request isn't shared downstream. The pull request generator also doesn't currently support filtering based on paths in your application repo. To solve this, we found the best solution was pull request labels: a CI process adds a label to our pull requests whenever specific application directories change. We also implemented a matrix generator so that we could add the cluster generator; that was to simplify migrating the apps to a new cluster whenever we needed to make a cluster replacement.
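Here is a minimal sketch of the shape such an app set can take, combining a cluster generator and a label-filtered pull request generator in a matrix. The org, repo, label, registry, and paths are hypothetical, not MasterClass's actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-apps
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          # One parameter set per cluster labeled for previews.
          - clusters:
              selector:
                matchLabels:
                  env: preview
          # One parameter set per open PR carrying the CI-applied label.
          - pullRequest:
              github:
                owner: example-org
                repo: app-monorepo
                labels:
                  - preview-my-service
              requeueAfterSeconds: 300
  template:
    metadata:
      name: 'my-service-pr-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/gitops-monorepo.git
        targetRevision: main
        path: preview/my-service
        kustomize:
          nameSuffix: '-{{number}}'   # suffix every resource with the PR number
          images:
            # Point the app at the CI-built image tagged with the short SHA.
            - 'registry.example.com/my-service:{{head_short_sha}}'
      destination:
        server: '{{server}}'              # from the cluster generator
        namespace: 'preview-{{number}}'   # one namespace per pull request
```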
The pull request generator provides multiple template values that carry context from the pull request into the application; here's an example. We're using Kustomize here. You could probably also use Helm, but we didn't explore that use case since we're already using Kustomize for all of our applications. We give all of the generated resources a name suffix with the pull request number and some common annotations, and we provide multiple app images enriched with the short commit SHA. We also specify a namespace with the PR number in our destination section, so that all generated resources are launched in that namespace.

From there, we needed to get all of that context into each application in a useful way. To do this, we used Kustomize multi-base to reference the resources in each application directory in our GitOps monorepo. We also reference some special tasks to provision our data sources and, in some cases, seed them with mock data, and we have some third-party integrations that need to be configured.

One of the biggest challenges we faced initially was getting the pull request context deeper into the app configuration, including being able to configure ingresses, deployments, and jobs. The solution we found, with Akuity's support, was Kustomize replacements. With replacements, we're able to source the value of a common annotation on a resource and replace a target value with it. In the example here, we use a delimiter to prepend a subdomain onto an existing root domain, and in another example we replace the name of the namespace that's created. It might be a bit easier to see them at work here: we're using a delimiter replacement in three cases to create hostnames, and a full value replacement for the target service of this ingress definition. We use replacements in many places throughout our preview apps, and we even use them to set application environment variables that require context from the pull request.
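To give a concrete flavor of those replacements, here's a minimal sketch; the annotation key, domain, and resource names are my own illustration, not MasterClass's actual config:

```yaml
# kustomization.yaml for a preview overlay -- a hypothetical sketch
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../apps/my-service        # the application's base manifests
commonAnnotations:
  preview.example.com/subdomain: pr-1234   # set per PR by the app set template
replacements:
  # Source the annotation value and prepend it as a subdomain on the ingress
  # host: placeholder.preview.example.com -> pr-1234.preview.example.com
  - source:
      kind: Ingress
      name: my-service
      fieldPath: metadata.annotations.[preview.example.com/subdomain]
    targets:
      - select:
          kind: Ingress
          name: my-service
        fieldPaths:
          - spec.rules.0.host
        options:
          delimiter: "."   # split the host on dots...
          index: 0         # ...and replace only the first segment
```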
Another Argo CD feature we make significant use of is sync waves. We use negative numbering to ensure that if new resources are added without a defined wave, they end up in the final wave. We initially provision the essentials for our apps, including our namespace, secrets, RBAC, and data infrastructure. We then provision what our applications need to run, including mock data, and then we run more routine jobs such as new migrations or refreshing our experimentation cache. And finally, our deployments, services, and ingresses are provisioned.

We had some tough decisions to make when it came to our data infrastructure, and I'm still second-guessing our approach. Crossplane is great for provisioning cloud resources: they're similar to the resources we use in prod, and it supports multiple use cases. The issues we've seen are slow provisioning times, occasional cloud provider API rate limits, and higher costs than the alternatives. The next option we considered was centralized, shared cloud resources, using namespacing to differentiate between applications. This provided the same cloud resources we use in production, with better performance and far less provisioning time. The issue is that some data stores don't support the concept of namespacing at the framework level, and managing the lifecycle of that data adds complexity. Finally, we considered deploying our own resources, as that would be the least expensive and provide fast provisioning times, but there was a higher risk of performance differing from production, and the configuration of each platform would have to be maintained. In the end, we went with Crossplane for some use cases and shared cloud resources for the use cases that supported that option. If I were to go back and change something, I would likely self-deploy some of our resources rather than use Crossplane.

An issue we discovered along the way is that Crossplane reports healthy to Argo CD as soon as the provisioning API request to the cloud provider succeeds. This meant that our sync would move on to the next wave before our infrastructure was ready, and our jobs would start to fail. Alexander pointed us in the right direction: custom health checks resolved this issue.

We also had a decision to make when it came to secrets. External Secrets Operator is great, as it integrates with the top secrets management tools and automatically refreshes secrets when there are changes. For preview apps it's less of a good fit, because it requires a one-to-one relationship with the object in the secrets manager, and after multiple rounds of testing, we found that in every scenario where we tried to decouple that behavior, our local changes to secrets would eventually be overwritten by the operator. So we ended up using External Secrets Operator to sync a secrets template into a unique namespace, and then created a job, with the RBAC it needed, to copy the secret values into our preview app namespace. This solution gave us the benefit of devs being able to update the templates with tools they're familiar with, and every preview app created after those updates pulls the latest values. And just to show how simple some of these jobs can be: we just wrote a shell script with some kubectl commands to copy our secrets, and you can use the Alpine Kubernetes image to run it.

We looked into two different options for building images. Building in CI would give us parity with production; however, with this option the build context isn't available in Argo CD. The other option was to build the image in the cluster, which would provide that context in Argo CD; however, a failed build could be retried continuously if you're using self-healing, and it would be difficult to provide the same visibility into build errors as a robust CI system. We decided to use CI to build images, and quickly found an issue: the sync wave is unaware of a build failure. Migration jobs would end up in ImagePullBackOff, and the sync would not fail. What's worse, when devs pushed a fix and their build succeeded, the app set would not sync those new changes, because it was still waiting for the previous sync wave to complete. There is no timeout for a sync, so it will run until manually interrupted, even if the app is marked for deletion. This led us to our current design, where we added an init container to any job that requires an image we're building. The job is continuously retried through self-healing, but that process is very lightweight and very inexpensive to run.

We implemented jobs in two different ways: jobs that run only once, for tasks such as mocking data, and jobs that run on each sync, for tasks such as migrations. We used the sync options annotation to set Replace to true for the jobs that should run on each sync.
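Here's a minimal sketch of what a run-on-every-sync migration job can look like with those annotations; the image, command, and names are hypothetical:

```yaml
# A hypothetical migration Job that is recreated on every sync.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-service-migrate
  annotations:
    argocd.argoproj.io/sync-wave: "-2"             # run before the app's deployments
    argocd.argoproj.io/sync-options: Replace=true  # delete and recreate on each sync
spec:
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/my-service:abc1234  # CI-built, short-SHA tag
          command: ["./bin/rails", "db:migrate"]
```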
We found a big issue with one-time jobs when we were initially making changes to common annotations in the app set. Since the new annotations would be applied to jobs that had already run successfully, the jobs would end up in an out-of-sync state. This also meant we couldn't use any app set template values in common annotations that would change, such as commit SHAs. We made sure to avoid that situation early on, but the only way we found to resolve it when we hit the issue was wiping the data sources for all of our preview apps and rerunning those jobs. Never an ideal scenario. We also found there is no option for a deletion hook, but we'll get into lifecycle management a bit later.

We found the easiest way to set up networking between the apps in a preview app namespace was to use Kustomize replacements to set env vars in our deployments. We also want to expose these apps so that teams can collaborate with limited friction. External DNS creates a public DNS entry that is continuously updated with the cloud load balancer endpoint if it changes.

We also expose the app status and critical information back to developers through updates to the pull request, via the Argo CD notifications controller. Here's an example of some of our notification templates; in the top one, we're adding a pull request comment through a webhook. We use quite a few template values here so that we can use all of the pull request data we need to ensure the comment ends up in the right PR. We also craft unique URLs for multiple services to simplify access to the preview app, including access to the app dashboard in Argo CD and the app logs.
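To give a flavor of what such a template can look like, here's a minimal sketch of a PR-comment notification in the argocd-notifications-cm ConfigMap; the service name, org, repo, annotation key, and URLs are hypothetical, not MasterClass's actual configuration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # A webhook service pointed at the GitHub API; the token is resolved
  # from the argocd-notifications-secret.
  service.webhook.github-pr: |
    url: https://api.github.com
    headers:
      - name: Authorization
        value: Bearer $github-token
  # Comment on the PR whose number was stashed in an app annotation.
  template.preview-app-ready: |
    webhook:
      github-pr:
        method: POST
        path: /repos/example-org/app-monorepo/issues/{{index .app.metadata.annotations "preview.example.com/pr-number"}}/comments
        body: |
          {"body": "Preview app ready: https://pr-{{index .app.metadata.annotations "preview.example.com/pr-number"}}.preview.example.com"}
  # Fire once the app is synced and healthy; apps opt in via a
  # notifications subscription annotation.
  trigger.on-preview-ready: |
    - when: app.status.operationState.phase in ['Succeeded'] and app.status.health.status == 'Healthy'
      send: [preview-app-ready]
```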
When the collaboration and PR reviews are done and our work is merged, or the pull request has gone stale, we need to tear everything down. We currently use cron jobs for all of our lifecycle management. One area we may consider in the future, to keep our solution more native to the app set, is finalizers. That would require creating a controller to identify the specific resources being finalized and complete the deprovisioning tasks before deleting those resources. We don't have much incentive to pursue this path currently, since cron jobs are sufficient, but future needs may make it more appealing. We also use some other services in our implementation. I won't go into any detail on these, but if you have questions, I'll be happy to chat with you after the talk.

Now for the results. Was this a valuable effort, and how has it impacted our team? So far we've served over 1.8 million requests, and our metrics show that over half of the apps we generate are used by our team for testing and validation. We've received very positive feedback from our devs and have spent only a few hours maintaining these over the past six months. I sourced some comments from our developers this past week, and here's a short summary of their feedback. As you can see, our developers have found preview apps to be an essential and time-saving part of their workflow. They help de-risk big changes, since synthetic testing and PR reviews can start early in a production-like environment, and visibility into issues can be easily accessed from the pull request. Our primary complaint is the build time. Since we build on each commit, we're currently working with our teams to improve those build times, and we're looking into hot reloading as an option to consider.

Let me see if I can get a demo to work; I'm going to do an actual demo of the product. So here's our homepage. I think we can make some improvements. I spun up this PR this morning, and I got my comment: my preview app is ready. I can also double-check. Let's go see if this loads. "Learn from the best at ArgoCon." I think that's a big improvement.

We have about a minute and a half for any Q&A. The mic is up in the middle there if anybody's interested.

So how did you deal with data parity? You said you wanted to keep production-like data all the way down to your preview apps. We've had some problems with data sanitization, with keeping the database up to date, and then Rails migrations add another complexity aspect to it. So I was wondering if you used a tool to sanitize the data, or if you just seeded a blank database?

We already maintain those kinds of scripts: we run a sync from our database and truncate and anonymize it every night. We were already doing that for local dev, so we just use the same sync for these apps.

Cool. And then, you're maintaining a single control plane with namespaces as your multi-tenancy aspect, essentially? Or is it one Kubernetes cluster per... how is it?

We have a single cluster with multiple namespaces, and each unique namespace is named by the PR number.

Okay, cool, thank you.

We have time for one more question, one quick one, please.

Yeah, quickly: are you using the PR preview for cluster add-ons, like cert-manager or anything like that?

Can you repeat that?

Are you using the PR generator for cluster add-ons, like controllers or anything like that?

We aren't using it for any of that, just our microservices; we're only using it for applications right now.

Okay, and have you had any problems with... I'll stop, I said one. Thank you.

I'm happy to chat with you after. Yeah, we'll be happy to chat in the hallway.