My name is Larissa. I'm a software engineer in the Adobe Experience Platform group, where I've mostly been developing backend services. So I'm an application developer, but I've always had an interest in automation and in making it easier for application developers to experiment with new stuff. And I have here my colleague.

Hello, hopefully you can hear me. My name is [inaudible]; I also work for Adobe, as an architect. I wanted to say that I'm really psyched to be here, and I've had the chance to witness some really great talks at KubeCon. Today we're here to talk about some of our experience with GitOps at Adobe. Take it away.

Our story towards GitOps started in 2019-2020 with the migration to Kubernetes. Back then, and this is also true today, teams could get direct access to clusters, so they started bringing and maintaining their own deployment tools, such as Spinnaker and Jenkins. Maybe you've had this problem too if you're coming from a big company; I heard it a few times this week that people are consolidating their CI/CD on top of a single tech stack. Adobe wanted to do the same, so last year this migration towards Kubernetes-native patterns and tools, such as GitOps and Argo CD, started.

The thing is, before Argo CD we had built fully automated deployment pipelines over and over, pipelines that would deploy to production in a safe, quick, and recoverable manner. We developed those practices, and we wanted to take them with us on this GitOps journey. So I'm going to spend some time going through some practices that are foundational for our CI/CD, and then we'll get to the main topics of this talk: some tips about the challenges we ran into, and some tips about managing infrastructure.

The first foundational practice is that we do trunk-based development.
We have a single long-lived branch, and with GitOps we handle environments as directories in that branch.

The second practice is that we promote our code automatically through environments: we deploy to dev, then to stage, then to prod, stopping the line if something fails. Now, if environments are directories in Git, promotion becomes a matter of orchestrating commits between Git paths. And although there is no automated promotion in Argo CD, and no automated rollback either, we do that with Argo Workflows.

Also, if something fails, I don't just want to stop the line; I want to recover as fast as possible. We want to instantly roll back, and if promotions are Git commits, rolling back is about reverting those Git commits. On the right side of the screen you can see how a rollback actually looks for one of our repositories today. I'm only briefly going through this because we already gave a talk at ArgoCon last year where we went into detail about all these practices, especially this one, and if you want to see what the workflow actually executes, I recommend you go and watch that one.

Another practice we've been doing, and have been migrating to Argo, is deploying pull requests into preview environments. This is very useful because we can test features in parallel and in isolation while working as a team. It also opens the gate for auto-merging updates, such as library updates with tools like Renovate, because we can make sure those changes are deployed and tested in preview before they reach the main branch.

The last two practices are all about minimizing impact.
We have wave deployments, which we do with an Argo Workflows template, and the last one is Argo Rollouts. Rollouts actually helps us a lot, because we can now bring progressive delivery to a greater set of microservices; it was harder to do this setup before with Spinnaker and Kayenta.

Argo CD and GitOps are great tools for creating and orchestrating continuous deployment. However, there are some things that came out of the box with tools like Spinnaker which require work in Argo, and for that work we needed to come up with some creative solutions. I'm going to walk you through some of the tips we have from the use cases we've had in Adobe Experience Platform.

The first one is about service dependencies. In practice, you'll often need to deploy applications in a particular order: you might want your infrastructure, or the monitoring stack, or the secrets operator to be deployed first, and then the application. I have here a very simple example with a secrets operator. You can do this with sync waves, by the way: the secrets operator is marked with sync wave -1, and my application doesn't specify a sync wave, which places it in sync wave 0. It's fairly simple: Argo CD is going to deploy my secrets operator first, and when that's healthy, it's going to continue with my application.

Okay, that was simple; let's add some complexity with hooks. Hooks are part of different synchronization phases than the sync phase you just saw: you can have multiple PreSync hooks, then the sync phase, and at the end multiple PostSync hooks. I added here a PostSync job named test; it's not a random name, and we'll see why in a few seconds. Let's suppose this job spins up a pod that runs some tests, and the tests are failing. Contrary to expectation, the Argo CD app health will not reflect that job's status.
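A PostSync test job of the kind just described might look like this; a minimal sketch, where the image and command are placeholders:

```yaml
# Runs after the sync phase completes. The delete policy removes the previous
# run before creating a new one, so the Job stays a transient hook and never
# becomes part of the desired state.
apiVersion: batch/v1
kind: Job
metadata:
  name: test                      # the non-random name from the talk
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: functional-tests
          image: registry.example.com/functional-tests:latest  # placeholder
          command: ["./run-tests.sh"]                          # placeholder
```

Even if this job fails, the application's health status can still report Healthy, which is exactly the surprise discussed next.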
The app will say "I'm healthy, I don't care about the job," because hooks are not desired state. Hooks are just scripts you can use to achieve a desired state; they are transient, created and removed over and over, but they are not desired state. Moreover, for the same reason, if you have only hooks defined and no actual state to sync, the hooks will not get executed.

But I still have this use case of running tests. At Adobe, I actually have the use case of running post-deployment tests: functional, integration, and so on. And you might say, wait a second, you talked about Argo Rollouts; why would you want to run these post-deployment, when you have Rollouts and can run them pre-deployment or even during the deployment? That works fine for production. For the other environments, you don't want to replicate all those canary steps: you want to reduce the impact and the overall deployment time. And perhaps the better reason is that Argo Rollouts doesn't really work yet with streaming applications, and in Adobe Experience Platform we have a lot of streaming applications.

So what can we do? The first option would be to assert that health outside Argo CD. Argo CD is going to say "I'm healthy," but it's going to output some extra information about the phases that failed, in particular this hook. I can take that output and parse it in another CD tool, which is my Argo Workflow, and decide based on that. This is what we are actually doing: the Argo workflow parses this output, and in this case it will stop and roll back.

Another option would be to make the test desired state. As you can see, I removed the sync hook and I have a sync wave here. This is very tricky to do, honestly, because Jobs are meant to be one-time events in Kubernetes; to have them synced every time, you need to remove them beforehand.
You can achieve that with a PreSync hook, at least until Argo CD adds better support for replacement of resources such as Jobs. However, I would not recommend this unless you have no other option.

We have other use cases around environment configuration. I think in the last year GitOps users were very concerned with changing the source code and promoting the new container image through environments. You saw that we are doing that with Argo Workflows; there are teams doing it with GitHub Actions in combination with Argo CD Notifications, or in combination with Argo CD Image Updater, which you should always use with Git write-back, always with Git write-back, to be GitOps-friendly.

But there are other changes my application is subject to, such as business configuration; the most common example here is adding a new environment variable. And I often need to change deployment configuration: the underlying Helm chart or the Kubernetes manifests. So why do I have this issue of making these changes in config? There is a practice recommended by the Argo CD community that you should separate your source code from your configuration code. There are various reasons you might want to do that: you don't want to trigger unnecessary CI builds if you changed only the config, or maybe you want to enforce ownership because you have different teams, and it's easier to do that with two repositories. However, there are ways to address those issues.
You can filter CI events so they don't trigger unnecessary builds, and you can use code owners to achieve that kind of ownership at the folder level. But here is the problem I have with the two-repository approach. Let's say I need to change the source code, but I also need to add an environment variable. Once I merge the pull request changing my source code, that change, that container image tag, is going to be promoted automatically up to prod. Whereas to change the config, I need to do another PR in another repository, and PRs are manual; they break my continuous deployment pipeline.

So the struggle here is putting everything into one single release, and I want to emphasize how important that is. It's especially important when you are migrating off a tool like we are, off Spinnaker, which had this out of the box. I actually heard a talk from Snyk earlier this week about the same issue, and they mentioned the cognitive load that handling multiple repositories puts on their developers.

So you can say to me now: okay, you have ways of addressing those initial issues with the two-repository approach, so do it in a single repo. And I can do that; I can now atomically change both config and source code. However, I have another issue: my CI and part of my CD start at the same time. The config is going to get shipped immediately after I merge it, while my source code is still being built by the CI. But you can say: okay, you presented us hooks; why wouldn't you use a PreSync hook to just wait for the CI before triggering the sync? And I can actually do that. It might take me some time, because it's not that easy, but let's say I can do it. However, I have other issues, which revolve around code and config organization.

Okay, I hope this shows. So with environments per folder, you need to define a Git state for all of your environments. A disclaimer here:
We are using Helm, and in Helm the templates folder is where we put our manifests: our Deployments, Services, and so on. The duplication here is obvious: I have templates all over the place. So what I can do is use an umbrella chart instead. The idea is to define the templates in a single place, at the bottom of my screen, define the chart in a single place, and then reference it from all the other charts that render environments. I have here an example of a Helm chart we actually use in one of our repositories; you can see that it uses dependencies to reference other versions.

So I've kind of solved the duplication problem, but I have another problem, which is about overlays. Using Helm or Kustomize will allow you to define manifests without repeating yourself, right? As you can see here, I have a base values file, and as I go through the hierarchy of folders I define more specific overlays, or values files. This is fine because I want to stay DRY, don't repeat myself. But this isn't quite compatible with promotion: if developers want to make a change for all the environments, they will make that change at the top level, at the base level, because it makes sense and they want to comply with DRY. However, that will trigger a reconciliation and a deployment in all of the clusters, including production, bypassing promotion. And this is a big no-no: I don't want to deploy directly to production.
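The base-plus-overlay values hierarchy described here might be laid out like this; paths and keys are illustrative:

```yaml
# envs/values.yaml -- the shared base. A change here re-renders manifests
# for every environment at once, which is what bypasses promotion.
replicaCount: 2
image:
  repository: registry.example.com/my-service   # placeholder
  tag: "1.4.2"
---
# envs/dev/values.yaml -- a more specific overlay merged on top of the base.
replicaCount: 1
---
# envs/prod/values.yaml -- prod-only overrides.
replicaCount: 6
```

Each environment directory renders base values first and then its own overrides on top, which is exactly why a base-level edit reaches production immediately.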
So There is a pattern documented from Codefresh actually saying that you can do this change very specifically and You do that for all the environment and at the end in the last commit you can just deduplicate basically the the common configs in the base And this last commit it will not trigger another deployment because basically the reconciled manifests are the same nothing changed However, I find it hard to implement this especially in our organization because again developers need to need to be trained to know how to do this incrementally and This is a very different way from what they did before Argo CD and of course I have multiple commits here multiple pull requests and This is this also breaks my continuous deployment So the conclusion we came up with is that we cannot use the git environment State to neither declare charts nor to Organize them with overlays. So what we do we declare charts and overlays next to the source code So that the so that the developer can evolve this together, however this First repository is no git state. There's no Argo CD controller watching and applying that When I do a change here, I have An automated process, which is my CD my Argo workflow That's going to mirror these changes in an actual git state and during this mirroring is going to promote also So on the right side of the screen. I have the actual git state what the Argo CD controller is actually monitoring And this repo is just for deployment purposes and should be a changed only through automated tool tools So how Does my CIT the pipeline look like when I do this change? 
So how does my CI/CD pipeline look when I make such a change? Now I don't only create a new container image: in my CI I also push a new chart version, and I pass around the revision SHA for that commit. In my CD workflow, every promote step promotes the image, promotes the chart, and copies the specific configuration for that environment, and it does that for all the other environments, ensuring testing and automated rollback if needed. One note I want to make here is that we actually support only two overlay layers: the base values file is promoted together with the chart through environments, and the per-environment ones are basically copied around. That two-layer limitation exists because we are still using Helm in the other repository; if we switched to rendered manifests there, we could have many more layers on this side. Thank you.

Okay, so I'm not sure if you picked up on it, but we're really big on CI/CD. We believe in having these fully automated promotion workflows starting from the moment a developer commits a change, basically from when a PR gets merged. The reason is that we're trying to have a streamlined CI/CD process that helps us react quickly to any change in requirements, so that whenever a change is needed, we quickly iterate on it. We have a CI/CD workflow that makes us confident about the feature we're building, and we ship it directly to production. So every commit basically gets to production as fast as possible.
There's no one-time release happening every week or every month; we just deploy every commit to production. That's important context, because we're also going to talk about how we manage infrastructure.

The thing is that today we have this workflow for deploying applications, but infrastructure is treated separately, and maybe this will look familiar to you. For us it's typically been about using Terraform. Adobe is a large company, so we can't really speak for the entire company, but at least in our group, which is about 600 engineers, most teams use Terraform. Whether they're using Terraform with a local feedback loop, trying things out and then doing a plan and an apply, or whether the changes get picked up by some sort of CI. We used Spinnaker in the past, so once the Terraform changes were in Git, there would be some sort of CI checking those changes and then applying them automatically. And with the adoption of Terraform Enterprise, which happened in our group, we had a better framework for making these changes, because the roles of developers and operations became much more well-defined in terms of responsibilities, and the process grew not only in structure but also in security.

However, we still have a problem. The main problem today, for me as a developer, as an architect working with a few teams, is that we have one workflow for provisioning infrastructure, and it doesn't really matter if it's plain Terraform or Terraform Enterprise with different workspaces. We have this workflow that is separate, distinct, from the workflow to build and deploy applications. And the
problem is that they have to meet in the middle. This leads to a bunch of problems that I've seen many times. One is that it's quite inefficient: you have various teams, or sometimes the same team with different roles, that handle one workflow or the other, have to work together on both, and have to meet in the middle. Another is that the risk of errors is higher, because the two are not developed at the same time and not deployed at the same time. It also leads to a longer feedback loop, which is quite a problem because, again, we're big on CI/CD; if the feedback loop grows, it's not really effective, and you're missing out on the whole idea of CI/CD, right?

And finally, something I've found happens very often, and maybe you've seen this before: as a large company, we've been using a lot of tools, many tools. I think it's great that the ecosystem has so many tools and so many ways to build things. The problem is that sometimes you have to not only understand and be able to use, but also grasp the complexity of, so many tools that it becomes unmanageable. At least if you look at the organization from a scale perspective: how much can you scale your development if every team has to understand not only Jenkins and Spinnaker or Argo CD, but also Kubernetes and the many other technologies they use? It becomes unmanageable at some point. This is what we've been facing, and these have been some of the big problems for us.

So I've been looking, and working with some folks, to figure out: okay, what would be a common approach?
What would be a way for us to have a single, unified approach for developing and deploying not only applications, but infrastructure as well? The idea is that we can iterate faster in the end, and if we iterate faster, we actually increase quality, we have a faster time to market, and we respond faster to customer needs, which in the end is the goal for us all.

So what we're saying is: hey, we should have a single, unified CI/CD process, a single way to make things, and then a single source of truth for deployments. And I think you're sort of picking up on where I'm going with this: we're looking at Crossplane. Crossplane is a way to transform your Kubernetes cluster into a control plane, and the mantra of Crossplane, which I even heard in a presentation yesterday, is that as long as there is an API, you can configure Crossplane with a provider to connect to it; you can order pizza, right?

For those of you who are not familiar with it, a simplified view is that in an application you can use what Crossplane calls managed resources, typically via claims. Crossplane watches those claims and, through the use of providers, connects to, let's say, public cloud infrastructure and stands up your infrastructure. That's the gist of it. Of course, in reality the model is a bit more complex.
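A developer-facing claim of the kind just described might look like this; the kind, API group, and parameters are hypothetical, since claim types are defined by each organization's own Crossplane XRDs:

```yaml
# A claim: the simple abstraction a developer deploys alongside the app.
# Crossplane matches it to a composition, which expands it into the
# actual cloud managed resources.
apiVersion: database.example.org/v1alpha1   # hypothetical group, defined by an XRD
kind: PostgreSQLInstance                    # hypothetical claim kind
metadata:
  name: audit-db                            # illustrative name
  namespace: my-service
spec:
  parameters:
    storageGB: 20                           # hypothetical parameter
  compositionSelector:
    matchLabels:
      provider: azure                       # selects the Azure composition
  writeConnectionSecretToRef:
    name: audit-db-conn                     # connection details land in this Secret
```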
You can have a separation of roles in the company: operations people work on compositions, and developers use a simpler abstraction via claims.

The way Argo, which we have been migrating to, and Crossplane work together is that we keep the application manifests, like the deployment configuration, but also the infra manifests, like the infra provisioning claims, in the same repo, and then we have Argo CD sync and deploy that, just as usual; it's no different from what we've seen before. That deployment happens in a different cluster, where we actually deploy the application, and it contains the Crossplane manifests, which Crossplane picks up and uses to actually provision the infrastructure in our public cloud.

So I'm going to walk you really fast through a simple example. Let's say we have a REST API that collects audit events and then exposes some sort of API to query them. We're going to use Postgres for this, it's a simple example, let's say Postgres in a cloud, and then we're going to put a firewall in front of it so that only certain IPs can connect to it. The way we would do this, like Larissa was saying earlier, is with Helm: we would extend the Helm chart with a few Crossplane resources. We define a resource group, a server, maybe a database, and it would look like this. I'm using the Upbound provider for Azure here, as you can see; we're basically standing up a Postgres flexible server.
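A sketch of such managed resources using the Upbound Azure provider; names are illustrative, and the API versions and fields are abbreviated and may differ across provider versions:

```yaml
# Resource group for the example infrastructure.
apiVersion: azure.upbound.io/v1beta1
kind: ResourceGroup
metadata:
  name: audit-api-rg                      # illustrative name
  annotations:
    argocd.argoproj.io/sync-wave: "-1"    # provision infra before the app
spec:
  forProvider:
    location: "West Europe"
---
# The Postgres flexible server, referencing the group above.
apiVersion: dbforpostgresql.azure.upbound.io/v1beta1
kind: FlexibleServer
metadata:
  name: audit-api-db                      # illustrative name
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  forProvider:
    location: "West Europe"
    resourceGroupNameRef:
      name: audit-api-rg                  # resolves to the ResourceGroup above
    version: "14"
    storageMb: 32768
    skuName: B_Standard_B1ms              # illustrative small SKU
```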
I'm also making sure that You know I get the cross plane resources to be synced before the application So I'm using sync ways with Argo and Hopefully you can see this So this is Argos my Agro CD application trying to sync And what you can see here is that I have my let's say server resource That's in fact managed by cross plane and it shows up. It's synced everything's good. I also have my DB which has been You know synced so fine and then my deployment is Progressing I have a pod that's being stood up If I look at the server, you know the details it shows up as okay in Argo But if I look at my pod shows up as degraded and then If I go into the logs, I can see that it actually times out and the reason is that even though My servers seems up to be you know have been provisioned by Argo and cross plane In reality It's only been requested so the server has only been you know requested for provisioning to cross plane But in reality hasn't been yet Created and the main reason is that Argos CD will actually look when it you know determines the The health of an application. It actually looks at the state of the cross plane resource So the you know the resource here is only in the cluster, but not at the actual provision infrastructure So your server might still be you know Being raised while the cross plane resource will say everything's good to solve that I don't think you can see it, but I'll try to walk you through it to solve that. We're using custom resource health checks in Argos CD so you can basically configure Argos CD in its config map or you can even build Argos CD with it and Then we're taking some of the resources that we're using and adding custom health checks for them So that depends on the way the provider works But we're basically picking up on some of the events that are being sent and now some of the status is for instance you know if the Status is true for the event ready. 
It means that the Infrastructure has actually been provisioned and you know we can also treat some error scenarios And if I try again this time with this configuration updated in Argos I can see that now my application shows up as out of sync while it's waiting for the database server to be provisioned and Then here I can have the firewall. I see the firewalls being created as well And then I have my database being provisioned and Only then after the infrastructure has actually been provisioned in Azure in our cloud Do we see the deployments happening application deployments? and Now the port shows it as healthy and the application shows as synced and healthy and You know looking at the application. I can see it's running. I I can even You know make a few queries against it and it's actually functional so The way this all fits together because I think this is where we're sort of hinting to is that we're adding Provisioning infrastructure as part of the Application deployment the continuous CI CD pipeline. So we're taking application and infrastructure through the same kind of Workflow and we're promoting it from int to stage to prod And finally just want to end up with a few takeaways We kind of skip through some some things for the sake of time So, you know in reality you would probably use compositions and claims with crossplane don't use manage resources directly That's actually not a good pattern And also I think that there are a lot of things in order to be able to do these kind of Full CI CD pipeline and deploy each commit to production You actually to make a lot of due diligence beforehand and we do that and if you want to check that out It's actually in the presentation like I saw was referencing earlier that we did at Argo con last year second I see I feel like the CI CD Ecosystem is really vibrant. I you know, I love Argo But I also feel like You know, these best practices are still in the works third Because we have this sort of distributed system. 
We have control plane clusters and then many remote clusters where the deployments actually happen, so monitoring and observability are really important, and they're quite problematic at scale. This problem of multi-cluster and multi-tenancy, with Argo but also with Crossplane, is actually quite hard to figure out, and I think we're still working through it.

That being said, this concludes our presentation. If you have any questions, there are mics on the sides of the room. Thank you.