So, we at DevRev also had some troubles and issues with our legacy CD provider, and when we started to look into other solutions, that's when we decided to migrate to Argo CD. Hey everyone, I'm Kush Trivedi. I work at DevRev as a platform engineer. Hey everyone, I'm Kush Pumittal. I work with Kush and others in the infrastructure team at DevRev, and today we are going to walk you through the journey we had while migrating from an enterprise CD tool to the Argo ecosystem, and talk about our learnings across different areas: design choices, deployment strategies, and automatic promotions. Just to set expectations, this talk is not going to be a documentation walkthrough. We will be talking only about our learnings, and it is opinionated towards how we leverage Argo at DevRev; there can always be other ways to use it as well.

At DevRev we are building a first-of-its-kind platform focused on bringing developers, also known as Dev, and the customers that bring us revenue, also called Rev, together. It is your one stop for managing everything, from an issue management system for developers to a support ticket management system for customers, and linking all of this information to the product itself at its core.

We had a very interesting journey over the last year, where we moved from beta release to general availability and witnessed approximately 10x growth in the number of microservices we run. At that kind of pace, our existing solution started to slow us down. It became difficult for us as an infrastructure team to keep up with the development pace, we started facing technical challenges at that scale, and with the number of microservices increasing, the cost of the tool went skyrocketing, which made it even more urgent for us to look at alternatives.

So, let's talk about the problems a little more. Apart from the cost, we also had technical challenges with it. The first one was that the tool itself was not Kubernetes-native. We could not manage the tool like any other Kubernetes object; we had to configure it in its own customized way and maintain separate repositories just for its config, which added operational work for us. It also introduced tons of abstractions and took a lot of the logic away from us; it worked behind the curtain, and it did not make us feel like we were fully in control of our systems, it felt like somebody else was controlling them.

The second major challenge was its invasive nature. The tool required its own agent running in our cluster with complete RBAC controls and dedicated compute nodes. This was a concern because it is one thing to give remote access to your cluster to an orchestrator, but it is a totally different thing to have its agent running inside the very cluster it has to manage.

Last but not least, the only way to manage the tool was through imperative pipelines; everything we did was through the UI. There was no central method to control the config that goes into the tool and no automated way to do it. We had no option to use GitOps to manage those pipelines in a more auditable and automated fashion. So, as the cost kept rising, we started getting pressure from upper management to look into saving some money.
That's when the platform engineers took it on themselves and started jotting down the requirements we would want from our new CD tooling. The first one was developer visibility: a developer should have enough visibility into how their application is behaving, how many pods are running, and whether everything is green, without having to go into the cluster and use kubectl on the command line. The second one was observability for the operators themselves: how many applications are getting deployed daily, how many failures there are, which service is the outlier, which service is causing frequent downtime. The third one was self-healing: if anyone went into our cluster and deleted some manifest manually, the tool should be capable of comparing the live state with the desired state and bringing things back automatically. This is closely related to drift detection: if there is any drift between the actual config applied in the cluster and the desired state, it should be able to sync it back itself. Apart from that, a few other requirements were advanced rollout techniques like canary deployments and blue-green rollouts. The next one was that it should be Kubernetes-native: we should be able to manage the tool with kubectl itself, without going into a separate realm of configuring it. And lastly, it should be non-invasive: we did not want to provision another XL or 2XL node just for an agent to run in the cluster.

After doing a brief analysis of the available CD tools and other ecosystem tools in the market, we reached the conclusion that the Argo ecosystem is the one that fits best. It has Argo CD, which is a declarative GitOps continuous delivery orchestrator. It has Argo Events, an event-driven automation framework for Kubernetes that can listen to various event sources and trigger Argo Workflows, lambdas, and other things. Then there is Argo Workflows, an open source container-native workflow engine that can orchestrate many parallel jobs at scale. And last but not least, Argo Rollouts, which gave us the capability to use advanced deployment techniques like canary and blue-green.

So, let's look at the design choices we had to make while adopting Argo CD specifically. The very first choice was between imperative and declarative methods of deploying. Most of the CD tools out there do offer declarative pipelines, but those pipelines are mostly focused on how the deployment should happen, and not so much on what it is that we want deployed. The Argo ecosystem offers the Argo CD CLI if one chooses to go the imperative route: you can use the CLI to define the sequence of your deployments, write your own custom scripts, build wrappers around it, and have your own customized deployment pipelines. However, Argo also offers a truly declarative method to manage your deployments. With Argo CD, you just define the end result you need in terms of Kubernetes manifests, and Argo CD takes care of deploying it for you; you do not have to focus so much on how it happens, and it makes sure that the desired end result is always maintained.
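To make that declarative model concrete, here is a minimal sketch of the kind of Argo CD Application manifest we are describing; the repository URL, path, namespace, and service name are hypothetical placeholders rather than our actual configuration.

```yaml
# Minimal Argo CD Application: declare the desired end state and let Argo CD keep the cluster in sync.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service              # hypothetical service name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deploy-configs.git   # hypothetical config repo
    targetRevision: main
    path: services/payments-service/overlays/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-service
  syncPolicy:
    automated:
      prune: true       # delete resources that are no longer declared in Git
      selfHeal: true    # revert manual changes made directly in the cluster
```

With `prune` and `selfHeal` enabled, the manifest in Git is treated as the single source of truth, which is exactly the drift detection and self-healing behaviour listed in the requirements above.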
So, this deployment configuration is what Argo CD uses as the reference to detect any drift, like Kush said, and to act on it: self-healing, pruning stale resources, and, if somebody has interacted with the cluster manually and spun up resources or changed anything, reverting all of that back. Everything in this talk going forward will be in the context of the declarative method of deploying things, because that is what we at DevRev chose to adopt.

The next design choice was around multi-tenancy: how do we serve different development teams using the same Argo CD, without handing out too much access here and there, and without adding operational overhead for us to maintain Argo itself? Argo CD offers a concept called app of apps, where you have a dedicated root Argo application that consists of the rest of your Argo applications, which in turn manage your microservices. You can also choose to do this on a per-team basis: one root application per team, consisting of the applications belonging to that team.

Along with app of apps, there is also the ApplicationSet functionality. ApplicationSets give you a number of generators you can use to create a template for deploying one microservice to different environments, and as a result of those generators you get your end applications. For example, the snippet you see here uses three different generators: merge, list, and git. When it is rendered, we get two different Application manifests, one for dev and one for prod, and it takes care of all the configuration it needs to know about the target clusters from the YAML specified there. All of this is done with a single piece of YAML, so if tomorrow I have to add more environments, say QA, staging, or pre-prod, I just need to add a few more elements there, update the config file, and we're good to go.

The third thing was: how do we control access to this Argo system for different dev teams? At DevRev we use namespace-based deployments in our clusters, where each service is deployed in its own namespace and access to these namespaces is scoped per developer. With Argo you can plug into that kind of system and use the same roles to manage access on Argo as well. Developers can only interact with deployment activities for their own services through Argo: they can view logs, they can even exec into the pod if allowed, and roll back or promote the rollouts, while for other services they can only view and not do anything.

Next: how did we actually scale our applications using Argo CD? The Argo CD documentation describes quite a nice way to install the high-availability version of Argo CD, which will cover most cases if you are managing your microservices' deployment configuration in different repositories. But things may become a little tricky if you decide to go the declarative GitOps way and store every deployment configuration in a mono repo. At DevRev we started using a mono repo to hold all of our declarative configuration deployed through Argo CD, and things started failing, or rather we started seeing issues, when we reached the scale of 300 to 500 applications.
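As mentioned above, our real ApplicationSet combines merge, list, and git generators; purely as an illustration of the idea, here is a simplified sketch that uses only a list generator to render one template into a dev and a prod Application. The cluster URLs, repository, and service name are made-up examples.

```yaml
# Simplified ApplicationSet: one template rendered into one Application per environment.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-service
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: dev
            cluster: https://dev-cluster.example.com     # hypothetical dev cluster
          - env: prod
            cluster: https://prod-cluster.example.com    # hypothetical prod cluster
  template:
    metadata:
      name: 'payments-service-{{env}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/deploy-configs.git   # hypothetical config repo
        targetRevision: main
        path: 'services/payments-service/overlays/{{env}}'
      destination:
        server: '{{cluster}}'
        namespace: payments-service
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding a QA or pre-prod environment is then just a matter of appending one more element to the generator list.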
A few of those issues were that a lot of applications would go into an Unknown state, and when Argo CD scaled up and reconciled all those applications, we would get a lot of ghost notifications in our service channels, which was not great for us. When we started digging into this more, chatting with the community and going through the code, we realized that the manifest-generate-paths annotation, which is what lets Argo CD refresh and sync only the single application that changed, works only when you are using a webhook-based approach. If you rely on Git-based polling, there is a substantial difference in the logic Argo CD uses: it will just check whether HEAD has changed and then try to reconcile all of the applications residing in your Git repository. You can imagine the trouble when roughly 500 applications were getting reconciled every three minutes.

So, how did we mitigate this? Ideally we would suggest that if you want to use Argo CD, you keep deployment configuration in separate repositories, but since we were already at around 650 applications when this started happening, we had to mitigate the issue in place. First, we disabled the Git polling, so we were driving our Argo CD syncs solely on the basis of the Git webhooks, the Git events we receive. Now, one could argue that Git webhooks can be quite unreliable, and Argo may fail to acknowledge one, with no polling to fall back on. So we put an Envoy proxy in front of our Argo CD that retries failed deliveries to Argo CD up to 10 times, and if Argo CD still fails to acknowledge the webhook, we get an alert in our platform channel saying things are not behaving properly and someone should look at what's going on. With this workaround we have been able to scale to around 1,000 applications; we also needed to increase the number of status processors and the Redis cache instances, which is already covered in the Argo CD docs for the high-availability installation.

Enough about Argo CD. Now let's see how Argo Rollouts, Argo Workflows, and Argo Events came into the picture and helped us with our CD pipeline. At DevRev we have enabled our developers to choose between two major deployment strategies, canary and blue-green, and both are supported by Argo Rollouts out of the box. For canary, one can choose to do an incremental rollout in stages: send a little bit of traffic to the new version first, wait for some time, and then keep promoting it to a higher percentage. In the example you see here, we have a 10-step canary promotion where every 10 minutes the rollout weight is incremented by 20%. That by itself is a very basic time-based promotion, but what we do is trigger an analysis template at the second step, which evaluates a metric that is essentially the rate of server errors, 5xx errors, in the new version. If that reaches a threshold of 5% for three consecutive evaluations at one-minute intervals, Argo Rollouts will not promote that version to stable and will roll it back. You can use any metric of interest, depending on the application, to have more informed promotions rather than just blind time-based promotions when choosing canary.
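Here is a hedged sketch of what such a canary strategy and error-rate analysis can look like with Argo Rollouts; the Prometheus address, query, label names, and exact thresholds are illustrative assumptions, not our production values.

```yaml
# Illustrative canary Rollout: 20% weight increments with 10-minute pauses,
# plus an analysis at the second stage that aborts on a high 5xx rate.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: payments-service
  template:
    metadata:
      labels:
        app: payments-service
    spec:
      containers:
        - name: payments-service
          image: registry.example.com/payments-service:v2   # hypothetical image
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 40
        - pause: {duration: 10m}
        - setWeight: 60
        - pause: {duration: 10m}
        - setWeight: 80
        - pause: {duration: 10m}
---
# AnalysisTemplate evaluated every minute; repeated failures roll the canary back.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 3                      # three failed measurements abort the rollout
      successCondition: result[0] < 0.05   # keep the 5xx ratio under 5%
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # hypothetical Prometheus address
          query: |
            sum(rate(http_requests_total{service="payments-service-canary",code=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="payments-service-canary"}[1m]))
```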
We also have blue-green deployments, which are basically about flipping the switch between version n and version n+1. Here as well, instead of just making the switch based on the health of the pods themselves, you can do more analysis. We provide three options to our developers. The first one is traffic mirroring: Argo Rollouts mirrors the traffic received by your existing version to your new version without the clients knowing, and that traffic generates all the metrics you need to analyze whether the new version looks healthy. Argo Rollouts then helps you analyze those metrics, and depending on the error rate you either mark it stable and promote, or you roll back. The second option is smoke testing: developers can write their own smoke tests and trigger them after the new version is available, for example in the form of a job, and depending on the job's success or failure you again choose to promote or roll back that particular build. The third is dummy API calls: developers can write some dummy API calls with expected responses, and if it doesn't look good, if the job fails, you roll back.

Blue-green deployments are mostly preferred for applications that are directly end-user facing, because you do not want inconsistent behavior in terms of what different users see at the same time, which is exactly what happens during a canary rollout. We also noticed that some applications, like web apps, regenerate their asset bundles on every new deployment, and when we did a canary rollout for such applications, the CDN in front started throwing 404 errors because it could not find the assets at the path it had received. That is a use case where blue-green is the ideal solution.

So, that was about making sure a build is healthy within one environment. Now, let's assume the rollout was healthy, the new build was marked healthy and promoted to 100%. How do you promote it to the next higher environment? How do you take it from dev to staging or QA, and then finally to production? We at DevRev use a combination of two more Argo ecosystem offerings: Argo Workflows, which is an engine to orchestrate parallel operations in Kubernetes, and Argo Events, an event-driven framework that can inform you about anything happening in your clusters so you can act upon it. What we do is this: once dev goes healthy, Argo CD Notifications triggers a webhook to Argo Events, informing it about this activity and passing along which application, which build version, and when it was promoted. Argo Events then triggers an Argo Workflow, written by us, which updates the next available environment, QA after dev, with the new build, and that kicks off the whole pipeline again. Argo CD detects a new version available for QA, picks it up, deploys it, runs the whole rollout once again, and once that is healthy, it triggers another workflow for production. The cycle repeats until we reach production.
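To sketch that promotion mechanism: an Argo Events Sensor can listen for the webhook that Argo CD Notifications sends when a rollout goes healthy and submit an Argo Workflow that bumps the image tag for the next environment. Everything below, the event source name, payload keys, helper image, and promote script, is a hypothetical illustration of the pattern rather than our actual workflow.

```yaml
# Hypothetical Sensor: on a "dev is healthy" webhook, submit a Workflow that
# updates the QA manifest with the newly promoted image version.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: dev-to-qa-promotion
  namespace: argo-events
spec:
  dependencies:
    - name: dev-healthy
      eventSourceName: argocd-notifications   # hypothetical webhook EventSource
      eventName: app-promoted
  triggers:
    - template:
        name: promote-to-qa
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: promote-to-qa-
              spec:
                entrypoint: bump-manifest
                arguments:
                  parameters:
                    - name: service
                    - name: version
                templates:
                  - name: bump-manifest
                    container:
                      image: registry.example.com/promotion-tools:stable   # hypothetical helper image
                      command: [sh, -c]
                      # hypothetical script that commits the new tag for the QA overlay to the mono repo
                      args: ["./promote.sh qa {{workflow.parameters.service}} {{workflow.parameters.version}}"]
          parameters:
            - src:
                dependencyName: dev-healthy
                dataKey: body.service     # assumes the notification payload carries these fields
              dest: spec.arguments.parameters.0.value
            - src:
                dependencyName: dev-healthy
                dataKey: body.version
              dest: spec.arguments.parameters.1.value
```

Once the Workflow commits the new tag, Argo CD picks up the change for the next environment and the same rollout and analysis cycle starts again.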
Now, some people also have the use case where rollouts to production are not done automatically; they need manual approval, somebody has to sign off that things look good. One can achieve that with Argo Workflows as well: you just add a manual step that waits for a user response, approve or reject, and you're good to go for continuous delivery too.

So, now let's see how all the components, Rollouts, CD, Events, and Workflows, connect together and what our complete CD pipeline looks like at DevRev. Whenever a commit is pushed to master, we trigger our general CI process, and as soon as a new image is available in our container registry, the registry sends a webhook call to Argo Events with the service name and the image version. From there, Argo Events triggers an Argo Workflow, which picks up the service name and version and updates the manifest for that specific service, which starts the deployment into the staging environment. As soon as Argo CD detects the change, the rollout controller kicks in, and at every step of the rollout we have an analysis template running which makes sure the error threshold does not exceed 5% at any stage. If it does, we roll back the application, mark it as degraded, and send a notification to the service team so they can look into it. But if the application is successful and survives all the stages of the rollout and the analysis template, we mark the deployment as successful, and as soon as that is done we send a notification to the developers' team on Slack. From that Argo CD notification itself we trigger a webhook call back to Argo Events saying this application has been promoted and the next promotion workflow, dev to QA in our naming, needs to be triggered. As soon as Argo Events receives that webhook, it triggers a workflow which promotes the application further along towards production. Once the manifest is updated for production, the same pipeline is triggered: the rollout controller comes in, the canary kicks in, the error threshold is evaluated, and the same notifications and alerting rules apply.

Now let's see how operators can get observability into how the system and the microservices are behaving. It is important for operators to have visibility into how many services are getting deployed daily, how many deployments are failing daily, whether there is an outlier service or an outlier service team that is frequently deploying failing commits, which services are getting rolled back the most, and whether there is a service team that needs to look at its test coverage. What we have done is run an in-cluster Prometheus that scrapes the metrics from Argo CD as well as Argo Rollouts and ships them to our global Prometheus, on top of which we have Grafana dashboards visualizing how many deployments happen daily and how many failures we see per day and per week. On the basis of that, we have written Alertmanager and Prometheus alerting rules: if a service has more than X failures in Y days, trigger a notification to the service team saying your application has this many failures, you might need to look at how you are deploying things, your test coverage, your CI pipeline, or you should introduce more tests. This helps both operators and service teams sleep peacefully, because they know how much is failing without flying blind in either case.
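As an example of the kind of alerting rule this enables, here is a sketch that assumes the Prometheus Operator and Argo CD's `argocd_app_sync_total` metric; the thresholds and label values are placeholders, not our actual rules.

```yaml
# Illustrative alert: flag applications with repeated failed syncs over the past week.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-deployment-failures
  namespace: monitoring
spec:
  groups:
    - name: argo-cd-deployments
      rules:
        - alert: FrequentSyncFailures
          # more than 5 failed or errored syncs for an application in the last 7 days (example thresholds)
          expr: increase(argocd_app_sync_total{phase=~"Error|Failed"}[7d]) > 5
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.name }} has had {{ $value }} failed syncs in the last 7 days"
            description: "Review this service's test coverage and CI pipeline before the next deploy."
```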
So, that was it, folks. Any questions?

So, the question is whether we have different ApplicationSets for dev and prod. The ApplicationSet is just one, and because it is a template, it automatically generates two different applications, one for dev and one for prod. The YAML only has to be written once, and then Argo takes care of creating the different applications from it. We use a git generator which picks up the overlays for the staging and production environments, and on the basis of those overlays we generate two applications, one with production values and one with staging values.

Yep, so the question is whether we have a single Argo CD deploying to multiple environments. Right now, yes, we have a single Argo CD instance which we have configured for high availability: we have around five Redis replicas, we have increased the number of status processors, and the controller replicas are more than two. Right now we have around 790 applications deployed across three different environments, all orchestrated by that single Argo CD, and as of yet we haven't seen any downside. The only thing we had issues with was the Git polling, which used to happen every three minutes.

Sorry, you mentioned you have a master branch, right? Is it the only branch you have, and you raise PRs against it and use Argo CD to promote to different environments, or how does it work? So, generally we promote everything from main, but we also have a generalized cherry-pick pipeline, so if there is a hotfix or an urgent change that needs to go out, we have a dedicated cherry-pick pipeline that listens to branches with specific prefixes, and from there we can generate and promote things.

Isn't it easier to just have a separate branch per environment, say develop, staging, and master? So, we basically follow daily deploys. The way our complete setup works is that if an application has enough testing, once a commit lands in master, within three hours that commit goes to prod via automated testing, and if at any stage an e2e test or integration test fails, we just send a notification back to the developer team, they fix it, and it again follows the complete daily deploy pipeline. It can land in prod without any manual intervention as of now. All right, thank you.

Hi, with traffic mirroring, how do you handle database writes and things like that, or event-based applications? So, we use Istio to do the traffic mirroring, and we have a guideline for the test suite that you can only have read calls, GET calls, because if you had write calls it could result in discrepancies in the database. At least for prod we enforce this; in dev the teams can do pretty much anything because they are the ones using it. You can control which API calls get mirrored via the manifest, so it is up to the developer team what they want to mirror, and then they come to us so someone from the platform team can review what they are doing; we also have a specific doc describing what should go in there.

I think we have time for one more. How do you manage the rollout of pipelines? Let's say I have three apps with dependencies, app A, B, and C, and if I roll out microservice A, microservice B should also roll out. Okay, so there we are not doing much on the CD part, but we have
a dedicated CI pipeline, and we have integration tests for every service which ensure that, against the versions actually deployed in prod, the changed service gets spun up in a PR sandbox environment where we run the tests, the API tests or whatever tests the application team has written, against that version. So, let's say we have three applications, A, B, and C: if there is any change in A, we will pick up the B and C versions running in prod, deploy A into a completely isolated sandbox, run the tests there, and only then does the CD process begin. So, the dependencies never reach the CD pipelines; they are resolved in the CI part of it. All right, perfect, I think that's all the time we have. Everybody, give them a round of applause. Thank you so much. Thank you so much.