Hello, everyone. Welcome to GitOpsCon 2022. Today, we are going to talk about intuitive progressive delivery across microservices in a dependency graph. I am Hari Rangali, an engineering manager at Intuit, and I'm also one of the maintainers of the Argo project, specifically Argo Rollouts. And here is my co-speaker, Rohit.

Hi, this is Rohit Agarwal. I work at Databricks as a software engineer. Some of the areas I'm focusing on right now are traffic infrastructure, service mesh, and safe deployments. I'm an active contributor and maintainer of the Argo Rollouts project. Let's talk a little bit about Databricks. We are a unified analytics platform, helping our customers get insights out of their data. We did $800 million in ARR last year and have seen 75% year-on-year growth. We have offerings across all three major clouds: AWS, Azure, and GCP. We are active users of the Argo project, and we are currently working on scaling thousands of deployments across 40-plus regions using automation built on top of Argo Rollouts.

Intuit was founded in 1983 and has evolved ever since over the past few decades, currently at 12-plus billion dollars of revenue and 100 million-plus customers. It is primarily a fintech software company with well-known products like TurboTax, QuickBooks, Mint, Credit Karma, Mailchimp, and so forth. Intuit invests around 2 billion-plus dollars in R&D, and it's very tech-heavy, with several well-known open-source projects coming out of it. One of them is Argo. Argo, as you all know, is a set of Kubernetes-native tools to manage jobs, applications, and resources. It started as part of Applatix, which was acquired by Intuit in 2018, and it is now part of the CNCF as an incubating project, on track to graduation. As Argo evolved, multiple projects came out of it, solving various use cases. One of them is Argo Workflows, a container-native workflow engine with use cases in batch processing and MapReduce, and in domains like ML modeling and machine learning.
Argo CD, on the other hand, is a GitOps-based continuous delivery system with a rich user experience to manage these resources. Argo CD itself has evolved with several sub-projects, like ApplicationSets to manage clusters at scale, a notification engine, and so forth. Argo Rollouts brings in some of the advanced progressive delivery and deployment strategies: it solves progressive delivery with automated rollbacks, and supports advanced topics like A/B testing using experimentation, among several other capabilities. And last, Argo Events, which is an event-driven dependency manager. The Argo project and its ecosystem continue to evolve, with growing adoption in the GitOps space. Today, we are going to show you how we can use this Argo ecosystem to solve the current use case. Before getting into progressive delivery in a dependency graph, my co-speaker will talk about what progressive delivery is and what it means for microservices.

Let's take a minute to recap what progressive delivery means. Progressive delivery is built on top of continuous delivery, with more features like gradual rollouts, canarying, A/B testing, and metric analysis. In a nutshell, it's these four things: releasing a product in a controlled manner, having complete control over the blast radius during updates, supporting bring-your-own metric analysis, and finally having automated promotions or rollbacks as per the KPIs. Let's see how progressive delivery works in a microservice ecosystem. We have a stable version of a service, called V1. We are trying to roll out a new version, say V2. We do a canary. We collect and analyze the metrics. If everything is looking good, we mark the new version as stable. If not, we roll back to version 1 by shifting all the traffic back to the stable version, and then we maybe trigger a notification saying that your new version's rollout failed. Now let's see what progressive delivery looks like with Argo Rollouts.
We have a controller which interacts with a custom resource (CRD) called Rollout. We have an analysis template, which contains a set of rules, or steps, on how to shift traffic and perform analysis. We have a stable ReplicaSet, which you see at the bottom. Now, when we are trying to release a new version of the service, the controller will create a new ReplicaSet for the canary pods. It will also create an AnalysisRun object, which will analyze all the metrics. It will start gradually shifting traffic from stable to canary, depending on the rules. Both the stable and the canary versions emit Prometheus metrics, and the AnalysisRun consumes these metrics to make decisions. Based on whether the analysis is successful or not, it will either shift more traffic from stable to canary, or it will roll back to the stable version and mark the canary as degraded. You can also integrate with a service mesh if you want more gradual control over how you shift traffic from stable to canary. With that, I'll pass it on to Hari to talk about service dependency graphs.

Thank you, Rohit. As you all know, a mesh architecture for services is not a new concept. As companies grow and scale, these kinds of dependencies happen. Let me talk about the situation at Intuit. Intuit currently has 2,000-plus services, and several of them form these graphs. We have had several incidents happen because a change in one service had an impact on downstream services. These are some of the snippets that I posted here; you can read through them. Sorry for masking some of them for confidentiality. If you read one of the snippets: one of the services changed, the change was not communicated to the impacted consumers, and the impact was that logins failed. What they did was restore the impacted service to a previous state, that is, a stable state. And these are just a few of the incidents I have reported here.
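The pieces just described, a Rollout with canary steps and an AnalysisTemplate that consumes Prometheus metrics, can be sketched as a pair of manifests. This is a minimal sketch, not the demo's actual configuration: the names, the Prometheus address, and the query are illustrative assumptions.

```yaml
# AnalysisTemplate: the rules an AnalysisRun follows while traffic shifts.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate                # hypothetical name
spec:
  metrics:
  - name: success-rate
    interval: 1m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.svc:9090    # assumed address
        query: |
          sum(rate(http_requests_total{service="demo",code!~"5.."}[1m]))
          / sum(rate(http_requests_total{service="demo"}[1m]))
---
# Rollout: canary strategy, gradual traffic shifting, and analysis steps.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo
spec:
  strategy:
    canary:
      canaryService: demo-canary    # assumed Service names
      stableService: demo-stable
      trafficRouting:
        istio:                      # service-mesh integration is optional
          virtualService:
            name: demo-vsvc
      steps:
      - setWeight: 20               # shift 20% of traffic to the canary
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50
      - pause: {duration: 5m}       # let metrics accumulate before full promotion
```

If the analysis fails at any step, the controller shifts all traffic back to the stable ReplicaSet and marks the canary as degraded, exactly the behavior described above.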
At Intuit, around 30% of incidents happen because of these changes. We measure a KPI called MTTR, which is nothing but mean time to restore, to gauge the operational excellence of these services, and what we are trying to do is reduce this MTTR to as low as possible. For individual microservices, we are using Rollouts and its progressive delivery capability to reduce MTTR. But let's see how we can do this for services that are in a graph.

To move forward, let's look at a simple graph. Let's say there are four services: S1, S2, S3, and S4, and in the V1 state everything is stable. Let's say a change is introduced in S4, and because of that change, hypothetically, service S1 gets impacted. What ideally needs to happen is a rollback to the previous state, with everything done automatically. Usually, in companies like Intuit, there are coordination teams, or integration teams, which work with all the services in that dependency graph. They release, and they monitor, maybe looking at the dashboards and everything. One of the main problems is that there are a lot of manual steps involved; a lot of manual hours are put in to make sure the release process moves. How can we make sure all of this is automated, so releases happen smoothly and roll back if there's an issue? In the next slide, I'm going to show how we use the Argo ecosystem projects to create an architecture, or a flow, that solves this problem.

Let's read through it from left to right, top to bottom. Start with a change in S4. The change in S4 triggers a canary and runs the analysis. If it is successful, we move forward; if it fails, we roll back. As I mentioned earlier, this rollout, or progressive delivery, within a single microservice is a capability provided by Rollouts. Now, let's say it is successful.
Next, it should trigger notifications, as events, to the downstream dependent services. We are using Argo Events to capture those events, and that integrates with Workflows. What Workflows brings in is creating a batch of jobs for each of S1, S2, and S3. As you see here, the S1, S2, and S3 batch jobs are jobs that run KPI analyses to measure metrics against those services. If everything is successful, the workflow exits on its own. If anything fails, what needs to happen is a rollback, and we are using Argo CD's capability to trigger a rollback to a previous, stable state. So this is the workflow, or the architecture, that we will be demoing in the next few slides in full screen.

As you see here, I'm showing Argo CD, running locally, with four services: the rollouts demo, which is where I'm inducing a change, and three other services, which I'm treating as its dependencies. There is also a Rollout and Workflows deployed. And this is the Workflows UI, where there are currently no workflows. Let's go through in detail what the rollouts demo has and what its specifications are, and I will also talk about the templates that are created for it. If you see here, this is the notification that was introduced, to send the notifications. Let's look in more detail at what the Rollout patch has: on rollout completed, send the above notification to the downstream services, where it is captured by Argo Events and handed to Workflows. The next thing is the traffic routing using Istio. And there is an image, the yellow one, that is currently deployed. Let's look at the Workflows side as well. Here I have a WorkflowTemplate that is added, and in this template I have a series of jobs defined, which actually run this analysis. And whether the result is success or failure, an action is defined to roll back or roll forward as part of this workflow template.
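The WorkflowTemplate described here, one verification job per downstream service plus an action that rolls back on failure, might look roughly like the following. This is a sketch under stated assumptions: the template name, the check script, the container images, and the Argo CD application name are all hypothetical stand-ins for whatever the real template uses.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: downstream-verification      # hypothetical name
spec:
  entrypoint: verify-all
  onExit: decide                     # runs after the jobs, on success or failure
  templates:
  - name: verify-all
    steps:
    - - name: verify                 # fan out: one KPI job per downstream service
        template: verify-service
        arguments:
          parameters:
          - name: service
            value: "{{item}}"
        withItems: [s1, s2, s3]
  - name: verify-service
    inputs:
      parameters:
      - name: service
    container:
      image: curlimages/curl         # assumed image; checks KPIs for the window you configure
      command: [sh, -c]
      args: ["./check-kpis.sh {{inputs.parameters.service}}"]   # hypothetical script
  - name: decide
    container:
      image: argoproj/argocd         # assumed image with the argocd CLI
      command: [sh, -c]
      # If any verification job failed, ask Argo CD to roll the app back
      # to a previous history entry (stand-in for the real rollback call).
      args: ["if [ {{workflow.status}} != Succeeded ]; then argocd app rollback rollouts-demo; fi"]
```

The exit handler is what makes "success exits by itself, failure rolls back" automatic: `{{workflow.status}}` reflects the outcome of the fanned-out jobs.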
Again, there are no workflows yet; only the template is defined, binding everything together. Now let's introduce a change in the rollouts demo. In the current rollouts demo, it's yellow. I'll show you the rendering part, where a yellow color is rendered, just for demo purposes, and I'll change it to blue. As I said earlier, it's a GitOps model: Argo CD is a GitOps delivery system, so it will recognize the Git change, and if you refresh, it can show the change and sync from the user interface. Teams use auto-sync and everything, which are, again, capabilities Argo CD already provides, but here, at least, we make sure everything is pipeline-driven. As you see, yellow to blue is the only change. When I sync it, it should create a new version. Again, I'm going back and forth just to show that there is no workflow created at any stage until the rollout is successful. When the rollout is created, the canary ReplicaSet is created, there is traffic routing in the virtual service, and the analysis is also running. As you see, blue and yellow are rendered in this graph. The other colors, like purple, red, and yellow, are defined just for demo purposes; they don't carry much meaning. What we care about is what happens after this microservice's rollout is successful. That is where we can use a full promotion to move forward and skip the remaining analysis steps. Still no workflow. So let's do the full promotion, so that it is captured as a rollout-completed event, which automatically triggers the notifications; these are captured by Argo Events and kick off the workflow. Once I did that and the rollout completed, the workflows were created. As you see, these are the batch of jobs, and if you see the template as well... okay. So here, what I'm doing is inducing an error rate.
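The glue between "rollout completed" and "workflow created" can be sketched as a notification subscription on the Rollout plus an Argo Events Sensor that submits the workflow. This is an illustrative assumption about the wiring, not the demo's actual manifests: the webhook service, event source, and template names are all hypothetical.

```yaml
# 1) Subscribe the Rollout to the on-rollout-completed trigger, delivering to a
#    webhook service assumed to be defined in the notifications ConfigMap.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-completed.webhook: ""
---
# 2) Sensor: when the rollout-completed event arrives, submit the
#    downstream-verification workflow.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: rollout-completed-sensor     # hypothetical name
spec:
  dependencies:
  - name: rollout-completed
    eventSourceName: rollout-webhook # assumed EventSource name
    eventName: demo
  triggers:
  - template:
      name: start-verification
      argoWorkflow:
        operation: submit
        source:
          resource:
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            metadata:
              generateName: downstream-verify-
            spec:
              workflowTemplateRef:
                name: downstream-verification   # hypothetical template name
```

The key property is that the workflow is created only when the rollout itself has completed successfully, which is exactly what the back-and-forth in the demo is showing.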
I want to show an impact caused by the change in the rollouts demo. What you're seeing are the jobs. I've configured them for one minute, but you can configure them for hours or even days, based on the use case, and run these KPIs against the services. Now, if an issue happens, what happens next? Let's give it a second or so, so the jobs can complete. If a regression is detected, a revert needs to happen, based on the change history, back to the stable state. Once all those jobs are done, it detects there is a problem: there is a downstream action which verifies the state of each job, and a rollback is triggered because of the error that was introduced in one of the services. I can show you, side by side, what happens when the rollback completes: there is a smooth transition from blue back to yellow, where yellow is your stable state. If you give it a second or so, once the rollback is completed, you'll see the transition; the rollback is completed here, and you'll be able to see the change in a second. So you saw blue change to yellow, because we triggered Argo CD APIs to move to the previous state, since we saw something impacted in the downstream services. Let's go back to the Argo CD UI, and if we refresh, you'll see this diff. What the diff means is that it just changed blue back to yellow. Ideally, you can also use the GitOps model, where you send a commit to the Git repository and the rollback completes as part of it; for demo purposes, we did it this way so that it is more readable. So this is the demo, which showed how we're using Argo Workflows, Argo CD, Argo Rollouts, and Argo Events to solve a use case that is very common across companies. Next, Rohit will talk about production readiness and some of the features that we have identified as part of this use case. Rohit?
All right, let me talk a little bit about production readiness. One of the pain points we had at Databricks was configuring the right thresholds, or the right set of metrics, so that they don't trigger any false positives. We recently added a new feature in Argo Rollouts called dry runs, which allows you to mark a set of metrics as running in dry-run mode. These dry runs help engineers add new metrics to their existing Rollout objects without worrying about the risk of impacting their current production workloads: they only emit metrics, but have no impact on the final state of the rollouts. Here is a quick example. If you are able to see, this is how you can add more metrics in dry-run mode: the metric name here is a regex, and you can specify which metrics you want to run in dry-run mode. It will not impact the final state of the rollout; it will just emit the metrics.

While running some of these metrics in dry-run mode, you also want to capture all the data points, not just the latest ten. Prior to version 1.2 of Rollouts, it would only capture the latest ten measurements and garbage-collect everything else. So we also added a new feature, measurement retention, with which you can customize how many measurements you want to retain. You can retain 100, you can retain 200, or you can retain everything that is being emitted. It's a very similar idea: you just have a measurement retention section, you specify the metric name, and then you specify the limit, that is, how many measurements you want to retain. We are also working on adding more observability: with it, you can export all the analysis results to an S3 bucket, or maybe a file system, and then do any post-rollout analysis you want. There is a very nice blog post published for version 1.2; if you want, you can read about dry runs and measurement retention there. Let me talk a little bit about what's next.
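The two examples Rohit walks through, dry-run metrics selected by regex and a measurement-retention limit, sit inside the analysis spec and look roughly like this. A sketch only: the metric name, threshold, limit, and Prometheus details are illustrative assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: guardrails              # hypothetical name
spec:
  dryRun:
  - metricName: .*              # regex: run every metric in dry-run mode;
                                # results are recorded but never fail the rollout
  measurementRetention:
  - metricName: error-rate      # keep more than the default of 10 measurements
    limit: 100
  metrics:
  - name: error-rate
    interval: 1m
    failureCondition: result[0] > 10   # e.g. more than 10 errors per minute
    provider:
      prometheus:
        address: http://prometheus.example.svc:9090   # assumed address
        query: sum(rate(http_requests_total{code=~"5.."}[1m]))
```

Once the retained dry-run measurements show where the real threshold lies, you narrow the `dryRun` regex (or remove the entry) so the metric starts gating the rollout for real.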
We have these cool features coming in the new versions of Argo Rollouts. There are dry runs, which currently operate at the service level; we want to expand them to the dependency level to get more confidence. We have graph generation based on traffic flow. We will be adding anomaly detection with intuitive scoring. We are very excited about traffic mirroring: you can mirror traffic to a set of upstreams without impacting the end user. We will add features for ad hoc analysis, more notification support, and dry-run simulation across services. With that, I'll close it up and open for questions. Thank you so much for attending the talk.

Great. Are there any questions? If there are, we've got a mic over there, and I can also run a mic to you if you're having trouble getting up to it.

Hi, I'm Ken. About the dry-run feature: is it supposed to be part of the process of going to production? How is it integrated into the CD process?

Right. So with dry runs, what you can do is, if you want to run the analysis and you don't know what the right threshold is, say you're trying to add more rules but have no idea what the threshold should look like, you can run those metrics in dry-run mode to collect more data and then fine-tune your metrics: okay, does 10 errors in one minute sound good? Then you can take it to production. That beats just estimating, "Oh, I'll just configure it for 10 errors in one minute," and then it fails, it rolls back, it triggers a bunch of alerts to your on-call, and it's just a nightmare.

Any more questions? Well, if there are no more questions, just let us know later; we'll be around if you have any questions about any of the features. And if you want to see anything in the future in Argo Rollouts, please come talk to us. Cool, thank you. Thank you so much.