Hello, everyone. Good afternoon. Thanks for coming to our session. I'm Praveen Yalagandula, and with me I have Gaurav Rastogi, my colleague from Avi Networks. Today we are going to talk about continuous delivery of cloud applications with blue-green and canary deployments. Here are a couple of quotes from well-regarded folks in the CI/CD community. Martin Fowler says that continuous delivery is a software development discipline where you build your software so that it can be released to production at any time. And Jez Humble says that the goal of continuous delivery is to make deployments, of whatever kind of system you have, a predictable and routine affair that you can perform on demand at any time. So continuous delivery involves automating and integrating all stages of your software development, from writing your code through testing to the final release, and you want that process to be predictable, reliable, fast, and of high quality. It has lots of benefits. It makes for better organizations, because it empowers developers to develop software, release it, and get feedback on how well it works right away, instead of developing something today and only starting to get bug reports six months later. From the business perspective, the PMs and sales teams love it because they can get features into customers' hands in a very short time. Operations definitely feel better too, because you have better quality software with less downtime and less risk. And overall, this process also improves your security, because you have a faster way to patch your cloud applications and hence reduce threats.
In the old-school way of deploying applications, the process is pretty siloed. You have the developers doing the software development, testing, and build cycle for each component or service they own. Then you have the release managers putting these together, doing integration testing, and generating a release candidate. And finally you have the operators deploying, upgrading, monitoring, and rolling back. Each of these silos has cycles measured in months, so the overall feedback loop is very long. In contrast, continuous delivery puts all of this in one cycle: develop, fix, build, test, release, deploy, everything automated, going around the loop with the ability to fall back at any point, and that quick feedback makes all the difference. The key to getting there is automation for every piece of the cycle. Another big challenge is that you want zero downtime in the deploy step. One reason people don't do frequent upgrades is that they always have to schedule a downtime. If you have a process with no downtime for bringing up the next deployment, you can do many more deployments. That's where blue-green and canary deployment strategies come into the picture. Blue-green is a strategy where you run two deployments of your application, each with a different version. Typically blue refers to the current version and green to the new version you are installing. As part of the deployment, you switch your sessions over to the new, green version and make sure the validations pass.
If everything is good, you move to green and either keep blue around for some time, destroy it, or analyze it. If validation fails, you roll back and use the green deployment for debugging what happened. That's one deployment strategy. The other, even more powerful strategy is canary. In a canary deployment, you divert a small fraction of traffic to the new version and keep testing it. If validation fails, you simply scale the traffic back to your old deployment. As you get more comfortable, you keep shifting traffic, and eventually you move all of it to the new version and discard the old one. One interesting thing about canary deployments is that, in contrast to blue-green, where you essentially create two identical setups and therefore need 2x the resources at deployment time, with canary you can do it more elastically. Your resource needs are not 2x but maybe 1.1x or 1.2x, and with an auto-scaling setup you can naturally grow one version and ramp down the other. But this requires that your application can handle that kind of elastic movement from one version to another. To quickly talk about the validation step: once you have these two versions in place and start moving traffic, you need to decide on your validation methodology. That includes figuring out what kind of traffic to send, how to increase it, how long to keep it at a given fraction, and so on. This depends entirely on the application, and you have to pick the right model for validating yours.
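To make the ramp-up idea concrete, here is a minimal sketch of how a canary schedule might be computed. This is an illustration only; the function name and the step/target parameters are assumptions for the sketch, not part of any particular product.

```python
def canary_ramp(step_percent=10, target_percent=100):
    """Return the successive traffic ratios (percentages) that the
    new version receives during a canary rollout. A hypothetical
    helper: real policies choose their own step size and target."""
    ratios = []
    ratio = step_percent
    while ratio < target_percent:
        ratios.append(ratio)
        ratio += step_percent
    ratios.append(target_percent)
    return ratios
```

A ramp of 10% steps produces ten evaluation rounds; a cautious rollout might stop the target at 10 or 50 and hold there before a human approves going further.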
Metrics are very important there, to figure out whether the new version is behaving as expected. You have many kinds of metrics depending on the type of application, from response time to number of open connections to quality of requests, and so on. One thing about metrics is that you want to pick them so that they scale. If you are doing a canary deployment with a 90/10 split, with 10% of traffic going to the new version, you don't want to compare the new version's numbers directly against the old version's; you need to scale your metrics by the right factors. The other thing is that you want full auditing on the system as the deployment proceeds, so you can see how the migration is going and, if you roll back, figure out what happened. Now, we said you want these two different versions running. One of the big challenges is: how do you have both versions in place and at the same time manage the traffic across them? How do you orchestrate the traffic switching, whether blue-green or canary, between the two versions? You can do it in a variety of ways, all the way from the end client that uses the application to the L2/L3 network and the load balancer on the other side. Let me quickly go through these options. In the client-application-based approach, say you have version one and you bring up version two; maybe you push a patch to your clients so they start using version two, or maybe you have a catalog where you publish which set of clients should access version two, and so on. That's one approach, but it's not a generic or flexible solution, and it depends heavily on your application.
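The point about scaling metrics can be sketched as a per-request comparison rather than a raw-count comparison. Everything here, the names and the tolerance factor, is a made-up illustration of the idea, not a real product's validation rule.

```python
def scaled_error_check(canary_errors, canary_requests,
                       baseline_errors, baseline_requests,
                       tolerance=1.5):
    """Compare error *rates*, not raw counts, so a canary taking 10%
    of traffic isn't judged against the full baseline volume.
    Returns True if the canary's per-request error rate is within
    `tolerance` times the baseline's rate."""
    canary_rate = canary_errors / max(canary_requests, 1)
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    return canary_rate <= baseline_rate * tolerance
```

With 2 errors out of 100 canary requests against 20 errors out of 900 baseline requests, the rates (2% vs ~2.2%) are comparable even though the raw counts are an order of magnitude apart.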
The other common approach, what most people do, is DNS-based routing: you bring up version two and replace the DNS record with the new IP address, so clients get routed to the new version. Again, this has limitations. You often have DNS caching going on at different levels of the resolution chain. And for canary deployments, the granularity of switching becomes very coarse, because the server is the granularity at which you can switch traffic. BGP-based routing is another option, closer to where your application is deployed: you have an upstream router, and your new version advertises a new route so traffic goes to it, while the older version withdraws its route. Again, this has the issue that existing sessions get disrupted, because you are switching all sessions to the new version at once, and, similar to DNS, you get coarse granularity for canary deployments. In contrast, load-balancer-based deployments give you quite a few benefits, because load balancers typically have very flexible APIs for controlling how traffic is dispersed across versions. That fine-grained traffic engineering control is the key advantage of the load balancer, and we will go into the details of load-balancer-based deployments. To say a quick word about our load balancer: at Avi Networks we build a software-based load balancer with a web application firewall, service mesh, and built-in analytics. It has been vetted by several big companies, has been in use for six-plus years, and is a full-featured load balancer. Now let's go into the architecture and how it works.
It's built as an SDN-like architecture: we have separated the control plane and the data plane. The control plane spins up the data plane instances, the service engine virtual machines, as many as needed, on demand. So there is auto-scaling at the load balancer level too, so it can handle the traffic as needed. The service engines can be virtual machines, containers, or bare metal; they can be deployed in different form factors across OpenStack, vCenter, AWS, Azure, different clouds, a multi-cloud solution controlled from a single point. It's a true SDN-style solution applied to your L7 services. And we have a lot of intelligence built into it. One of the topics we are covering today is analytics-driven continuous delivery: all that intelligence is in the control plane, which continuously monitors the entire system and makes decisions based on what it measures. We also have full-fledged automation: it is 100% RESTful-API driven, so you can run Ansible-based or Terraform-based automation against the control plane and do your continuous delivery. The way the resource model works is that you have a virtual service on the left-hand side, and a pool group associated with that virtual service, which defines the sets of servers hosting your application. On the right-hand side you have two pools, a green pool and a blue pool. You can assign a ratio to each pool to determine, whether you are doing blue-green or canary, how traffic slowly shifts from one pool to another simply by changing the ratios. Gaurav is going to take it from here and show a demo of this analytics-driven canary deployment. Why don't you take it from here? Sure. Thanks, Praveen.
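The pool-group ratio model can be pictured with a toy dispatcher. This is a deterministic stand-in, a sketch under the assumption that requests are spread in proportion to the configured ratios; a real load balancer would use smoother weighted algorithms, and the names here are illustrative.

```python
def build_schedule(pools):
    """Expand {pool_name: ratio} into a repeating dispatch schedule
    whose entries occur in proportion to each pool's ratio -- a
    simple stand-in for weighted traffic distribution across a
    pool group."""
    schedule = []
    for name, ratio in sorted(pools.items()):
        schedule.extend([name] * ratio)
    return schedule

# A blue:green ratio of 8:2 sends 80% of requests to blue:
schedule = build_schedule({"blue": 8, "green": 2})
```

Changing the ratios to {"blue": 0, "green": 10} is then the blue-green switch; moving them in small steps is the canary.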
So one of the tough problems we have always faced is live production, right? No one wants to touch it; everyone is scared and runs into hiccups when they change things. With this solution, or any load-balancing-based solution, the VIPs don't change, as you see on the left-hand side. But as we change the ratios of traffic going to different versions of your application, really pools of application instances, you also want to control how the switch happens. For some applications you might test with a five-minute window, see everything is fine, switch over, and move on. But for certain applications you may want to run for days at small loads, 5%, 10%, 20%, and keep going. So how you ramp up traffic to the new version is very important. The other thing Praveen talked about is pass/fail rules; that's integral to the load balancing. The reason we have a lot of analytics in the load-balancing layer is partly to show nice graphs, I'll be honest, but the other part is to take that information and feed it back into deployment decisions. The load balancer can check that the application is not exhibiting any errors, that it is meeting its SLAs, that latencies are good, and then keep moving the deployment forward to the new version. Based on the policies, it can automatically roll back, or completely terminate connections to the previous version, or bring the new version up to a certain load and hold it there. There are also webhook integrations for external validations if required. So let me get into the demo of Avi. The first demo I'm going to show is a live switch from blue to green. And again, the emphasis is on automation, but as an admin you also want to know exactly what is going on, because you don't want too much magic.
You want predictability, and you want to observe the results as they happen. So let me go to... what I'm going to demo is the Avi Controller. When you log in, you have one place to look at all the virtual services. This is the list of virtual services, and it's a multi-tenant system. Let me turn around a little so I can see both sides. It's a multi-tenant system where you can pick a tenant, and right now I'm going to show a demo of a container environment with Kubernetes; what you're also seeing here is a VMware environment and an OpenStack one. I go to my OpenShift tenant, and I can see all my applications there. Now, how is this working? This is my OpenShift setup, a simple one-node deployment for the demo, with a default namespace. Inside the default namespace I have a couple of applications configured. Among the services I have a green service and a blue service. If I click on the green service, it's a very simple application with a service port on TCP/HTTP, port 80, and a selector that picks the green CICD service. Similarly, the one running right now is the blue one, and this blue one is tied to the route in Kubernetes, if you're familiar with that. If I go to the routes, here is the avi-cicd route, configured for ingress: a very simple application with HTTP on port 80 for the demo. Here you can see that the default backend is set to blue and it is getting 100% of the traffic. And my green version has already gone through the full CI pipeline, build, integration tests, unit tests, everything, and is ready. Now we're at the last but very critical stage: bringing it into production. Once the application is configured here, on the Avi side it is configured with a cloud; there is an OpenShift cloud configured here.
If I go to edit it, let me go to the admin account. The way it works is that in OpenShift, the micro load balancers are set up as pods, and those pods are visible to the Avi Controller because they are set up with a service account, and that service account is registered on the Avi Controller. Using that service account, Avi essentially becomes the default proxy for all the container apps, and the service engines are deployed as pods themselves. As traffic comes in, or as service VIPs are created and the route in OpenShift is created, the service engine takes ownership of that VIP and configures the underlying container networking so it can receive all the traffic and forward the requests. Now, why would you do that? The real reason is that with the Avi service engines you get WAF functionality, security, advanced load balancing, ACLs, all of these capabilities, in the container environment. So I'll go back to my tenant and my application. Here you can see that this avi-cicd virtual service was automatically learned from the OpenShift environment. The picture is pretty much what we showed before: the virtual service avi-cicd, which takes its name from the route, with two pools configured in it, the default blue one here and the green one there, each with two containers. The arrows are not showing up on the projector, but it shows which containers belong to which pool. As an admin, you can go into the application and see whether everything is fine. Right now you can see the transactions going through: about 10 megabits of traffic, open connections. This is the incoming side; all the clients' live traffic is coming in. If I go to my pool, I can see that pretty much everything is going to one pool right now, and if I go to my green pool, there's nothing.
There's absolutely no data, because nothing is going on there. So let me go ahead and change the routing and trigger a blue-green. This is a manual way of doing it: in Kubernetes you can post an API call and change the weights, and that triggers the blue-green. What I'm doing is manual, but it can be done via Terraform, Ansible, any tooling of your choice. Then I'll show the next case of doing it fully automated. So I'm going to move green from 0 to 100, and blue becomes 0. That triggers an update, and Avi picks it up. First this event happened, where it picked up the ratio change. Once the ratio change is done, we should see the traffic actually getting switched over. If we go to the virtual service side while this whole thing is going on, you are not going to see any change in the traffic. All the clients are still coming in, all the transactions are still going through as is, no issues, no errors. So you are switching the traffic in real time. Let me go back to the green pool: you can see that it has started receiving traffic now and is ramping up. And if we go back to the blue pool, you can see there's no more traffic here. Right here you can do all your validations on top of it, and as an admin you can see that there are no issues going on and confirm the change. One more thing that happens here: even though it's a container environment and a lot of these containers are ephemeral, you have full visibility into what is going on. I can go into one of the containers and get stats and analytics on each of them; this is each container's metrics. If there are any scheduling issues, or any issues on the host, they would pop up here: open connections, all this information is right here.
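The manual trigger here boils down to updating the route's backend weights. As a hedged sketch, the payload one might PATCH onto an OpenShift Route looks roughly like this; the `spec.to` and `spec.alternateBackends` fields follow the OpenShift Route API, while the service names are just the demo's blue/green placeholders.

```python
def route_weights_patch(default_service, default_weight,
                        alternate_service, alternate_weight):
    """Build the body of a PATCH against an OpenShift Route that
    shifts traffic between two backing services by weight.
    Field layout follows the OpenShift Route spec; the names
    passed in are placeholders for this sketch."""
    return {
        "spec": {
            "to": {"kind": "Service",
                   "name": default_service,
                   "weight": default_weight},
            "alternateBackends": [
                {"kind": "Service",
                 "name": alternate_service,
                 "weight": alternate_weight},
            ],
        }
    }

# Flip all traffic from blue to green, as in the demo:
patch = route_weights_patch("blue", 0, "green", 100)
```

The same payload with intermediate weights (say 90/10) is how a tool like Terraform or Ansible would express a canary step against the cluster API.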
Let me go to... actually, it should have shown the memory; I'm surprised it's not pulling up here. OK, let me go and show the next one, which is the canary deployment. The previous example was set up with the load balancer fronting two pools, two versions of the application, and the admin, via API or through the OpenShift UI, triggered a blue-green deployment. But this whole thing can also be set up inside Avi using the pool group deployment policy. Let me first show the other app inside OpenShift. We go to the application services, and here I have two versions of the app, two services actually, registered to this Avi canary route. This becomes a virtual service on the Avi platform. If we go to the controller, here is that virtual service, and the way to trigger the canary is that those two pools become a pool group in Avi, and on this pool group I have an OpenShift canary deployment policy configured. If we go inside this policy, you can see all the things we talked about: you want validation metrics. There are two validation metrics set up here, which say that the percentage of response errors should be less than in the previous version and, at the same time, less than 10%; I could have put 1% as well. In addition, it is looking at the Apdex score. Apdex is a meta-metric that looks at the quality of transactions: it ranks all transactions between the ones that are meeting SLAs and the ones that are not, and looks at the ratio. If all transactions are meeting SLAs, the score is 100. Right now it is set up so that the new version should be as good as the previous version and, at the same time, score more than 90. In addition, you can see these two settings that control the evaluation duration for the whole deployment: at every round, it sends a fraction of the traffic.
That fraction is determined by this number here, and it runs for one evaluation duration. Once that evaluation duration is done, it checks all the metrics to see that everything is fine, and if it is, it promotes the deployment to the next level, and it keeps going that way. The target ratio determines, starting from the previous ratio version one had, what fraction of traffic version two should eventually get, and the load balancer will shift that much. This caters to a lot of deployment styles: for certain apps, admins just set the target ratio to 100; for very critical apps, sometimes they say, I want to stop and look at everything at 10, and then let the automation go from 10 to 100, for example. Once this is set and the canary deployment gets triggered, the deployment state for the pool is set to evaluation-in-progress, and it takes the whole run from there. This configuration is set up to go incrementally by 10%, with each evaluation window at 5 minutes, so the whole thing takes about an hour. I had it run earlier in the day, and I'll just show you what it looks like. This is the canary upgrade. Let me zoom out to six hours. Here you can see that the open connections stayed pretty flat; my script had dropped in between, and the chart is showing UTC rather than local time, so you don't see anything in that gap. This is where all the deployment steps were going on. If I click here, you can see the deployment updates: the first deployment update happened at 10:38, and then every five minutes it kept updating the ratios. You can see here the ratio was updated to 11, then updated again, and so on; I have it all listed out here. Let me show: you can see how at every stage it kept bumping up the ratios.
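The policy's promote-or-roll-back loop can be simulated in a few lines. This is a sketch of the behavior described above, not Avi's actual implementation: the thresholds mirror the demo's policy (errors under 10%, Apdex at least 90), and `metrics_by_ratio` is stand-in data, since the real policy reads live metrics from the controller.

```python
def run_canary(metrics_by_ratio, step=10, max_error_pct=10, min_apdex=90):
    """Simulate a pool-group canary policy: after each evaluation
    window, check the pass/fail metrics for the current ratio and
    either promote by `step` or roll back to 0. `metrics_by_ratio`
    maps each traffic ratio to the (error_pct, apdex) observed
    during that window. Returns (final_ratio, promoted_history)."""
    ratio = step
    history = []
    while ratio <= 100:
        error_pct, apdex = metrics_by_ratio[ratio]
        if error_pct > max_error_pct or apdex < min_apdex:
            return 0, history          # validation failed: roll back
        history.append(ratio)
        if ratio == 100:
            break
        ratio = min(ratio + step, 100)
    return ratio, history

healthy = {r: (1, 95) for r in range(10, 101, 10)}
final, promoted = run_canary(healthy)
```

With healthy metrics the ratio climbs 10, 20, ... 100 over ten windows; inject a bad window at 30% and the simulation rolls the ratio back to 0, which is exactly the audit trail the events show.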
And you have a full audit trail of what happened as the traffic was being switched. You would want to know what happened at each stage, so at each stage it also captures the pass/fail criteria and the results: for every metric, it records the value of that metric and whether it passed. And finally, at the end, there is the success event, which captures the previous pool in service, the new pool in service, and the final results. All of that is captured as an event, and these events can be used to trigger other notifications; you can set it up with Slack, or with email, all of that is configurable. If I go to the pool group, the final state shows one version, version two, in service, and the other out of service. I'd also like to show this view: you can see how this pool, the default version-one pool, was carrying the traffic all the way and then finally died down around 10:40 or so, and the other pool started ramping up traffic from 9:30 all the way until about 11 o'clock, when the whole deployment had switched over from one version of the application to the other. So let me go back to my slides. With that, I also want to present a case study: one of our customers used Avi to do blue-green and canary deployments in their production environment, along with the lessons they learned in general around continuous delivery. The case study is from a leading provider of research, databases, and information, and we have presented it previously too, if you want to look up more details.
When they were looking at continuous delivery as a goal, these were their main issues. Their deployments were very costly and high-risk: a large set of changes would be pushed in together, and they would always have build breaks or other issues. Their environments were inconsistent, and their development environments were very unstable. Production usually got a lot of care while development was a mess, because everyone was using different patches and different things. Issues that showed up in development wouldn't show up in production, and a lot of the time what didn't show up in development would actually show up in production. This essentially resulted in two to three releases per year, with a lot of changes patched together, causing instability in each release. I'm sure this is a very common story; we have seen it in several customers' environments and in our own development as well, and we could really empathize. And this problem compounds at high scale, in this particular case more than 300 services across 8,000 systems. The way they got to continuous delivery was to start with what they call an automation factory, whose goal was to have everything inside it automated. Obviously they had a lot of systems and couldn't transition everything in one go, so they started with a small set of applications as a walled garden where everything was 100% automated. They put a lot of emphasis on that 100%, because even one manual step, needing to change a config file or touch a file by hand, breaks the whole thing. You really have to be able to go from start to finish, from a code check-in to delivery, in a 100% automated way.
And that's when everything works. Eventually they were able to do releases in two weeks, with a pretty impressive pipeline. One of the things they also emphasize is that you've got to have a stable environment, and to get there it needs to become infrastructure as code, very much in the spirit of OpenStack. So they wrote down their requirements, and some of them, I believe, apply across the board. All deployments must be based off purpose-built, automatically generated image templates; that's very important, because the seed image from which the automation kicks off needs to be predictable and built automatically. Testing must be automated. Validation must be automated; you don't want somebody getting a report and then doing a manual approval, because that doesn't scale. One very crucial thing is that with multiple systems, when you are troubleshooting, you want the information to flow into a common place, like Log Classify (the session before this one) or systems of that kind, because you don't want administrators logging into a system and making it dirty, where they could inadvertently introduce issues. And any time a system is touched, it needs to be redeployed through the whole pipeline. For the infrastructure build-up, everything was checked into Git. From Git, the manifest information was pushed into Consul, which is a key-value data store, and from Consul the different parts of the deployment would pick up their information. For example, Jenkins would pick up the manifest file and how to build the application image, whether Linux or Windows, and from there it would push the packages on top and run the services. So all of this information lives in one single place, and that single place is triggered off GitHub.
Now that the image is built and the environment is always set up properly and predictably, the next thing is that whenever changes need to be made, they always go through the pipeline. The pipeline starts at Git and follows through the different stages orchestrated by Jenkins. Everything, system patches, operating system upgrades, agent or configuration updates, goes through this pipeline. And having blue-green makes it really easy for them to do complex upgrades like an operating system upgrade: with everything else staying the same, client traffic keeps coming in to the Avi load balancer, and they can bring in a completely new version and test it without losing or tainting their existing environment. This example pipeline shows, on top, what success looks like: it hits every stage, promotion happens to the next stage, validation, deploy, promotion, repeating. And when there is a failure, they are able to detect it most of the time before the code ever makes it into production. So this is a quick run through their pipeline. The building of the image itself was done with Ansible tooling: it would check out the right packages and build the image, and once built, it would kick off Jenkins. Jenkins was set up with the Blue Ocean pipeline, and all the artifacts were stored in JFrog Artifactory. Then Jenkins would kick off the application deployment. Once the application is built and unit tests are done, a Heat stack is created in OpenStack. The Heat stack translates to a pool in Avi, and the pool becomes part of the pool group in the Avi environment. And then, like we saw in the demo, the automatic blue-green is triggered using the ratio changes.
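The stage-by-stage flow with promotion on success and a stop on failure can be sketched as a small runner. This is an illustration of the flow just described, not the customer's actual Jenkins setup; the stage names are assumptions for the example.

```python
def run_pipeline(stages):
    """Walk the pipeline stages in order, stopping at the first
    failure so a bad change never reaches the later (production)
    stages. Each stage is a (name, check) pair where check()
    returns True on success. Returns (completed_stages, failed_stage)
    with failed_stage set to None on a clean run."""
    completed = []
    for name, check in stages:
        if not check():
            return completed, name     # stop here: this stage failed
        completed.append(name)
    return completed, None

# Hypothetical stages; validate fails, so promote is never reached:
stages = [
    ("build", lambda: True),
    ("unit-test", lambda: True),
    ("deploy-green", lambda: True),
    ("validate", lambda: False),
    ("promote", lambda: True),
]
done, failed_at = run_pipeline(stages)
```

The same structure covers OS patches and config updates: any change enters at the first stage and can only reach "promote" by passing every gate before it.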
But before the ratios are updated, once the new version is part of the pool, Avi does health checking and all the validations on that version of the application, even before live traffic is switched to it. With content rules and the policies available in the load balancer, these kinds of operations are very easy to do. Once the new pool comes in, at the bottom you see priority two: that is the pool currently in service. The first one gets the test traffic; if everything goes fine, the old one is taken out. But if there are any failures, the beauty of this deployment is that the previous system is intact and ready to take over on rollback, so recovery becomes very clean in this kind of environment. And in the canary case, again, they start at 90-10, go to 50-50, and then to 100. At the end, the other pool is still there, still available, just not receiving any live traffic, and if somebody wants to do triaging, look at the logs, or pull artifacts from it, they can do that as well. So, with that, the lessons learned: one, testing is extremely important, which is obvious. Two, do not underestimate the cultural change required. A lot of the time the technology and the tooling are there; in fact, with a simple API change like this you can redistribute live traffic, and anyone can do blue-green now. We have been able to show blue-green with so many different kinds of apps. But for many teams it may be too much automation, too much of a cultural shift, to stop babysitting the whole deployment. Again, do not start out trying to automate a manual process; it is better to start from a walled-garden approach where everything is automated and then bring your applications into it. And buy-in from all levels of the organization is critical; this problem cannot be solved in a silo, and there are always going to be multiple parties involved in any enterprise.
One last note: even though we talk in terms of blue-green, in practice the colors mean nothing; version one and version two are what most people actually use, and you can come up with your own nomenclature. With that, thanks, everyone. If there are questions, we'll be happy to take them.