Hi all. This is a lightning talk on implementing advanced deployments using Spinnaker and canary. A little about me: I'm currently working as an SDE 2 at Expedia. I've contributed to FOSSASIA as an open source contributor, and in my free time I like to write blogs and give talks. You can find me on these platforms under the handle champion paddler.

Bad deployments are unavoidable, but we can definitely reduce the damage. So, here are the deployment strategies available in the market. The first one is the recreate deployment. It is the simplest one: we just delete all the available services and launch the new version on the servers, so there is downtime with a recreate deployment. Then we have the blue-green deployment, where we create an entire identical copy of the stack and gradually shift the traffic to the new version. The third one is the ramped deployment, in which we shift the traffic gradually, server by server, rather than creating an identical stack. And the most used in the market is the canary deployment: rather than shifting the traffic completely, we shift just 5% of the traffic, or an even smaller percentage. Do you have any questions?

Here are the metrics we can analyze: latency, errors, traffic, and saturation. Based on these metrics we analyze whether the deployment we are doing is a good deployment or a bad one.

Here are the steps for canary analysis. We do the data collection, and then we check whether the data is available or not. In case some data is missing, we go for entirely complete data. Then we do the comparison, and then the score computation. This can be done both automatically and manually; it totally depends on the implementation.

Then there is how Expedia deploys. Until now we were doing blue-green deployments.
Now we have introduced progressive deployments, an advanced version built on top of Spinnaker and canary. One fact is that Expedia's total loss in revenue due to changes is 25%, which is quite a huge amount for Expedia. As we progress there will be many more deployments, and some of these deployments will definitely have issues, so we need to reduce those issues.

So, what is a progressive deployment? First of all, it is similar to a canary deployment, but the traffic shift is done in multiple stages. And it is built on top of Spinnaker. As you can see on the right-hand side, we have two stacks available. RCP is the in-house stack of Expedia. The second one is ECS, which is provided by AWS. Whenever there is a deployment or any update, we send a notification on Slack and also on ServiceNow, so that we get to know through emails.

Here is a quick comparison. On the left-hand side, with canary, we initiate the deployment and do one traffic shift, for example a 5% shift of traffic. Then we check and validate whether the deployment is going fine or not. We do the judgment and either promote to a 100% traffic shift, or we do a rollback if there is a fault. In a progressive deployment, rather than doing this judgment only once, we do it in multiple stages. That is why it is called progressive. You can control the stages, whether you do it five times or ten times. It is totally up to you what percentage you want to shift at each stage.

Let me show you a demo of progressive deployments. [unclear] Now I will deploy a faulty commit. Here are the steps: it will deploy to lab and then to integration, which are the lower environments. After that, here is the progressive deployment execution step. Here we will be shifting traffic in two shifts.
First we will shift 25% of the traffic and see whether we are facing any errors on the dashboard; if there are no errors, we will move to the 50% traffic shift. In this case, there will be errors at the 25% traffic shift, so it will roll back. Let's see how that happens.

Right now it is in the deploy-to-lab stage. Next, we wait until it moves to the progressive deployment execution stage. Now it has deployed to lab and to our integration environment. This is the progressive deployment execution step. It has started, and this is the 25% shift, which takes place through multiple steps. In this step the traffic will be shifted; until then it is doing other operations.

Let's look at the dashboard to see if there are any errors. This is the TPS, and these are the PD metrics. If we look at all three, primary, baseline, and canary, there are no errors coming up, because right now the traffic has not been shifted. We will see how this changes when the traffic starts shifting in this step. This step waits for the full 25% traffic shift.

Now the 25% traffic shift is done. If we look, canary has about 38 errors. So we can see that we are facing errors in the new deployment, whereas for primary and baseline there are no errors coming up. The previous version was stable; the new version is faulty. Even comparing canary versus baseline, canary has the errors while baseline is the stable one.

Now, on the pipeline, we get two options, approve or reject, as 25% of the traffic has been shifted. We can either approve, and it will move on to shifting 50% of the traffic, or we can reject, and it will move to the previous state and then go to the rollback. As our commit is faulty and we are doing manual judgment here, let's reject this. Yeah. So now it went to the previous state, and now it is going to the rollback stage. It will now run the rollback task that we have defined on that stage.
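As a rough illustration of the flow shown in the demo, here is a minimal Python sketch. It is not Expedia's plugin and it does not use Spinnaker's API; `progressive_rollout` and all the `demo_*` helpers are hypothetical names made up for this example. The judge function stands in for the manual approve or reject decision, and the error counts simply replay the numbers from the demo.

```python
from typing import Callable, List, Tuple


def progressive_rollout(
    stages: List[int],
    shift_traffic: Callable[[int], None],
    read_errors: Callable[[], Tuple[int, int]],
    judge: Callable[[int, int, int], bool],
    rollback: Callable[[], None],
) -> bool:
    """Shift traffic stage by stage; roll back on the first rejection.

    Hypothetical sketch: every callback is supplied by the caller.
    """
    for percent in stages:
        shift_traffic(percent)
        # read_errors returns (canary_errors, baseline_errors) for this stage
        canary_errors, baseline_errors = read_errors()
        if not judge(percent, canary_errors, baseline_errors):
            rollback()
            return False
    return True


# Replay of the demo: two shifts (25%, 50%) and a faulty commit.
log: List[str] = []


def demo_shift(percent: int) -> None:
    log.append(f"shift {percent}%")


def demo_errors() -> Tuple[int, int]:
    # Assumed values taken from the demo: 38 canary errors, 0 on baseline.
    return (38, 0)


def demo_judge(percent: int, canary_errors: int, baseline_errors: int) -> bool:
    # Stand-in for the manual judgment: reject if canary is worse than baseline.
    return canary_errors <= baseline_errors


def demo_rollback() -> None:
    log.append("rollback")


succeeded = progressive_rollout(
    [25, 50], demo_shift, demo_errors, demo_judge, demo_rollback
)
print(succeeded, log)  # False ['shift 25%', 'rollback']
```

Because the judgment runs after every stage, the faulty commit never reaches the 50% shift; with a single canary judgment there would be only one such checkpoint.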
So, this is how PD works. Multiple canary traffic shifts take place here, and we can approve or reject at every step. That is the major difference between PD and canary. Expedia also has open source projects; you can check them out under open source projects by Expedia. Thank you. Any questions?

Yeah. There is a rollback stage. Sorry. Yeah, we go back at this stage. After a failure, it goes to the rollback stage. We have the existing stack available, so we just roll back. Any more questions?

Yeah. It is a custom plugin by Expedia.

Right now, that is a topic of debate within the organization. We have very high traffic, and if even a traffic shift of one or two percent is affected, it means a very bad customer experience, and we are handling a very large number of customers. So right now we are going with manual judgment only. But we are experimenting with automating it, so hopefully in the future we will move to complete automation.

Yeah. For our service, we are actually currently migrating to RCP, so we are in the shifting stage for RCP right now. After that we will implement this, progressive deployments. But other services, I think, are already using it; about 326 services are onboarded on progressive deployments. But I think with this, we don't need to have a rollback. That is the test we have.

RCP is a custom infrastructure that we built on top of Kubernetes. Right now what happens is that each service gets allotted a number of instances, but we are not utilizing them completely. Say we take large instances, like MX large, which is one of the instance types on AWS. Our utilization keeps shifting, because in the PST time zone we have huge traffic, but when it comes to the IST time zone we have the least traffic, since most of the customers are in the U.S.
In those cases, and also for services that are tier-one services at Expedia, we have a fixed number of instances, so we don't scale down. So we can see that there is infrastructure cost that we can definitely reduce. RCP is an infrastructure layer that manages, controls, and automatically scales services. That was the main purpose of introducing RCP: everyone follows a fixed standard, and resources can be utilized properly, because the infrastructure cost is quite high. We are trying to reduce that using RCP.

Yeah, Spinnaker is centralized. As you can see, it is centralized and everything is handled using application names, so you can go over and search for your application.