Whoa, it's five twenty-five and so many people. Thank you for joining us. I hope we will make your time here worth it. I promise we will not sell anything; we just share our experience. My name is Yasen Simeonov and I'm a product manager at Intuit, looking after the developer experience, and I'm more focused on the service mesh and the API gateway. And yeah, Henrik.

Hi everyone. My name is Henrik Blixt. I'm also a product manager at Intuit, and I'm also one of the Argo maintainers. So today we're going to talk a little bit about how we use service mesh and how it looks at Intuit. I'm going to talk a little bit about progressive delivery, why we use it, how we use it, and then I'm going to talk about how well they work together, or how well they didn't work together, and what we experienced and realized as we tried to use them together. We'll look a little bit into the future, what we think is coming next and what the big focus areas are for us, and then we're going to round off with Yasen doing a live demo.

So, very quickly, about Intuit. Since we're in Europe, not too many of you probably know who we are, because we have 95% of our business in the US, but we're one of the leading US fintech companies, with roughly 10 billion dollars in revenue. What you probably know us more for is that the Argo project came out of Intuit. It came out of a startup that we acquired, and we later donated Argo to the CNCF. We're big supporters of open source, not just Argo; there are a number of other projects that we contribute to and use. You've probably all seen that massive CNCF landscape map: if it's on that map, we pretty much use it, and we contribute to many of those projects as well.

Okay. Yeah, before going to this slide, just out of curiosity: how many of you are here because of the service mesh part? Great. And how many because of the progressive delivery and Argo part? Whoa, great. Okay, excellent.
I think these slides show our scale, and it is important for the service mesh part, because you can see that we have more than 230 Kubernetes clusters, and the service mesh that I'm going to talk about in a moment stretches across all those clusters. So we have one single mesh across all those clusters. The rest is dynamic: our node count grows depending on the period. We have some peak periods where the count grows and some quiet periods when the count goes down. And we have more than 7 million pods running in our infrastructure.

Okay, so why service mesh? We are a fintech company, and for us security is very important. So we have this zero-trust infrastructure: in this particular example, service A cannot talk to service B without proper authentication and authorization. What we did in the past: we have this API gateway that you see on the top, and service A cannot talk to service B directly. We apply a Kubernetes network policy so that services can talk only to the API gateway, and that is where we apply all the security, all the TLS, everything that is needed, and all the advanced traffic management. On my slide the API gateway is just a circle; however, just for your information, in the peak period we process more than a million transactions per second. So in reality it is not just a circle, it is a huge set of replicas of the API gateway, and yeah, it is a complex environment. With the growth of the service mesh we figured out that, for us, this is the right way to deal with security and with traffic management, and that's why we started moving different services to the mesh.
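The pre-mesh pattern just described (services allowed to talk only to the API gateway) can be sketched with a standard Kubernetes NetworkPolicy. This is a minimal illustration, not Intuit's actual configuration; the namespace, labels, and gateway CIDR are all hypothetical.

```yaml
# Illustrative egress policy: pods in this namespace may only reach the
# API gateway's network (plus DNS), blocking direct service-to-service calls.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-via-api-gateway-only
  namespace: service-a          # hypothetical namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.100.0.0/16 # API gateway VPC range (illustrative)
    - ports:                    # allow DNS lookups anywhere
        - protocol: UDP
          port: 53
```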
So, for the services we add this Envoy proxy, and now the traffic can flow between the services directly, from service A to service B, without going through the API gateway. At this point all services have two endpoints. The initial endpoint is the public endpoint through the API gateway, so service B has something like serviceb.api.intuit.com. But service B, because it is now mesh-enabled, has a second endpoint, which is serviceb.mesh, for example. And the client of service B, in this particular example service A, can decide: I want to call service B through the API gateway, via serviceb.api.intuit.com, or through the mesh. So this was one step of the migration.

Then, when we start switching the traffic, if the services don't need to be exposed outside of our network, we offboard them from the API gateway. For example, service B and service C in this example are just internal services, and only service A needs to be exposed outside. So now all service-to-service communication happens through the service mesh, and services that need external exposure are still onboarded to the API gateway.
I think many of you assume that the Envoy proxy adds latency. In our case it is completely the opposite, because we no longer have to route the traffic to a completely different VPC and a completely different account for the API gateway; now the traffic either stays local in the cluster or goes directly from cluster to cluster. We see more than 30% latency improvement on service-to-service communication. Again, in this particular example the communication is mTLS encrypted: service A presents its identity to service B, service B presents its identity to service A, and they establish this communication channel.

The way we implemented the API gateway, it is fronted by an ALB, which means it is on a public network. By moving the service-to-service traffic through the service mesh we keep the traffic in the private network, and we significantly reduced our public cloud provider cost with this. One more plus for this communication: we used to rotate the certificates and keys that services need to communicate with the API gateway manually, every 90 days. Now, with the service mesh, it is a matter of hours and it is fully automated, so developers don't have to rotate certificates and keys. Also, in the past, when the traffic went through the API gateway, the service owner only got observability from the API gateway to their service. Now we have transactional visibility: we can see the entire chain from service A to service B to service C.

So this is our overall architecture. I will start from the bottom. Into the Kubernetes pod we push the Envoy proxy, and we have one additional container, which is a cert agent wrapper in our case. The identities in our certificates are not SPIFFE compliant.
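The mTLS behavior described above, where both sides present identities before a channel is established, is what Istio enforces with a PeerAuthentication policy. A minimal sketch, assuming a standard Istio install (the talk does not show Intuit's actual policy layout):

```yaml
# Mesh-wide STRICT mTLS: sidecars accept only mutually authenticated traffic.
# Placing the policy in the root namespace (istio-system by default) applies
# it to the whole mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```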
So we have a different format for the certificates; that's why we have our own implementation of certificate rotation, and we pull the certificate from our certificate authority. So this is the data path. For the control plane, we have Istiod per cluster, so a control plane per cluster. On the right side you can see the certificate authority, you can see Admiral, which I am going to talk about in a moment, and of course the integrations we target. So those are the more central components; Istiod is per cluster.

I just want to mention Admiral. If you remember, we have service A, service B, service C; however, those services may be in different Kubernetes clusters, and we have Istiod per cluster. So we had to find a way to enable service discovery across the clusters. Admiral is another open source project that Intuit donated to the Istio ecosystem; now, with Istio going to the CNCF, hopefully Admiral will also go to the CNCF. Admiral is responsible for synchronizing the endpoints, to do service discovery across the clusters.

Just one example of what we did. Initially, we let our developers manually onboard their services to the service mesh: they go to the dev portal, they click "enable mesh", and we send them PRs with the Istio label, and their services are on the mesh. However, what we did recently is make this a day-zero experience: if our developers create a new service, we don't even ask them; the service-to-service communication always happens through the mesh. So the API gateway is only for external communication. Okay, that's it from me.

Okay, thank you. Thank you, Yasen. So, just out of interest, how many in here know what progressive delivery is? All right, so half maybe. So I'm going to give you a teaser on why progressive delivery is really cool
and why we use it at Intuit. If you want to know more about progressive delivery, please come by the Argo booth over in the project pavilion, and me or one of the maintainers would be happy to give you as much detail as you could possibly want.

So one of the main reasons for using progressive delivery at Intuit is that, as the platform team, we build the developer experience platform for our 5,000 developers. One of the main reasons is increasing operational excellence, because we also run all the services for all of Intuit's almost hundred million customers. One thing we do a lot is introduce change into the system: someone does a configuration change, someone deploys a new version, someone deploys a brand-new service. Every time you introduce change into the system, there is a chance that something might go wrong. When that happens, we want to make sure that we minimize the impact of that change. Using progressive delivery, we can roll out your new canary, your new stack, to a subset of users, do some testing, and then increase the number of users that see that canary. That way we reduce the blast radius of anything that might go wrong.

The other one: when something happens, it takes a while. Something goes wrong, a user finds out, they call in, some developer gets woken up, and you start trying to troubleshoot what's going on, eventually figure out what the heck went wrong, and then you start fixing it. That whole process,
which I think most of us have been through, takes a while, and that just leads to a long MTTR, a long mean time to recovery. Using the automation that progressive delivery offers, we can increase the reliability of the process and also reduce the time, because we can automate a lot of it. If something goes wrong during one of those steps, we just automatically roll back. Instead of putting everything out there, waiting until something happens, and relying on the calls that come in, we automatically run a bunch of tests; if they fail, boom, we roll back right away. So that helps with troubleshooting, or you don't even have to troubleshoot; it just gets us back to a known good state much faster.

The other thing we noticed when we talked to some of our developers was that they spend quite a lot of time figuring out whether a change they made, a release they did, is actually healthy. They build their own custom dashboards, they look at logs, they look at other dashboards, and they just sit there babysitting their applications to make sure they're actually good. They spend a fair amount of time doing that. By automating this process, and building more reliability and dependability into it, we can free up time for these developers. They don't have to babysit and make sure that everything is working, because we're doing that automatically. They can deploy, rest assured that the automated tests fill the function of the dashboard, and keep doing what they want to do, which is coding rather than babysitting applications.

So, as a surprise to probably no one here, we use Argo Rollouts for progressive delivery. And just like Yasen mentioned for service mesh, we've made it a day-zero experience.
So when a user comes into our dev portal and creates a new service, they get automatically onboarded with Argo Rollouts. We still have a large number of services that were created before Argo Rollouts, and for them we built an opt-in migration experience. They were not forced to migrate, but there's a very simple path to get there. And of course everything is stored in Git; I probably don't even have to mention that these days.

Rollouts we deploy into all our 250 to 260 clusters, and it's managed by the Argo teams. We've divided the rollout up into waves depending on the business units we have, so the business units, depending on how risk-averse they are, might be in a later wave. Wave one is pre-prod; we soak there for a bit, and then we work down the waves. So once we hit the really business-critical ones, we know that everything is working fine.

This process of getting service mesh and progressive delivery working together wasn't completely free of challenges. One thing we noted initially was that we need multiple traffic providers, and initially Argo Rollouts didn't support multiple traffic providers. Having east-west traffic and north-south traffic coming in via different paths meant we couldn't really track metrics for our progressive delivery across those two different paths. Once we fixed that in Argo Rollouts, which now supports it, we realized that it's actually pretty hard to figure out what metrics to use and what aggregations to look at when you evaluate the health of your application.
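The multiple-traffic-provider support mentioned above can be sketched as a Rollout whose canary strategy drives both Istio (east-west) and an ALB ingress (north-south) at once. This is an illustrative sketch; the resource names are hypothetical, and the exact fields depend on the Argo Rollouts version in use.

```yaml
# Canary strategy with two traffic routers configured side by side.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: service-a
spec:
  strategy:
    canary:
      canaryService: service-a-canary   # hypothetical Service names
      stableService: service-a-stable
      trafficRouting:
        istio:                          # east-west: mesh traffic
          virtualService:
            name: service-a-vsvc
        alb:                            # north-south: ingress traffic
          ingress: service-a-ingress
          servicePort: 80
      steps:
        - setWeight: 10
        - pause: {duration: 1m}
```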
So now the goal is actually to make sure that we're using the service mesh for everything. Even though this is supported in Rollouts now, it makes it harder for the service teams to figure out, if they're using the HTTP request metrics, which ones actually tell a better story about their application's health. So we're looking at using the service mesh for everything, because then we have a single aggregation point, through the service mesh.

Another thing we ran into was that we were using the Admiral tool that Yasen mentioned earlier, and it creates the endpoints as an abstraction over the Kubernetes service names. Unfortunately, Argo Rollouts depends on the Kubernetes service names. That was something we realized as we were starting to test: hey, this is probably not working. So we had to fix that; it's now fixed and working.

How many in here think that migrations are fun? Okay, we have five people that work for a migration tool company. But we're now looking at two migrations here. We're talking about service mesh and progressive delivery, and we have 2,000 services that were there before we made this a day-zero experience, and telling them to do not one migration but two didn't really raise that much excitement. So we spent a lot of time, like Yasen said as well, making it a super simple experience. In the developer portal it's a drop-down; I don't know if you can see that, maybe the first two rows can, but for you guys in the back: there's a drop-down that says "enable mesh", and another drop-down that says "migrate to progressive delivery."
So just by going into a dev portal drop-down and clicking, you can move your service over to progressive delivery or service mesh. We've also run a bunch of programs internally, called fix-it days or game days, where we basically put up a challenge for the teams: the team that migrates the most services during a day gets a prize. We do checkpoints every hour and just make it a competition, to encourage the teams to do this migration. Instead of coming in with a big hammer, like "thou shalt migrate", we try to spur them to do it organically.

Also super important: stopping the bleed. The first thing we did was the day-zero experience, and then the migration, to make sure we're not just onboarding more services that we then have to migrate. Let's stop the bleed first, and then we can migrate what's left in the infrastructure.

I already mentioned this a little bit, but configuring those analysis templates is actually a challenge even for the service teams that know the services, because we have a wide variety of services. A lot of our software is tax preparation software, and for some reason people like to do their taxes two days before the tax filing deadline, so we have some services with a very strong seasonality. We have other services that are more geared towards small business, so they run all year but maybe peak around Black Friday. So we have big differences in the traffic that comes into these services: some have the same amount of traffic all year, some have almost all their traffic in two weeks and then almost nothing. So if you're going to do an HTTP-request-based analysis, how do you do that when you basically don't have any traffic in the service for three weeks?
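The HTTP-request-based analysis discussed above typically takes the form of an Argo Rollouts AnalysisTemplate querying Prometheus for the canary's success rate, using the request metrics the Envoy sidecars emit. This is a sketch with illustrative defaults, not Intuit's actual template or thresholds:

```yaml
# Success-rate check over Istio's sidecar-reported request metrics.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      successCondition: result[0] >= 0.95   # illustrative threshold
      failureLimit: 3                       # abort after 3 failed measurements
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # hypothetical address
          query: |
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}",response_code!~"5.*"}[2m]))
            /
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}"}[2m]))
```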
Right. So we spent a lot of time with our teams, trying to apply the 80/20 or 90/10 rule, to figure out the best default that we can provide. We also make sure that, when we create these templates and clone them into the Git repos when they create the service, there's an easy way for them to change those values in case they need to customize them with custom metrics, or change the values that we provide.

How are we doing time-wise? So, what's next? Some of the other things we're looking at. Traffic mirroring is supported in the service mesh. We have some services that don't even want to do progressive delivery at all; they want to make sure they do apples-to-apples comparisons. So traffic mirroring, traffic shadowing, traffic cloning, whatever you want to call it, is basically a new path where you send off a certain amount of production traffic to a stack that your users aren't on, so that you can test those two stacks against each other without really affecting any users. We're looking at how we can use that with Argo Rollouts.

Argo Rollouts can do the canaries based on pods and just apply a weight, or we can do percentage traffic: 10% of the traffic, scale up to 30%, and so on. But there are use cases where we want to do something a little bit smarter: have, say, people from a certain region, or users of a specific browser, go to the canary. So we're also looking at using header-based routing, so we can do a little bit more intelligent canaries, rather than just a percentage of the traffic.

I think the biggest thing that we're really excited about is anomaly detection and observability. Because even if you have these analysis templates run analyses during your progressive delivery, it's still a fairly rudimentary measurement, just checking a metric, right? So what if we can do something a lot smarter than that?
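The mirroring and header-based routing ideas just mentioned both exist as Istio-backed canary steps in recent Argo Rollouts versions. A sketch, with hypothetical names and header values, assuming a Rollouts version that supports managed routes:

```yaml
# Canary steps that first shadow traffic to the canary, then route only
# requests carrying a marker header to it.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: service-a
spec:
  strategy:
    canary:
      canaryService: service-a-canary
      stableService: service-a-stable
      trafficRouting:
        managedRoutes:            # routes Rollouts is allowed to manage
          - name: mirror-route
          - name: header-route
        istio:
          virtualService:
            name: service-a-vsvc
      steps:
        - setCanaryScale: {weight: 25}
        - setMirrorRoute:         # shadow half of GET requests to the canary
            name: mirror-route
            percentage: 50
            match:
              - method: {exact: GET}
        - pause: {duration: 10m}
        - setHeaderRoute:         # then send only tagged users to it
            name: header-route
            match:
              - headerName: x-canary
                headerValue: {exact: "true"}
        - pause: {}
```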
We have a lot of data scientists, and we do a lot of ML at Intuit. What if we could take all our operational data, utilize some of that ML knowledge we have, and figure out a smarter way of measuring whether the rollout is progressing the way it should, rather than just looking at CPU usage, memory usage, or HTTP requests, which are still pretty crude, right? If you can use ML to compute an anomaly score, a good measure of how the application is doing, that's a much better way, and it will give engineers even greater trust in the system, in whether it's healthy or not.

And you can take that even a step further: why stop at the progressive delivery? If you can figure out when your application has an anomaly, why wouldn't you want to run that all the time? It might not be part of the delivery process, but something might happen once you've fully promoted to a hundred percent. What about something happening two days down the line? You would still want to know: hey, something anomalous happened, I want to roll back. So that's also something we're looking into: how can we not just make the rollout of a new version better, but also increase the reliability and fault detection for the services that run in our system, after we've fully promoted the canary?

And I think with that, we're going to switch over to a live demo.

Great. Other than anomaly detection, my favorite is the header-based routing, because you can do dark releases. So during tax filing we can have only Henrik's taxes go through the new version, right? Okay. So we knew that there would be some interest, although, yeah, the room is really full. We would like to entertain you a little bit with a live demo; hopefully it will all work. Let me quickly show you what I have in the demo. This will probably be slow.
Let me explain the application that I'm going to deploy. People that are experienced with Argo have probably seen this, or have done it themselves, but I just want to go through it one more time. I have three nodes, and in the cluster I have Argo Rollouts, and I have Istio in demo mode. The Istio version is the latest one that I think is available right now, 1.13.3. I will go back to this in a moment.

Meanwhile, I would like to show you the application that I'm going to deploy. Very simple. This is version one, which I will deploy initially, and version one just says "hello Valencia". That's it. Once this is done, I will deploy version two, and I artificially added some errors: it says "hello Valencia" with an error, and I actually send the HTTP 500 status code. This means Argo Rollouts needs to figure out that something is wrong and roll back. When it rolls back, I will deploy version three, which is again a healthy version, so the status code is 200 and the message should be "hello Valencia from Intuit". Oh, okay, yeah, I will fix this.

So the Argo Rollouts version is again the latest one. I will create a namespace called kubecon, and I will label it so it is enabled for Istio. Actually, let me run a curl that will print the body and the status code. Right now there is no application deployed, so it is a 400, and I will deploy the first version. The application responds on this FQDN, and I am deploying the first version. Okay, good catch. Can you send a PR, please? Okay, it takes some time. Yeah, this is the problem with a live demo: you cannot rely on fluctuating internet speed. Okay, so the first version is deployed.
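The namespace step in the demo (create a namespace, label it so it's enabled for Istio) uses Istio's standard automatic sidecar-injection label. A sketch of that manifest:

```yaml
# Namespace labeled for Istio sidecar injection, as in the demo.
apiVersion: v1
kind: Namespace
metadata:
  name: kubecon
  labels:
    istio-injection: enabled
```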
Let me see. Okay, so I have two pods running, and you can see it says hello, okay, with a 200 status code. So, what I have in the Rollout here: you can see the strategy is canary, and I want dynamic analysis. The canary service name is this one, and the stable service name is this one, and the progression will happen like this: 10% of the traffic will be directed to the canary version, and we will wait for one minute to run the analysis. Then, if everything is good, we'll progress to 30% and wait for 30 seconds. Then half of the traffic, and we wait for another 30 seconds. And you see that we are doing traffic routing with Istio, and we modify the virtual service; I will actually show you the virtual service in a moment.

So I will deploy the second version. Yeah, this is probably something that takes time; please hold on, I'm sure it will happen. You can see that now the new version is "hello Valencia" with an error and status code 500. You can see the virtual service in Istio right now has weight 90 for the primary and 10 for the canary version, and this is modified by Argo Rollouts. So now, if we watch here, my configuration is to wait 60 seconds, one minute, before starting the analysis. Right now we have 43 seconds, 44, so it should go to 30% in a second and start the analysis. You can see that from time to time I'm getting this error.

So, yeah, the analysis just ran, and my configuration is that it gets three tries; if it fails three times, it will fail on the fourth measurement and go back to the original version of the application. Yeah, so now it is aborting. It will terminate this pod, and you can see all the messages now going back to the original version with status code 200. And again, to show you the virtual service in Istio: the weight for the new version is zero, and all the traffic goes to the original version. I will deploy now the healthy version three. Hopefully it will take less time than
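The demo Rollout just walked through (10% for one minute, then 30% and 50% with 30-second pauses, Istio virtual-service routing, background analysis) could look roughly like this. The resource, service, and template names are illustrative, not the demo's actual ones:

```yaml
# Canary rollout matching the demo's described steps.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: hello-valencia
  namespace: kubecon
spec:
  strategy:
    canary:
      canaryService: hello-valencia-canary
      stableService: hello-valencia-stable
      trafficRouting:
        istio:
          virtualService:
            name: hello-valencia-vsvc   # weights 90/10 etc. are written here
      analysis:
        templates:
          - templateName: http-status-analysis  # hypothetical template name
        startingStep: 1                         # analysis starts after step 1
      steps:
        - setWeight: 10
        - pause: {duration: 1m}
        - setWeight: 30
        - pause: {duration: 30s}
        - setWeight: 50
        - pause: {duration: 30s}
```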
earlier. So, for this one, if I go back here, you will see that it sends HTTP status code 200, so it should be successful, and it should say "hello Valencia from Intuit". I ran through the demo just before the session started and it wasn't this slow. So this probably eats into the time for questions, but we will stay longer and we can chat after this. We have about three minutes left. Yeah, it's done. So this is the last version. Thanks. Any questions? This will progress to complete in a moment, but yeah, we can take questions now, I think.

Yasen, you're using Argo and Istio to keep your applications up to date. What are you using to keep Argo and Istio up to date?

Yeah, we are using Argo CD to keep these up to date. We go 100% GitOps, so our configuration is in Git, and Argo CD synchronizes the configuration for those tools. I'll repeat the question. Yeah, because if I add one more component here, it would take more time. If I added Argo CD and sent PRs to GitHub, that one more component would add latency; I just wanted to focus on Argo Rollouts and the service mesh. Otherwise, in reality, we are using Argo CD. Yes. Yeah, another question down there.

I have a question. Right now you look at the HTTP status code, right, and if it's not 200 you roll back. Is it possible to look at some other metrics, like Kafka queue metrics, or a business metric like payment success rate?

For the analysis templates, you can basically use anything that can be served as a metric. We go to Prometheus, so basically anything that you can serve as a metric through Prometheus, you can use to base your analysis templates on.

Cool, thank you. Any other questions? Yeah, great.

Does your service mesh cross VPC boundaries or cloud boundaries, and if so, how are you managing that cross-VPC, cross-cloud connection?

That's correct.
We have a Kubernetes cluster per VPC, an EKS cluster per VPC, and we also have an Istiod instance per VPC; however, it is the same mesh across all clusters. And I think underneath the mesh we are using Transit Gateway for connecting the different VPCs. And the demo actually succeeded, eventually. Thank you.

Do you have time for one final question? Yeah, I still have audio, so let's do one more.

You use Argo CD to deploy Argo Rollouts, but what do you use to deploy Argo CD and keep it up to date?

Argo CD. So we have a number of Argo CDs; basically, each one of our business entities has their own Argo CD, and then we have an Argo CD that manages the other Argo CDs.

So what's the initial Argo CD? How do you deploy it? How do you deploy the day-zero Argo CD? Sorry, like, which is the first one?

What is the first Argo CD, is the question, right? Yeah, he means: when you first bootstrap the first-ever Argo CD, how do you go about it? The bootstrap Argo CD that you used to deploy all the Argo CD clusters. Yeah, so that one has been there for a while, and I wasn't there when it was done, but I assume that was done manually. I don't know, to be honest, actually. Yeah, we have Argo CD as an add-on for the EKS cluster. So when you deploy the Kubernetes cluster, it comes pre-built with Argo CD, and all the services that you deploy in this cluster already rely on that Argo CD. So it is an add-on in the Kubernetes cluster. For new clusters it's part of the bootstrap, but the original one is... Thank you. Thanks, everyone.