Hi, all. In this session, we look at how Intuit manages traffic dialing at scale within its service mesh. Today's agenda: I'll give a quick intro about Intuit, then we'll see the problem statement, how things get complicated when we run at scale, and the solution Intuit uses to handle that scale. Today I have Venkata along with me. I'm Nandan, a senior software engineer, and Venkata is a staff software engineer. We work on the ServiceMesh team at Intuit. Let me give you a quick intro about Intuit and what we do. Intuit's core mission is to power prosperity around the world. We are transforming from a tax and accounting software provider into a global financial technology platform that does the hard work for our customers. With over 100 million customers worldwide using TurboTax, Credit Karma, QuickBooks, and Mailchimp, we believe everyone should have the opportunity to prosper. That's a quick intro about Intuit. Jumping into today's talk, let's see the problem. Coming to the infrastructure Intuit runs on: Intuit manages around 250 Kubernetes clusters, which together form a multi-cluster Istio service mesh. We run around 7,000 namespaces across roughly 77K nodes, hosting around 9,000 services deployed across multiple regions in those clusters. That is the scale at which these services run. Now the problem statement: at Intuit, services evolve at a rapid pace. This rapid pace can come from infrastructure changes, from service owners upgrading their services to a new framework or technology, from moving off bare-metal infrastructure to the cloud, or from cloud infrastructure to adopting the service mesh. Ownership of a service can also change across teams, which might mean the service needs to move from one cluster to another.
These are some of the scenarios where a migration happens: the service has to come up on a whole new stack to support the new framework or the new infrastructure. The challenge is migrating traffic to that newer service, or workload (wherever we say service, we mean a workload in the industry-standard sense). Whenever we have a use case of migrating traffic to a newer stack, it has to be progressive. We cannot shift 100% of the traffic to the newer stack directly; we have to shift it progressively. That is the challenging part: how do we do that? Let's take an example. Say there was a service A running a V1 version, and the team came up with a new stack called V2. The V2 version might run in the same Kubernetes cluster as V1 but in a different namespace, or the V2 stack might run in a completely different cluster. So V1 could be running in cluster A and V2 in cluster C, but in the same region. The other case is that service A V1 runs in cluster A and service A V2 runs in cluster C in a completely different region. These are some of the ways the new stack can be deployed. So how do things get complicated as we get into the migration use case? Suppose service A V1 is being used by client A, one of its dependents. Initially, client A points to service A V1 using the internal service endpoint, service-a-v1.internal. This is an internal mesh endpoint that client A uses to reach the V1 stack. Now service A V2 has been created to support a new framework or new infrastructure, and the client needs to migrate its traffic to the newer stack. How will it do that?
The common method service owners usually follow is to update their code base or config to point to the new V2 stack. Say client A modifies its endpoint to reach service A V2 using service-a-v2.internal. This is easy when we are talking about one client calling one particular service. But suppose multiple clients were calling service A and now all of them have to migrate to V2. Every client has to update its config repo or code base to point to the new stack. This would have been easier if it were a plain DNS use case: with a DNS migration, we usually do a cutover by changing the underlying CNAMEs so that the traffic migration is seamless. With the mesh, that is not the case. With the mesh, the routing configs have to be available on the client side, so each client knows which service it talks to, where that service is located, and how to reach it. That info lives on each client, so it's not as straightforward as a normal DNS cutover. That leaves us with operational challenges as well. For example, say the service teams are located in different geographical locations. The service team deploys a V2 stack, then some issue is identified and they want to roll it back. The clients might reside in different time zones; now the team has to communicate with all of them so that they update their code repos to point back to the older stack's endpoint. Things get complicated here. How do we manage that? The control is not in a centralized place; it is scattered across multiple clients, residing in different time zones, managed by various teams. That is the operational challenge we face here.
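To make that client-side change concrete, here is a hedged sketch of what such a per-client config edit might look like. The file layout, keys, and endpoint names are illustrative placeholders, not Intuit's actual config; the point is that every client has to make an edit like this in its own repo, and revert it on every rollback:

```yaml
# client-a/config/upstreams.yaml (hypothetical client config repo)
upstreams:
  serviceA:
    # before the migration this pointed at the V1 stack:
    #   endpoint: service-a-v1.internal
    endpoint: service-a-v2.internal   # each client must make (and later maybe revert) this edit
```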
And it doesn't stop there; things amplify at scale. Take a scenario where service A V1 and V2 are both running, and client A does not want to migrate 100% of its traffic to V2. It wants to migrate only 10% of the traffic to V2 and monitor it, so that it gains confidence the newly migrated service is stable, and then gradually move to 100%. There can also be a case where client B wants to migrate 100% of its traffic to V2 right away; maybe it is less critical, so it wants to give the new stack a full shot and see how things work. And there can be one more case where client C is a critical service and doesn't want to take any risk, so it won't migrate at all until the new stack is declared stable. There can be multiple use cases like this, and supporting them for clients scattered across different clusters and managed by different teams is not easy. And here we have considered just one service used by multiple clients; the same situation plays out for multiple services across multiple clients across all of Intuit. Migrations can be happening for service B, service C, up through service Z, each in different clusters and each used by different clients in different clusters. How do we manage all these configurations: which client reaches out to which service, and what percentage of traffic should be routed to which stack? All of this is challenging to manage. That brings us to how we are doing it and what the solution is. To explain that, I would like to welcome Venkata; he'll talk about how we are solving this particular problem at Intuit. So, that's the problem and the scale at which it causes operational challenges.
Now let's see the solution we adopted at Intuit and the path to the final design we landed on. The solution is traffic dialing. What is traffic dialing? Traffic dialing is the ability to selectively shift traffic across different service stacks without impacting clients. But how do we achieve this? Let's take the example of a single client talking to a single service. One possible way to achieve traffic dialing without disturbing the client is to plant a virtual service on the client side, so the client continues to use the same endpoint while the virtual service carries the rules to dial traffic between the two service stacks the client talks to. Let's see how it works. Client A is deployed in cluster B, and service A, running the V1 stack, is deployed in cluster A. The client is configured to reach service A using an internal endpoint, service-a.internal. At the moment, client A dials 100% of its traffic to the V1 stack of service A. Then service A deploys a new variant, V2: a new stack for service A has come up. Now, if client A wants to dial traffic in a weighted distribution to both variants, we plant a virtual service, without changing the client, that alters the routing decisions based on the weights the client configured. In this example, it is 80 towards V1 and 20 towards V2. That solves the problem for one client talking to one service with two variants running in parallel under a weighted distribution. Let's see how this gets complicated at scale. There is a service A residing in cluster A and a client A in cluster B that dials 100% of its traffic to service A. Service A deploys a new stack variant, identified on the slide by a different color, and then decides to do a weighted distribution between the two using a virtual service, the way we saw on the previous slide.
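As a hedged sketch of what that planted virtual service might look like, here is a minimal Istio VirtualService doing the 80/20 weighted split described above. The hostnames, name, and namespace are illustrative placeholders, not Intuit's actual endpoints:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-a-dial        # illustrative name
  namespace: client-a-ns      # planted alongside client A
spec:
  hosts:
    - service-a.internal      # the endpoint the client keeps using, unchanged
  http:
    - route:
        - destination:
            host: service-a-v1.internal   # V1 stack
          weight: 80
        - destination:
            host: service-a-v2.internal   # V2 stack
          weight: 20
```

Dialing more traffic to V2 is then just a matter of adjusting the two `weight` fields; the client's endpoint never changes.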
The client can also route, or dial, traffic between stacks V1 and V2 based on APIs; it is not necessary to dial traffic using weights alone. Any traffic pattern can be used to dial traffic across different service stacks. Now let's see how things get complicated. We have client A and client B dialing traffic to service A and service B in the fashion shown, and client C is a new client that has come up. At some point in time, along with client A and client B, client C wants to move its traffic to service A. But say client A and client B are not satisfied with how the V2 variant of service A is functioning and want to roll it back. They need to once again modify their virtual services to reroute the traffic back to the V1 variant of service A. So we have two problems here. One: the owners of these clients need to know the patterns that virtual services support for routing traffic, which could be based on weights, paths, or a combination, along with their evaluation semantics, and so on. Two: changes have to be made every time a service rolls out a new version or rolls back an existing one. So how do we handle this at scale, with Intuit running hundreds of services consumed by hundreds of clients? How can we provide a solution that is genuinely user-friendly? At Intuit, we have different clusters in which clients and services run, and we have a global central component called the traffic dialing service. The traffic dialing service is all-aware: it knows which cluster each client runs in, and it knows where the services run, including their clusters, regions, everything. The dependency graph between clients and services is also made available to the traffic dialing service.
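To illustrate, the kind of mapping the central traffic dialing service holds might look like the following. This schema is purely hypothetical, invented here only to show the information the talk says is involved: which client, in which cluster, depends on which service stacks, and with what configured split:

```yaml
# hypothetical dependency/dial record kept by the central traffic dialing service
client: client-a
clientCluster: cluster-b
dependencies:
  - service: service-a
    stacks:
      - name: v1
        cluster: cluster-a
        weight: 80
      - name: v2
        cluster: cluster-c
        weight: 20
```

From records like this, the central service can render the appropriate virtual service and plant it in whichever cluster the client resides in.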
With the traffic dialing service sitting in a centralized location and holding the entire mapping of communication between clients and services, a service owner can simply open a UI and decide how to dial traffic from different clients to different service stacks, per client. Underneath, the global traffic dialing service creates a virtual service. In this example with two service variants, V1 and V2, it would create a virtual service routing all V1 APIs to stack V1 and V2 APIs to stack V2, and it would plant that virtual service in the clients configured through the UI. With such a mechanism in place, client A can easily dial traffic from V1 to V2, and the advantage is that service owners need not know about the evaluation complexities that virtual services involve. What they see is a simple UI in which they specify a criterion and dial to the appropriate service stack. With this in the picture, let's redefine traffic dialing: traffic dialing is selectively shifting traffic across service stacks, without impacting clients, from a centralized control plane via a self-service portal, eventually making the process platform-agnostic. Let's see the centralized UI with a small demo. The setup for the demo includes a service A and two clients, each residing in a different cluster. To begin with, both client A and client B dial their full traffic to service A. Then a new service A stack comes up, depicted here by a different color. At some point in time, client A and client B dial a portion of their traffic to the new stack, and finally they move fully to the V2 version of service A. Then we mimic a case wherein the V2 stack of service A is rolled back, and client A and client B eventually dial their traffic back to the V1 variant of service A, okay?
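For the API-based case just described, the generated virtual service might look like this hedged sketch. The hostnames and the `/v2/` path convention are illustrative assumptions; the actual rules the central service generates are not shown in the talk:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-a-dial-by-api   # illustrative name
spec:
  hosts:
    - service-a.internal
  http:
    - match:
        - uri:
            prefix: /v2/          # hypothetical path prefix for V2 APIs
      route:
        - destination:
            host: service-a-v2.internal
    - route:                      # everything else stays on the V1 stack
        - destination:
            host: service-a-v1.internal
```

Because Istio evaluates `http` rules in order, the un-matched default route keeps all remaining traffic on V1.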
What you see here is the self-service portal for service A, with a tab called Traffic Management. In Traffic Management, you can see two clients, client one and client two, dialing traffic to two service stacks, V1 and V2, and you can see that, to begin with, 100% of the traffic goes to stack V1 from both client one and client two. Let's look at client one. Client one is a Fortio job; we have used port-forward so we can open its UI, and all the traffic going from client one to the different stacks should be visible in this Splunk view. Similarly, client two is another Fortio job, and all the traffic coming out of client two should be visible here, so we can see where it is being redirected. At this stage, to begin with, 100% of the dialing goes towards V1. Let me start the Fortio jobs for both client one and client two. Let's see where client one's traffic goes. No surprise, it should go to stack V1. Yeah, you can see the Fortio job is configured to trigger nearly 100 requests per second, and everything is going towards the V1 stack. We should see similar behavior for requests coming from client two. Yes, all 100 requests per second from client two are going to the V1 stack. Now we'll go back to the self-service portal and do traffic dialing for client one alone, to show that both client one and client two adhere to the latest configured rules. It is as simple as editing the weights across the different stacks. Let's make it 50-50, leave client two as is, and save. The portal should reflect the change. Yeah, we see the 50-50 traffic dialing configured. Underneath, the central service we spoke about would have created the appropriate virtual services for the client, wherever it resides, to dial traffic across the different service stacks at the configured ratio.
Client two we did not modify, so we should see that client two, which talks to the same service, is sending its full traffic to V1. Let's look at client one's logs and see where the traffic is going. Yeah, you can see a near 50-50 distribution across the V1 and V2 stacks, while client two should still dial its traffic to V1. Yes, 100% of its traffic is going to the V1 stack. At this stage, let's modify client two as well to do some dialing, and let's configure different weights to show that the routing is per client. Let's see if the portal reflects the change. It has, which means the underlying virtual services were modified by the global central component. Let's see where the traffic goes from client one: it should still be 50-50 between the V1 and V2 stacks, and it continues to be. And we should see an 80-20 distribution for client two. Yeah, V1 at 80 and V2 at 20, more or less; we see the 80-20 split as configured in the self-service portal. Now we go back, and at this stage, let's say stack V2 is stable and both client one and client two would like to dial 100% of their traffic to the V2 stack. Let's save it and make sure the portal reflects it. If we take a look at client one's logs, you can notice that 100% of the traffic is going to the V2 stack and nothing to V1 after the change we made. We should see similar behavior for client two; yes, everything is going towards V2. Now, suppose the owner of service A decides to roll back the V2 variant they deployed. All they need to do is come back to the self-service portal and, for all the clients talking to service A, modify the weights so the traffic reaches the previous variant, V1. They can do so by modifying the weights for each client from the self-service portal, say 100-0. Save it and ensure it is successful.
With that, once again the global component goes ahead and modifies the virtual services for all the clients according to what is configured in the UI, and we should see the result in our Splunk view: everything is back to V1 for client one, and we see the same behavior for client two as well. So this is how we do dialing for multiple services at scale, allowing different traffic patterns to be configured as criteria for dialing traffic across different stacks using a self-service portal at Intuit. We hope we could explain how, at Intuit, we offer traffic dialing as a platform capability with a simplified user experience, abstracting the internals of the service mesh and the traffic-routing complexities away from the hundreds of teams that manage different services. That ends the session. Thank you.