Thanks, everyone, for joining the session today after the break — I hope you had a nice one. Let me introduce myself. I am Sumit Mathur, an engineering manager on the traffic team at Intuit, and with me I have Sushant, the traffic lead at Intuit. Our topic is scaling service mesh self-service beyond 300 clusters. We are going to break down what "scaling" means here, what "self-service" means here, and what those 300 clusters are, all in the next 30 minutes. Let's get started.

What is Intuit, and what have we done? Intuit is a fintech company based in the US with accounting and tax products. We believe in open source: some of our open source contributions are Argo and Admiral, and we are also users of projects like Kubernetes, Istio, and Prometheus. This is the scale at which we operate: more than 300 Kubernetes clusters, more than 16,000 namespaces, and more than 2,000 microservices — it's like a factory of microservices — with thousands of teams and around 10,000 developers using our platforms. We are on a journey toward a modern SaaS platform, building a lot of abstractions to ease the life of our developers. From the traffic team, our contribution to the Istio community is Admiral. You can read more about Admiral on GitHub, but at a high level, it enables cross-cluster communication and service discovery in the service mesh world.

Before we dive into today's topic, I want to make sure we all understand some basic traffic terms. May I know how many of you know what an API gateway is? Do you know what Istio is? Nice. Do you know what a service mesh is? Thanks. I see a good number of people who know these terms and a few who don't, so let's start by understanding them. This is our agenda: what an API gateway is, what a service mesh is, common functionality like rate limiting and traffic dialing, our problem statement — why we are here, and why you are here — then the solution, and what's on the roadmap for us.

So let's look at what an API gateway is. Consider a developer whose core job is to write the business logic — an API that generates revenue for the company. As they develop the API, they get their first client; they share the API spec and the authentication mechanism, and it works well, no issues. But as the business grows, you get more clients: desktop clients, mobile clients, browser clients, even applications within your own ecosystem or third-party applications that build new capabilities on top of yours. Now the developer who was writing business logic has to take care of a lot more: how authentication works for all these different kinds of clients, how authorization works, what happens if one client starts misusing the APIs and overloads or attacks you — how do I do rate limiting? How does traffic dialing happen when the business expands and you want to dial traffic depending on the HTTP APIs or the type of client? That's where the API gateway comes into the picture.
It takes care of all those cross-cutting operations — traffic dialing, rate limiting, authentication, authorization. All your clients connect to the API gateway; it handles those concerns and then forwards the request to your backend service. That's an API gateway at a high level. At Intuit, we have been developing our own API gateway for the last 10 to 12 years because of our custom requirements — we are not using what's available in the market. It takes care of authentication, authorization, rate limiting, and more.

With that understanding of API gateways, let's cover what a service mesh is. In Kubernetes, the typical architecture has a service deployed in one cluster communicating with another service in the same cluster or in a different cluster. Now, if you want the same capabilities we discussed for the API gateway, you end up with something like this: service S1 communicates with service S3 via an API gateway. But notice something odd here — your traffic leaves the cluster, goes to a different VPC or networking zone, and then comes back into a Kubernetes cluster, all within your own ecosystem boundary. That's not what we want in the Kubernetes world, where we have pods and containers. What we really want is for each pod to get its own mini API gateway that takes care of all the things we talked about. In the service mesh world, that capability is provided by Istio: Istio handles your traffic management and enables service-to-service communication, and that's what we call a service mesh.

With those two terms clarified — gateway and service mesh — let's look at some of the core capabilities they provide, so we can build up the story of how we got here. Consider rate limiting. Rate limiting means how many transactions your service is allowed to serve. Let's take an example: a service fronted by an API gateway or service mesh that allows only one transaction per minute. At the beginning of the minute, the first request arrives; it is within the limit, so it is forwarded to the backend service. Then, say at the 30th second, a second request arrives. At this point the gateway or mesh rejects it, because it crosses the threshold. That, at a high level, is rate limiting.
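To make that concrete on the mesh side: Istio lets you inject a rate limiting filter into the Envoy filter chain through an EnvoyFilter object. Here is a minimal sketch modeled on the local rate limit example in the Istio documentation — it uses Envoy's stock local rate limit filter rather than a custom one like Intuit's, and the workload label and namespace are illustrative:

```yaml
# Minimal sketch: limit a sidecar-injected workload (app: s1) to
# 1 request per minute using Envoy's local rate limit filter.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: s1-local-ratelimit
  namespace: demo                # illustrative namespace
spec:
  workloadSelector:
    labels:
      app: s1                    # illustrative workload label
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
              subFilter:
                name: envoy.filters.http.router
      patch:
        operation: INSERT_BEFORE   # inject at a specific point in the chain
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            "@type": type.googleapis.com/udpa.type.v1.TypedStruct
            type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
            value:
              stat_prefix: http_local_rate_limiter
              token_bucket:        # 1 token, refilled once per minute
                max_tokens: 1
                tokens_per_fill: 1
                fill_interval: 60s
              filter_enabled:
                runtime_key: local_rate_limit_enabled
                default_value: { numerator: 100, denominator: HUNDRED }
              filter_enforced:
                runtime_key: local_rate_limit_enforced
                default_value: { numerator: 100, denominator: HUNDRED }
```

With this applied, the first request in each minute is forwarded and the second is rejected with a 429, matching the example above.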
Now let's understand what traffic dialing means in our use case. Consider a client and a service: the client connects to a service deployed in Kubernetes — say, on deployment one — and all requests go to that deployment. This works well while your service is small, but as business requirements grow, you shard your application by use case, either by HTTP verb or by application. So in reality the service is deployed over multiple deployments, and the scenarios look like this: GET v1 calls go to deployment two, update calls go to deployment one, and GET v2 calls go to deployment three. That's how you shard your application, and that's what we call traffic dialing. One more thing to note: what we really want is that when a client makes a call to the service, the request already knows where it has to go before it exits the Envoy proxy — the Istio sidecar — on the client side. Traffic dialing really needs to happen on the client side; you can do it on the service side, but that's a costly operation, because you keep redirecting requests. So that's traffic dialing.
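In Istio, this kind of client-side routing is typically expressed with a VirtualService, which the client's sidecar evaluates before the request leaves the pod. A minimal sketch, assuming a host s1.demo.svc.cluster.local and subsets named after the three deployments (all names illustrative):

```yaml
# Minimal sketch: dial traffic by HTTP verb and path on the client side.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: s1-traffic-dialing
  namespace: demo
spec:
  hosts:
    - s1.demo.svc.cluster.local
  http:
    - match:                         # GET v1 calls -> deployment two
        - method: { exact: GET }
          uri: { prefix: /v1/ }
      route:
        - destination:
            host: s1.demo.svc.cluster.local
            subset: deployment-2
    - match:                         # GET v2 calls -> deployment three
        - method: { exact: GET }
          uri: { prefix: /v2/ }
      route:
        - destination:
            host: s1.demo.svc.cluster.local
            subset: deployment-3
    - route:                         # everything else (updates) -> deployment one
        - destination:
            host: s1.demo.svc.cluster.local
            subset: deployment-1
```

The subsets here would be defined in a companion DestinationRule that selects each deployment's pod labels. The catch, as we'll see, is that this object has to exist in every cluster where a client runs.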
So now, with a basic understanding of gateways, service meshes, rate limiting, and traffic dialing, let's look at the complexity that arises when these capabilities have to work at the scale of 300-plus production clusters. Typically at Intuit, when we develop a new microservice, we start in one region and, as the business grows, expand to a second region. All services sit behind the API gateway and take traffic from it. Any traffic coming in over the internet goes through the API gateway, and we call that North-South traffic. Now consider a service that communicates within your own ecosystem. How does that communication happen? If you remember a few slides back, I talked about service mesh: any service deployed within your ecosystem boundary in a Kubernetes cluster communicates over the service mesh. So this is the typical architecture at Intuit for any service: internet traffic comes through the API gateway, and traffic within Intuit's boundary generally goes through the service mesh.

Now let's configure rate limiting in this architecture. For North-South traffic, where should the rate limiting be deployed? At the API gateway, right? So you configure rate limiting for region one, you configure it for region two, and your North-South rate limiting is done. What about East-West traffic? Where do you configure rate limiting for East-West traffic? At the service's cluster. So you configure the rate limit on the cluster, similarly in the other region, and this works really well for East-West traffic. But look at what we've done: we've configured rate limiting at the gateway and again in the service mesh. How do we keep them in sync? That's a real problem. In our case, we started this journey 10 to 12 years ago with only the API gateway, and Istio was adopted much later, so many services ended up configuring rate limiting at the gateway and then again in the mesh. With 2,000 services running in production, keeping all of this in sync is a huge challenge. That's our first problem statement. Let's hold on to it and look at traffic dialing. Going back to our deployment architecture: where should I configure traffic dialing for North-South traffic? At the API gateway, because that's where the internet traffic arrives. So you configure traffic dialing at the API gateway. But what about East-West traffic? If you remember from a few slides back, traffic dialing needs to happen on the client side — if the request has already reached the service side, there's little point in dialing it. We want to handle it on the client side, before the request exits the Istio proxy. And that's how you configure East-West traffic dialing.

This looks very good: you work with your client, configure their YAMLs with all the traffic dialing rules you want, on their timeline, and it works perfectly well — if you have just one client. But what happens when you have hundreds of clients? In our case, a single service can have hundreds of clients spread across clusters, so you get scenarios where clients running on hundreds of clusters all connect to your service. So what do you do? You work with your second client, get their timeline — you are at their mercy, waiting for their availability — and configure traffic dialing on client two. During this phase you are completely dependent on them, and that's frustrating: you cannot scale your application because of that dependency. So you work with your third client, your fourth client, and in the end you are really, really frustrated and annoyed — this is not what we want at the scale of 300-plus clusters. What we really want is something different, something that lets us do all of this seamlessly without depending on the clients. So let me call Sushant up to discuss the solution.

Thanks, Sumit, for setting the context for our talk today and explaining the basics. So let's get started. As Sumit said, we're going to discuss two problems in adopting a service mesh alongside the API gateway we already had. The first is service mesh config distribution and reconciliation at scale. We deploy istiod in every single cluster, which means the client-side configurations have to be distributed to every single cluster — we have an Istio control plane running everywhere. So if my service has 300 different clients hosted on 300 different clusters, it's a pain to work with everybody to distribute the configurations, and it's a drag on our developer velocity. The second problem, again as Sumit mentioned: we have an API gateway that we developed over the last decade and kept improving, while the service mesh is much more recent and solves a completely different slice of traffic — the mesh handles East-West, the API gateway handles North-South. So any configuration — rate limiting, traffic dialing — has to be done on both sides and kept in sync, with no drift. Imagine thousands of developers at a company working on this: there is going to be drift, and incidents will be caused by it. So we'll walk you through the story of how we went about solving this.
There are many different ways you could solve this; this is the way we solved it at Intuit. Let's tackle the config distribution problem in the service mesh first. To solve it, we introduced a new component called Navik. Let's use the same example Sumit started with: a service hosted on two clusters in two different regions. We introduce Navik, running in a global control plane — meaning it is not running in every single cluster. We have it running in one cluster in region one (West), with a DR setup that we won't talk about today, and Navik also runs in the other region. Navik monitors the API servers of every single cluster, and it knows where a particular service is hosted — in this case, those two clusters. How does it know? We use annotations: we introduced an annotation called identifier, and when the value is the same, Navik concludes it's the same service hosted on those two clusters.

We also introduced a new custom object — we'll be talking about abstractions from here on. This first abstraction is the TrafficConfig custom object, which the owners of a service (in this case, service one) use to control the runtime behavior of the service mesh. Navik continuously monitors changes to this custom object. The config has options for CORS, rate limiting, traffic dialing, timeouts, and a bunch of other things. Say the owner wants to change the service's rate limit from 50 to 75 TPS. Navik immediately identifies that a config change happened for that service, knows which two clusters the service is hosted in, and calls the API server of the cluster in each region to write the real config. What is this real config? At Intuit we have written our own custom rate limiting filters, and such a filter has to be injected at a particular position in the Envoy filter chain. The way to do this in Istio is with the custom object called EnvoyFilter. So Navik generates the EnvoyFilter with the necessary configuration — updating the value from 50 to 75 — and writes it to both clusters. And that's it: we have abstracted the service-level configuration. Whether the service is hosted on two clusters, 50 clusters, or 100 clusters, the owner doesn't have to go update configurations everywhere; they work with one custom object in one global control plane, and Navik distributes it everywhere. And because the EnvoyFilter is generated by automation, there's no scope for config mistakes.

Now to the second problem Sumit brought up: client-side config distribution. To do this, we need a way to model the different clients a particular service has. We contribute to Admiral, so we reused the same Dependency custom object that Admiral uses, and Navik continuously monitors these dependencies. A Dependency object names a client by its identifier and lists the services it depends on — say, service one, service two, and service three. Sticking with our discussion, if service one has five different clients, then we have five Dependency objects.
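For reference, a Dependency record looks roughly like this — the shape follows Admiral's Dependency type (see the Admiral repo for the exact schema), and the identifiers are illustrative:

```yaml
# Sketch of an Admiral-style Dependency: "client1 depends on these services".
apiVersion: admiral.io/v1alpha1
kind: Dependency
metadata:
  name: client1
  namespace: admiral
spec:
  source: client1            # identifier of the client workload
  identityLabel: identity    # label used to identify workloads across clusters
  destinations:              # services this client calls
    - service1
    - service2
    - service3
```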
Let's introduce the client into this discussion. We have client one hosted on these two clusters. How does Navik know about it? Again, through the annotations — it's monitoring those clusters, so it knows where all the clients are hosted. Now say the owner of service one comes in and says: I want to change the traffic sent by this client — move all the GETs to West and the writes to East, or some change like that. Essentially, config distribution has to happen on the client side. Again, there's a config change; Navik learns about it, it knows which clusters the client is hosted in, and it calls the API servers of those clusters to distribute the right custom objects. For traffic dialing we have not built our own custom filters; we use the VirtualService custom object provided by Istio. We configure the right values and write to the API server in question — in this case, the two client clusters get updated.

So if you really look at it, we were able to abstract both the service-side changes and the client-side changes. For a user — a developer at Intuit — with deployments and clients spread across services and clusters, all of that is abstracted away. One byproduct we also got from this: we can now handle breaking API changes, or YAMLs that keep changing — the custom object structure can change across Istio releases — so the mesh admins are happy about this too. If we had instead asked all our users, thousands of developers, to use the Istio custom objects directly, then any breaking API change — say the API moves from v1 to v2, or some structure in the YAML changes — would mean working with everybody at Intuit. That's not what we wanted. Since everybody works with the abstractions — the TrafficConfig object and the Dependency — we can transparently roll out breaking Istio changes across clusters.

At this point, let's start talking about our second big problem: how to control the config drift between the API gateway we already had and our service mesh. As Sumit said, we built our API gateway over the last decade and kept improving it, so we are not using the new-age API gateways out there. But how do we keep the configurations in sync? Because in the end, whether traffic comes through the API gateway or the service mesh, it's the same set of pods that finally gets the request. So we wanted a solution to keep these configurations in sync. Let's see how that's done. This is how self-service is currently done at Intuit.
We have a developer portal where thousands of our engineers manage everything related to their development needs — creating new namespaces, new AWS accounts — and also configure our API gateway. The gateway config is done with a YAML; you can use it to change CORS, rate limiting configurations, traffic dialing, anything you need. And if you look at it, this is where the inspiration for the TrafficConfig from the previous slides came from. When we went through the exercise of defining what the self-service experience for the service mesh should be, we thought: let's not introduce a completely new YAML and re-educate thousands of engineers at Intuit. This YAML had already been used for five or six years to configure the API gateway, and people were familiar with it. So we decided to use the same YAML and add new fields where relevant for the service mesh. We went one step further and looked at the most commonly used features of our API gateway and abstracted those into a simple UI as well; any change in the UI is reconciled back into the YAML. All of this is stored in a config store, and the API gateway consumes this configuration.

Say somebody changes the rate limit from 50 to 75 TPS again. That config change has to be synced all the way to the API gateway. You might ask: somebody wakes up in the middle of the night and makes a config change — how does the API gateway find out? We use a standard eventing mechanism: the API gateway is a consumer of a messaging service, so on any change we push a message and the API gateway syncs the latest config. That's how any config change is distributed to the API gateway within 12 to 15 seconds.

Now let's see how we unified both worlds, the API gateway world and the service mesh world. The developer portal remains the single place for self-service across the unified API gateway and service mesh, and the same YAML and UI remain the self-service surface. So let's merge the two worlds. The left panel you see is the API gateway self-service, and the right panel is our service mesh ecosystem, where Navik is the component that, as we discussed, abstracts all the Istio configs — it generates the Istio custom objects and distributes them to all the clusters. But there's a problem: we spoke about the Dependency and TrafficConfig YAMLs, and somebody has to generate and modify them. We did not want thousands of developers getting access to update those YAMLs directly — we wanted them auto-generated. So we wrote an adapter that watches the YAML that drives the API gateway; on any change, it reconciles the two custom objects we spoke about. It follows the same mechanism: somebody changes a rate limit, the message is pushed to our messaging platform, the API gateway picks it up — and the adapter is also a consumer of the same messaging service, so it learns that something changed.
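To picture the single YAML that drives both worlds, here is a purely hypothetical sketch — none of these field names come from Intuit's actual schema; they only illustrate the idea of one config carrying both North-South (gateway) and East-West (mesh) settings:

```yaml
# Hypothetical unified traffic config -- illustrative field names only.
service: service1
rateLimit:
  tps: 75               # enforced at the API gateway (North-South) and
                        # translated by Navik into an EnvoyFilter (East-West)
cors:
  allowedOrigins:
    - https://example.intuit.com
trafficDialing:         # translated by Navik into VirtualServices on client clusters
  - match: { method: GET, prefix: /v1/ }
    target: deployment-2
  - match: { method: GET, prefix: /v2/ }
    target: deployment-3
timeouts:
  request: 5s
```

The point of the design is that the developer edits this one document (or the UI on top of it), and the adapter plus Navik fan the change out to both runtimes.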
It sees that the rate limit value has changed, say from 50 to 75 TPS. It makes a call to the config store and reconciles the two custom objects, the TrafficConfig and the Dependency. And since the TrafficConfig was edited by the adapter, Navik in turn learns that the rate limit has changed from 50 to 75, knows that it's service one that was modified, and generates the EnvoyFilters we spoke about into those two beautiful clusters we have here. The end result is that our users are really happy: they just got a text box where they edited the rate limit, say from 50 to 75, and that single change controlled the behavior across two different runtime environments — API gateway and service mesh — very easily.

Let's quickly recap what we just achieved. We were able to keep the API gateway and the service mesh in sync. We are able to distribute configs across 300 clusters in less than 10 seconds — a number we want to improve going forward — and the abstraction gives us an easy way to distribute custom objects at large scale. But the most important win for us is the increased developer velocity, for our users as well as for the mesh admins. When we started our mesh journey — I think we were one of the early adopters of service mesh, five or six years back — we initially asked our users to read the Istio documentation, use those custom objects, and configure them directly in their clusters. Guess what: they misconfigured it, and the entire cluster went down. And since we are a multi-cluster system, the misconfigured value got copied to multiple clusters, and multiple clusters went down in a non-prod environment. That's when we really decided we don't want to be in the business of having service users author configurations straight from the Istio documentation. We — the service mesh platform team — wanted to take on the responsibility of abstracting the features. Thousands of engineers at Intuit don't have to go through the Istio documentation, with the attendant chance of misconfiguring things. Since we already had self-service for the API gateway, they just use that, and it's almost hidden from them that there is an East-West configuration and a North-South configuration at all — they have one unified traffic configuration that takes care of both. And the last point, which we also spoke about: we can handle breaking CRD changes, or breaking Istio changes during upgrades, transparently. The mesh team is extremely happy because they have full authority to update configurations transparently.

So what are the next steps? We want to handle feedback automatically for write operations across clusters. Today, if a write to one of the clusters fails, we rely on logging and alerting to let the on-call know that something is broken and needs to be fixed. We're working on designs to automatically heal our ecosystem when something like this breaks — it can happen due to network issues, but we want to be resilient to such failures and fix them automatically. We are also improving observability, and we are adding the most commonly used Istio features to Navik.
We are almost at the end of KubeCon, but I wanted to share that these are some of the talks my colleagues at Intuit have presented at KubeCon and at the co-located events. We open-sourced Navik this week, and this is the link for it. The QR code here is for feedback — sorry, we don't have a QR code for Navik, but the link is available in the Sched app, and this entire slide deck is available as well. Please feel free to check out Navik and try it; if you have new use cases you'd like us to handle, or if you want to contribute, you can reach out to us at this email or on GitHub. And if you want to see what Intuit is doing in open source, please scan the code on the left and follow Intuit's open source work. Thank you. At this point, I'd like to thank everyone for being patient with us while we shared our traffic story, and we also want to thank the organizers for giving us this platform to share it. Thanks, everyone. Thank you.