Good morning everyone. Today I'll be talking about running an API gateway on Istio, and the pitfalls and learnings along the way. My name is Karim Lakhani. I'm a senior staff software engineer at Intuit and the technical lead of Intuit's API gateway.

Our agenda today: the Intuit API gateway, why we built it in-house, why we migrated to cloud native technologies, our architecture, and the pitfalls and learnings that everyone's here for.

At Intuit we're leading the way in building an AI-native development experience using cloud native open source technologies, and these are some statistics on our scale. I'll call out $560 billion of money moved. Some more stats on our developer environment, which is built on cloud native technologies: at peak we're running over 1 million CPU cores, 2,000 services, each with one or more APIs, over 16,000 namespaces, and over 1,000 teams. The API gateway is the front door to all of this infrastructure.

So, the Intuit API gateway at a glance: what is its role? As I mentioned, it's the front door for all incoming requests, and it also serves a lot of service-to-service communication. In the last few years we've adopted the Istio service mesh, but before that there was still a need for service-to-service communication, and we used our API gateway for that as well. It provides routing, security, authentication and authorization of both the client application and the user, metrics and dashboards including our golden signals (availability and latency metrics for every service on the API gateway), very detailed access logging, and quality-of-service features including rate limiting and traffic dialing.

Some benefits and stats of our API gateway. It's highly scalable: we serve over 30 billion requests per day and over 1 million requests per second at peak, and we scale up to about 6,000 pods across all of our environments. It's highly available; there's really no room for downtime, so we maintain four nines of availability measured at the minute level. It's highly reliable: it's the source of the golden signal metrics, so it can't be wrong when it says a service has a certain error rate or a certain latency; it has to be trusted. It has to be low latency, so we target 30 milliseconds at the P99 for our overhead. And it has very deep self-service management through our developer portal: the onboarding experience and the configuration management experience are all integrated there.

So, why in-house development?
Our API gateway started about 11 years ago, when Intuit was still in a data center. At that time we needed to move from our monolithic architecture towards microservices, and there were not many open source options; the ones that were available were not as performant as we needed. We felt we could build our own, and we saw value in being able to customize it to fit our needs, so we built our own API gateway. Over the last 11 years we've made a lot of customizations, as we anticipated, including deep integration with our identity providers (we do a ticket exchange) as well as features to support different business use cases such as fraud detection and traffic capture.

Our approach is very similar to Netflix Zuul 2 and AWS API Gateway: a non-blocking, asynchronous architecture written in Java. One of the differences is that partitioning is built in as a first-class feature, and that partitioning is really key. In the control plane we partition the data logically, and in the data plane we partition the compute and the network, so we can isolate workloads such as QuickBooks, TurboTax, Mailchimp, and Credit Karma onto their own dedicated infrastructure and keep their data, compute, and network separate. Compared to AWS API Gateway we have more relaxed quotas, so we allow longer timeouts and larger request and response payloads.

As I mentioned, we started in the data center and then did a lift and shift into the AWS cloud. After a few years we started considering whether we should move to cloud native technologies, and I'll talk about why we decided to do that.

The first reason is Istio: we wanted integration with the service mesh. The service mesh came about three years ago, and our API gateway stood apart from it; it wasn't really connected. This put an extra burden on service developers to support traffic coming in both from the gateway and from the mesh, and on us to implement special mechanisms to establish trust between the gateway and the mesh. We wanted a deep connection between those technologies, which would improve our security, reliability, and observability. It would also reduce our data transfer costs, because we were relying on the public internet to move those bytes, and with Istio we can go over private networking. Additionally, we're looking to enable network abstraction, which is our way of giving the client developer the same experience regardless of whether they're calling an API gateway endpoint or a mesh endpoint. Before this, you would call a gateway endpoint using a .com address and a certain authentication protocol, and a mesh endpoint using a different one, so we wanted to unify that experience.

Next, we wanted to improve our observability. We had metrics and logging, but there were gaps, and we were using some legacy technologies that caused a lot of development and deployment overhead, so we saw an opportunity to improve that with Prometheus and other cloud native technologies. Additionally, our cost analytics were rather complicated; the chargeback model we have and how we report on it were all custom-built, and we saw an opportunity to get clarity there through better isolation of workloads with Kubernetes clusters and namespaces.

And finally, we wanted to take advantage of Intuit's modern SaaS platform, which we've been building over the last few years.
We want to take advantage of cloud native technologies such as Argo CD, Argo Rollouts, Argo Workflows, and the Intuit service mesh, which is built on Istio and Admiral. We were seeing issues with our custom scripts and custom deployments; it was hard for new team members to join and get onboarded, so we saw an opportunity to improve and get onto the same technology stack as other services at Intuit.

At Intuit we believe deeply in open source and open collaboration. We received the end user award in 2019 and 2022, and we've created and continue to maintain various open source projects, including Numaflow, Admiral, and Argo. We're also end users of multiple open source and cloud native technologies, including Kubernetes, Istio, and Envoy. There's a link here to our open source community if you'd like to join.

Now I'll quickly go through our API gateway architecture before I get into the pitfalls and learnings from our migration. It all starts with our mobile applications, web apps, and users, which make requests to various APIs. As I mentioned, there are 2,000+ different APIs at Intuit, not all of them user-facing, and that traffic all comes through our API gateway. The API gateway serves multiple APIs, runs in Kubernetes, and is Istio-enabled. We use Admiral, our open source multi-cluster service mesh solution, to inject the configuration for all the different clusters at Intuit, so the API gateway knows how to route each request to the right cluster and the right service. From there we route the request to the API it was meant for, and the services can then talk to each other through the service mesh. That's a high-level view of our architecture, and as I mentioned, Admiral provides the automatic configuration and service discovery because we have a multi-cluster architecture.

A little deeper look at that: in our control plane, which is Istio based, we have istiod, which holds the Envoy config and the Kubernetes config; we use Argo for our rollouts; we have Admiral, which pushes all the multi-cluster config into Istio; and we have a cert manager component responsible for certificate management. In our data plane, we accept requests from both a public load balancer (ALB) and the Istio ingress gateway. Requests come into our API gateway through the Envoy proxy, an mTLS agent manages the mTLS certs, and we use mTLS over private networking to call out to our back-end services, which gives us very secure and reliable communication.

Now, moving on to our pitfalls and learnings. I'll list them here and then go into more detail: the Istio sidecar in hybrid mode, graceful shutdown, configuration overload, outlier detection, and autoscaling and resources.

Starting with Istio sidecar hybrid mode. In most Istio deployments, traffic comes into the cluster through the ingress gateway, which does TLS termination, and then mTLS is established between the Istio ingress gateway and the Istio proxy. Our API gateway also supported an ALB ingress, and we wanted to keep supporting it, because that's where all our certificates were installed. We didn't want two different Envoys; we wanted one Envoy that could do both mTLS with the Istio ingress gateway and TLS with the ALB, and that wasn't supported. So we worked with the Istio team to make an open source contribution to the Sidecar API, adding a new TLS property that lets us support TLS termination on the Istio proxy as well as mTLS, all on one port. For more details I've listed the Istio RFC, which goes into the configuration; it only takes a few settings to set this up, as sketched below.
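To make that concrete, here is a minimal sketch of what the sidecar-ingress TLS side of that setup can look like. This is not our exact configuration: the namespace, labels, port numbers, and certificate paths are hypothetical, and the authoritative field names, plus the details of combining this with mesh mTLS on the same port, are in the Istio RFC mentioned above.

```yaml
# Hypothetical sketch: terminate TLS for ALB traffic on the istio-proxy itself,
# using the Sidecar ingress listener's TLS settings. Ports, labels, and cert
# paths are made up; the cert files must be mounted into the istio-proxy container.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: api-gateway
  namespace: api-gateway
spec:
  workloadSelector:
    labels:
      app: api-gateway
  ingress:
  - port:
      number: 8443
      protocol: HTTPS
      name: https-alb
    # Hand the decrypted traffic to the gateway container on localhost.
    defaultEndpoint: 127.0.0.1:8080
    tls:
      mode: SIMPLE
      serverCertificate: /etc/gateway-certs/tls.crt
      privateKey: /etc/gateway-certs/tls.key
```

The appeal of this shape for us is that the ALB keeps its existing certificate setup and speaks TLS to the pod, the same single Istio proxy still participates in mesh mTLS, and the gateway container only ever sees plain traffic on localhost.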
Next we have graceful shutdown. This isn't purely an Istio problem, but Istio added a complication for us. As we started rolling out our API gateway on Kubernetes, we sometimes saw errors during scale-down, and we found there are a lot of different settings that all have to be configured exactly right to get a graceful shutdown. In our case there was the idle timeout on the load balancer, a preStop sleep on the container, a termination grace period for the Istio proxy, a termination grace period on the container, and the target group deregistration delay. We had to do a deep dive to understand what all these settings do and how to configure them correctly.

What we ended up settling on is driven by the fact that we have a maximum timeout of about five minutes on most of our workloads (on some workloads we do allow longer timeouts). The goal is that requests stop being sent to the gateway once a pod is being deregistered, and that in-flight requests have enough time to finish before we start terminating the pod and its containers. A sketch of how these settings fit together follows.
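Here is a rough sketch of how those settings can fit together. The numbers are illustrative, not our production values; the annotations shown are the standard AWS Load Balancer Controller and Istio ones, and your ingress setup may differ.

```yaml
# Illustrative graceful-shutdown settings (example values only).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
      annotations:
        # Per-pod ProxyConfig override: keep istio-proxy draining connections
        # long enough for the longest in-flight request to complete.
        proxy.istio.io/config: |
          terminationDrainDuration: 300s
    spec:
      # Must cover the preStop sleep plus the longest in-flight request
      # (most of our workloads allow up to ~5 minute timeouts).
      terminationGracePeriodSeconds: 360
      containers:
      - name: gateway
        image: example/api-gateway:latest
        lifecycle:
          preStop:
            exec:
              # Keep serving while the load balancer and endpoint controllers
              # stop routing new requests to this pod.
              command: ["sleep", "30"]
---
# On the ALB Ingress (AWS Load Balancer Controller): align the idle timeout with
# the request timeout and give targets time to drain before removal.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
  annotations:
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=300
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=300
spec:
  ingressClassName: alb
  defaultBackend:
    service:
      name: api-gateway
      port:
        number: 443
```

The ordering that matters: the load balancer stops sending new requests and drains existing ones within the deregistration delay, the preStop sleep keeps the pod serving through that window, and the termination grace period (together with the proxy's drain duration) must outlast the longest in-flight request.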
The next one is configuration overload. As I mentioned, we have a lot of services, and the Intuit API gateway runs across a lot of clusters. We have partitioning built into our data plane, but not in the Istio world. Without partitioning on the Istio side, all of the configuration is delivered to every single gateway, even though no single gateway talks to every single service. This was an area of improvement for us: we were seeing a lot of data transfer from istiod to the Istio proxies, it was going cross-AZ, and during scale-up we even saw CPU spikes because of it. To resolve this, we implemented partitioning logic in Admiral so that each API gateway only receives the config it needs. Admiral knows about the partitioning of the data in our control plane registry, and it uses that to make sure a gateway doesn't load the config for every service in the mesh.

Next we have outlier detection. Istio has a nice feature called outlier detection, a circuit breaker that tracks the status of every host so that if a host becomes unhealthy you can stop sending it traffic. However, a single gateway pod talks to up to 500 different services, so each service it talks to might only receive a few requests, and it's hard to set a failure count that will actually trip the circuit breaker. What happens is that a back-end host can be unhealthy, we don't know about it, and we keep sending it requests. This is still a work in progress for us; we need to extend and optimize outlier detection for this kind of workload, and we're exploring active health checking or some form of global circuit breaker, because when a back end is unhealthy we obviously want to stop talking to it.

Next, autoscaling and resource limits. This is a common problem with Kubernetes in general, and we faced additional issues with the Istio proxy here. Our gateway has to scale rapidly to handle surges of traffic and load tests; we can't really control the traffic coming in, so our HPA has to be very well tuned. We originally went with the standard HPA implementation and saw that it was either over-scaling or oscillating. We also saw challenges with the scaling metric: we were scaling on the average CPU of the containers in the pod, which was the default for our workload, and that wasn't a great metric, because the Istio proxy's CPU is fairly low while our application's CPU is fairly high, so the average wasn't high enough to trigger scaling. The lesson is to understand the metrics and decide what you want to scale on: the application container's CPU, the pod's average CPU, or the Istio proxy's CPU. In our case we now use multiple metrics, both the pod-level average and the application container's CPU.

Another thing is that we weren't happy with the HPA behavior for our workload, so we built an extension of it called step scaling. It generates a synthetic metric based on how far we are above the target CPU: for example, if we're 9 to 13 percent above the target, we report our CPU as 25 or 50 percent higher, so we scale more aggressively, but not too aggressively, based on how far over the target we actually are. With this implementation we have much more control over our scaling; we scale just as much as we need and avoid flapping up and down.

The last thing is CPU throttling. We deployed this API gateway on Kubernetes with Istio in production, serving all these requests, and overall everything was going well, except two of our clusters had a problem: things were slowing down, latency was being added, and we weren't sure what was going on. Then we found out it was CPU throttling. We didn't have that visibility, so we had to add it to our dashboards and understand why throttling was happening and whether the requests and limits were set correctly for this workload. We had to test different setups to see what works best, and we're still going through that process. Part of this, again, is how you set the limits on the Istio proxy; it's challenging enough for one container, and now you have another one. So far we haven't seen CPU throttling on the Istio proxy, but it's something we're keeping an eye on, and on some workloads we did see higher load on the Istio proxy, so we had to increase its requests and limits on that specific cluster. A sketch of the scaling and proxy-resource settings follows.
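Here is a sketch of the two standard levers behind this, with illustrative names and numbers (our step-scaling extension generates its own synthetic metric and isn't shown): an autoscaling/v2 HPA that scales on the gateway container's CPU in addition to the pod average, and the standard Istio annotations for overriding the sidecar's requests and limits per workload.

```yaml
# Illustrative sketch: scale on the application container's CPU, not only the
# pod average that includes istio-proxy. Names and numbers are examples.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 10
  maxReplicas: 200
  metrics:
  - type: Resource              # pod-level average CPU across all containers
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: ContainerResource     # CPU of the gateway container specifically
    containerResource:
      name: cpu
      container: gateway
      target:
        type: Utilization
        averageUtilization: 65
---
# Per-workload overrides for istio-proxy requests/limits via standard Istio
# annotations on the pod template; application resources set as usual.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
      annotations:
        sidecar.istio.io/proxyCPU: "500m"
        sidecar.istio.io/proxyCPULimit: "2"
        sidecar.istio.io/proxyMemory: "512Mi"
        sidecar.istio.io/proxyMemoryLimit: "1Gi"
    spec:
      containers:
      - name: gateway
        image: example/api-gateway:latest
        resources:
          requests:
            cpu: "4"
            memory: 4Gi
          limits:
            cpu: "8"
            memory: 8Gi
```

Since the HPA acts on whichever metric implies the most replicas, the container-level metric can trigger a scale-up even when the sidecar's low CPU pulls the pod average down.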
So, an overview of the pitfalls and learnings on our Istio cloud native journey: the sidecar hybrid mode, graceful deregistration, configuration overload, outlier detection, setting requests and limits correctly, and our autoscaling. With that, I'll open it up for questions. Please follow our open source community; we have a booth with swag, and on the right-hand side there's a QR code for feedback on this presentation as well as a link to our open source Admiral for more details on our multi-cluster service mesh solution. Any questions?

Yes, I'll repeat the question: it's about our certificate management for mTLS and how we do that. We do have an in-house certificate provider, and we also have an in-house agent, a container, that communicates with it.

I think it's turned off. I believe we're using IP mode... yes, it's IP mode, and that has always been the case. Thank you.

The next question: what did the journey look like going from the in-house API gateway to the new Istio-based gateway? What did we look into, and what tools did we use to be confident it was safe to transition, given that the new gateway is in production? So, just to clarify, we moved our existing gateway onto Kubernetes, and we're using Istio in collaboration with it. Overall the journey was fairly smooth aside from the issues I covered. We did a slow rollout across all of our workloads, starting in pre-production; it was a multi-month migration. I think we're now at a place where we have much better security with our back-end services: we're able to use TLS rather than the custom JWT-based authorization and trust mechanism we'd built earlier. One of the great benefits is that we now have Istio metrics and logging on the destination side. A lot of the time the destination services would come to the gateway team and say, "my service isn't working, is it because of the gateway?", and we wouldn't have much insight into what was going on on their end; they would have to show us, and then we'd point out that their service was unhealthy. Now, with the Istio metrics, we have a uniform way to understand what's going on in the service and troubleshoot those issues.

Next question: on the outlier detection slide I talked about circuit breaker patterns, and the question is how we achieve multi-tenancy and isolation for the different services. Definitely. We have isolation at the cluster level as well as the namespace level; as I mentioned, we have multiple clusters and multiple namespaces. Within a namespace we do serve traffic for multiple services, up to maybe five or six hundred services in some cases and only five in others, depending on the workload, so there is some noisy-neighbor risk there. But we do have that level of isolation, and we have thread pools and connection pools set up to help isolate some of the traffic. However, we have seen issues where one unhealthy service can spike the CPU and impact other services, so that's also something we're continually working on improving. Thanks for the question.

And with that I'm out of time, so thank you everyone.