Hi everyone, thanks for coming to our talk today. We'll be talking about the trials and successes of adopting Envoy at Tinder. I'm Yuki, an engineering manager at Tinder on the service mesh migration team, and this is Cooper. I'm Cooper, a senior site reliability engineer at Tinder, and we've been working together for around three years. So this is the overview of what we're going to cover today: why Envoy, some of the migration hurdles, some of the human costs of the migration. Cooper's got some nice before-and-after mesh charts for you, Grafana metrics of what our services looked like after the mesh migration. We'll talk about how we did the network tuning, the nitty-gritty retry and timeout tuning; observability, so how we incorporated Envoy into our Prometheus stack; some war stories, things we figured out only after working with Envoy for a pretty long time; how we do our Envoy configuration; a bit about our control plane; and lastly the future work.

So today we're fully on Envoy and Kubernetes. The full migration took about a year: we started around early 2019 and finished in early 2020. Right now we have 1.6 million RPS meshed in our infrastructure (that's internal and external requests), over 200,000 containers, 20,000 pods and 500 services. And this is an overview of how the infrastructure is laid out. You have the user on one end using the Tinder app, and requests from the app go to one of three Kubernetes clusters, with each cluster sharded to an AZ. So the cluster at the top, cluster A, is entirely in the 1a availability zone in AWS. It's wholly self-contained and all its requests get processed within that cluster, which gives us a lot of savings on cross-AZ costs as well as latency. And as you can see, we've put Envoy into pretty much every part of our stack. At the ingress layer we're using Envoy, and within the Kubernetes cluster we're running Envoy as a sidecar in all the pods. We have some ELBs, ALBs and NGINX lying around, but the core network logic is implemented in Envoy. So why Envoy? It's got a lot of great features. We didn't use all of these features right at the beginning, but today we use almost all of them: multiple load balancing algorithms, retries, circuit breakers, timeouts, and the various database filters, which we use for observability. It gives us metrics for free, and then some tracing. So Cooper's now going to talk about where we came from.

Yeah, so before Envoy we used really conventional Kubernetes networking. We ran Flannel for our pod-to-pod network and we used regular Kubernetes services, you know, ClusterIP and NodePort, with a couple of headless services around and very few ELBs, but mostly standard ClusterIP. This worked really well out of the box and helped us scale, but we eventually ran into performance issues. Our clusters were pretty large at the time, around 4,000 nodes per cluster, and Flannel started showing really, really high CPU usage. We also saw stale TCP connections with IPVS, and we had hanging connections, which caused undue load. We ended up with a lot of uneven load on target pods and target services, which really pushed us to go to Envoy. Next. So we started the migration to Envoy in 2019. We had a lot of services still on the old setup, plain ClusterIP services, and all new ones using the mesh.
And so we ran a hybrid model for quite a while. This worked relatively well; there were some teething issues with that, which we can get into later. But once we started getting onto Envoy, there was no turning back. It was so much better. Our observability increased massively, our traffic distribution to pods was significantly better, it was much easier to switch between load balancing types (round robin, Maglev, et cetera), and even latency just magically got better.

Yeah, so some of the migration hurdles we went through. We started with basically zero Envoy knowledge across the entire org; we just knew what was out there in the docs and the blogs, which was pretty sparse because it was new technology. But we did end up learning it pretty well over the last two years, and we migrated over 300 services in about a year. It started with one person working on it, but by the end we had six to ten people working on it full-time, so it was definitely a team effort. There were multiple people involved: people integrating it into Kubernetes, a team working on observability and the Prometheus stack, a team working on configuration automation and integrating it into the CI/CD stack, the migration of the services themselves, people working on the control plane, and people working on the network tuning. And of course the service owners were initially hesitant about the migration, since it's definitely a time sink, but as people saw the benefits they started to volunteer, because there were so many things you got for free. The process became really streamlined and pretty low effort; it didn't require someone with deep Envoy expertise to babysit it, and today every service has service mesh enabled by default unless there's a very good reason not to. I don't think we have a single one without the mesh now. And then of course the critical high-traffic services were pretty hard to migrate. We spent a couple of weeks just tuning timeouts and retries for certain services, so there was a lot of babysitting on the high-RPS ones. Currently we have 500 microservices in the mesh and 20,000 pods migrated, and every request at Tinder now goes through the service mesh.

All right, let's look at some graphs. Okay, so before and after mesh: these are some old graphs I pulled from 2019, so bear with me if some aren't perfectly clear, they're kind of old. Our latency was so much better after the cutover. I added a nice little green line so you can tell before and after, if you can't see it closely, but you can see that our P99 latency took a massive downturn afterwards. Much of this originally was just due to better pod distribution. Before, as I said, with IPVS and Flannel we would have one pod be super hot and skew the P99 up, but Envoy really brought the load distribution to be way more even. CPU usage, following on from that point about load distribution, also became significantly better for the called service. This was while we were only partially meshed; at this point about 50% of the callers were meshed, and you can see the per-pod CPU distribution become much tighter and more coherent as time went on, especially after that first major cutover, which I think was around 30% of its calls.
So how did we get there, and what did we do to get better? In addition to just Envoy, we started adding timeouts and retries. It's not that we didn't have these before, but they weren't very consistent: we had modules in our Kubernetes cluster whose own code handled retries and timeouts per application. We had to remove that, because you cannot have per-application retries and timeouts on top of Envoy's; you get a huge, uncontrollable, potentially bad situation if you do. If you have improper timeouts, you either cancel calls prematurely or, potentially worse, you get a retry storm, where one retry causes two more retries and so on. So we had a coordinated effort between back-end code owners and operations to clean up timeouts and make sure we disabled application-level retries. With some legacy code, the owners didn't even know they had retries, so it really was a drill-down through a whole bunch of code to figure out what exactly was there and what needed to be disabled. But once we fully got everyone off of that and moved timeouts and retries into Envoy, it was significantly better: it's an easy, centralized place, and we can tweak it far more easily than having it in code. In addition to timeouts and retries, we also do egress lists via Envoy, which locks our mesh network down securely.

Cool, so in addition to the load balancing gains I showed you before, we also did a lot of tuning on timeouts and retries. We didn't quite get to hedged retries, but just cutting off the top of the P99 and retrying sooner really helped the overall P99. What I mean is, let's say we know a service should come back in 500 milliseconds and we have errant pods maybe taking two seconds: we just cut the request off at 500 or 600 milliseconds and retry, and that retry will usually come back faster than waiting for the bad pod. It really sliced off the top of the P99.

So one of the things that I love about the mesh, besides just the resiliency (and I'm kind of an observability nerd), is that it gave us so many metrics. So on the old side... I can't read that slide, it's pretty dark in here. Okay, thank you. So on the old side of our network, we had some HTTP metrics given to us by individual modules, but again they were not consistent or unified across everything. Between Node, between Go, between Java, even between teams, we did not have centralized HTTP metrics. Let's say we had about 30% coverage in our Kubernetes cluster, which is a massive monitoring gap. With Envoy, immediately, as soon as we onboarded even a couple of services, we got great metrics. We had HTTP codes per route, we had latency, we had retries, we had adaptive concurrency metrics, we had everything we ever wanted, even per pod. It was great. There was no longer wondering, oh, the service is getting some 503s, and only later finding out that it's one pod; we know exactly which pod it is now, through a standardized method. We use Grafana heavily, we're a very Prometheus-focused organization, and we pipe all these metrics into our Prometheus and alerting setup. We do really cool things, like canary dashboards and canary deploys that use Prometheus queries over Envoy metrics: they can tell almost instantly that the pod that just went out is worse than the deploy that was already running. This one is very hard to see, but if you look at the top right, we can see that the canary is doing about 30% better than the last one.
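Just to make that timeout and retry tuning concrete, here's a minimal sketch of the kind of per-route Envoy config it boils down to. The field names are standard Envoy route configuration, but the cluster name and all the numbers are purely illustrative, not our real settings:

```yaml
# Sketch only: a route with a hard overall timeout and one fast retry.
# "example_service" and all durations are illustrative values.
route_config:
  virtual_hosts:
    - name: example_service
      domains: ["*"]
      routes:
        - match: { prefix: "/" }
          route:
            cluster: example_service
            timeout: 1.2s              # overall cap on the request, retries included
            retry_policy:
              retry_on: "5xx,reset,connect-failure"
              num_retries: 1           # keep this small to avoid retry storms downstream
              per_try_timeout: 0.6s    # cut off a slow attempt near the expected latency and retry
```

The idea is that the per-try timeout sits just above the service's expected response time, so an errant pod gets cut off and retried instead of dragging the P99 up, while the overall timeout still bounds the whole request.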
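And those canary comparisons are really just Prometheus queries over the sidecar metrics. As a rough sketch, a recording rule along these lines compares 5xx ratios per deploy track; the metric names assume Envoy's Prometheus stats endpoint, and the "track" label is an assumption standing in for however you distinguish canary pods from stable ones in your scrape config:

```yaml
# Sketch only: compare canary vs. stable 5xx ratios from Envoy sidecar metrics.
# Metric and label names depend on your stats sink and relabeling; "track" is
# an assumed label (e.g. canary / stable) added at scrape time.
groups:
  - name: canary-comparison
    rules:
      - record: deploy:upstream_5xx_ratio:rate5m
        expr: |
          sum by (track) (rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m]))
            /
          sum by (track) (rate(envoy_cluster_upstream_rq_total[5m]))
```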
So we have all these really great mesh metrics that allow us to do good deploys and good observability. Another fun thing we do with the mesh metrics and standard Kubernetes metrics, like cAdvisor and kube-state-metrics, is make standardized dashboards per team. What I mean is, a team no longer has to go make its own dashboard just to track its metrics. We have an automated process: if you make a new module, you get PagerDuty alerts, you get Slack alerts, you get everything you want, and it's fully based off the mesh metrics and the Kubernetes metrics. Teams can have their own dashboards if they want, and we encourage it, but everyone gets these base dashboards. We also have filters for DynamoDB, Redis and Elasticsearch, which are tweaked HTTP filters, so we get really good insights and metrics on those services as well when they're called from mesh services.

Okay, so what didn't go well? As I mentioned earlier, the timeouts and retries needed to be locked in, and when they weren't it was kind of rough. Since we're a microservice organization, a single request makes many calls, from microservice A to B to C to D, and if your timeouts aren't nested correctly, and especially if you don't have cancellation propagation in your code, you could potentially end up with something like 13 tries for every one try you start at the beginning. So you need to make sure, all the way down the chain, that each timeout is encapsulated by the one before it; otherwise you can do way more retries than you could ever think possible.

So we did the filters, we did all this. What else did we learn? We learned that you'd best make sure that Envoy is the first thing to start and the last thing to stay alive. Since we're doing egress and ingress through Envoy, if Envoy isn't up when the app starts, and isn't still up when everything goes down, you will drop connections and drop requests. If Envoy goes down first when both containers get the SIGTERM, when the pod gets terminated, the in-flight requests on the module get dropped. Not only did we find out we needed to do that, we also found out that, across the cluster in general, it was nice to catch the SIGTERM and keep a terminating pod around for at least 10 to 15 seconds before actually shutting down. Whether you're using old-school IPVS or iptables services with kube-proxy, or the newer EDS or xDS API servers, updates aren't always instantaneous, so just stay alive a little bit longer to let the callers that are slow to update catch up. We originally didn't do this and it caused a whole bunch of issues with deploys. What I'm talking about is, on Kubernetes, when a pod gets terminated or rolled, the termination sends a SIGTERM down to all the containers in the pod. If you start shutting down immediately, even assuming your code handles graceful shutdown properly, you're probably gone in about a second or so, even if you do a pretty diligent job. We learned that even after that second there were still callers, updating from various sources, that took a little while to catch up, so that dead pod would still be getting calls. To combat that, as I said, we added a 10-to-15-second wait on that SIGTERM. So even though those pods were marked as terminating, they stayed healthy enough to handle traffic for another 10 or 15 seconds and really just mop up the last few requests. That brought our 500-class codes during deploys down by almost 100%, so we have a truly, really easy, fun deploy setup.
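For reference, here's a minimal sketch of that stay-alive delay, assuming you wire it up with preStop hooks rather than in application code. The sleep lengths and grace period are illustrative, and the slightly longer Envoy sleep is just one way to make the sidecar outlive the app container:

```yaml
# Sketch only: delay shutdown so slow-to-update callers stop sending traffic first.
# Values are illustrative. preStop runs before SIGTERM reaches the container, so
# the pod keeps serving for the duration of the sleep after termination starts.
spec:
  terminationGracePeriodSeconds: 45      # must cover the preStop sleep plus app drain time
  containers:
    - name: app
      # ...image, ports, probes...
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]
    - name: envoy
      # ...sidecar image and bootstrap config...
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 20"]   # sidecar hangs on a bit longer than the app
```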
Hey, so I'm just going to talk about global rate limiting, since I think it's a fairly unique Envoy feature: it requires an external service as well as a Redis backend, which makes it a little different from the other Envoy features. And this was also a pretty big win for us, because previously various backend teams would write their own rate limiting logic, with their own server and their own database to track the rate limiting state. With Envoy, we were able to consolidate all of that into the Envoy network layer. So we have global rate limiting, not per server or per service, and granular configuration: we can rate limit on one or more headers. And of course we got monitoring, alerting and visibility into the rate limiting, which was often an afterthought when people hand-rolled their own rate limiting services. So that was great. Today at Tinder it prevents system failure, stops fraud, and gives us a lot of cost savings from not having to process those fraudulent requests, and it polices about 200,000 RPS at Tinder. We can base it on user ID, IP, any kind of header, so it's very useful. And we have two rate limiting clusters, an internal one for internal requests and an external one for external requests, each with its own Redis backend and rate limit service, and the Redis requests are even proxied through Envoy. This is a cool little chart I was able to make with the observability Envoy gives us: we can see where all the rate limiting is happening. You can see some pretty big dots in the Middle East, and a lot of the attackers come from Europe, I think because the IP laws are a little looser than in the US, and you can also see some pretty big dots around where the AWS data centers are, like Virginia and the West Coast.

So, Envoy configuration. I'm just going to talk a little bit about that, because it can get pretty complex and long; you could have a thousand-line YAML, and we needed a way to surface the minimum number of knobs to service owners who weren't really knowledgeable about Envoy but still wanted to tune things like timeouts or retries themselves. So we created an in-house DSL. The service owners just write a short YAML snippet, and it gets translated into an Envoy config. How it basically works is that a Jinja template gets rendered into an Envoy config YAML, which gets baked into a ConfigMap that's rolled out to our Kubernetes cluster. That lets us empower the owners, and they can do self-service mesh tuning this way. The DSL supports retries, timeouts, rate limits, tracing, outlier detection, IP restriction, direct responses and the database filters, and we're adding more as Envoy features come out. The only thing not in this DSL is the endpoints, because those are supplied via the EDS control plane, which I'll talk about on the next slide. And we have a regular GitOps flow: a PR gets approved, merged, and then deployed via something like Jenkins into Kubernetes.

So, the control plane we have right now: we implemented just the minimum, the endpoint discovery service; we didn't implement the whole xDS feature set. The endpoint discovery service was actually sufficient for us. It's built on go-control-plane, and how it works overall is relatively simple: it watches Kubernetes for pods, any new pods, deleted pods or changed pods, and the pod IPs for each namespace are stored in memory.
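Quick aside before finishing the control plane story: to make the rate limiting piece from a minute ago a bit more concrete, the rules live in the external rate limit service as descriptor configs, roughly like this. The domain, keys and limits here are just illustrative, not our production values:

```yaml
# Sketch only: an envoyproxy/ratelimit-style config; domain, keys and numbers
# are made up for illustration.
domain: edge_external
descriptors:
  - key: user_id              # limit per unique user ID, taken from a request header
    rate_limit:
      unit: minute
      requests_per_unit: 300
  - key: remote_address       # limit per client IP for unauthenticated traffic
    rate_limit:
      unit: second
      requests_per_unit: 20
```

On the Envoy side, route-level rate limit actions such as request_headers and remote_address are what turn an incoming request into those descriptor keys before it's checked against Redis.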
So, back to the control plane: on any delta, we push the new pod IPs out to the subscribed Envoys, and the egress clusters are defined in the ConfigMap. Some custom functionality we added there is that if an Envoy in 1a requests endpoints, it only receives endpoints for 1a. So we were able to implement locality-based routing, again just to save on cross-AZ costs and latency. Yeah, and then what we're working on this year is something called FireMesh. We just like to go hard on the fire project names at Tinder. We're implementing a full xDS control plane with all the discovery services implemented: on top of the endpoint discovery service, we're adding the route discovery service, cluster discovery service, secret discovery service and the listener discovery service, and we're trying to roll it out by the end of the year. So if Tinder goes down in December, you know what happened. And of course we'll be piping it all through ADS, the aggregated discovery service. We're using S3 for storage and Kafka for messaging. This unlocks things like header-based routing and percentage-based traffic routing, which you can do without all these xDS APIs, but it's a lot more cumbersome because you need to redeploy the pods or the ConfigMaps and wait for those to propagate. With xDS it's very low touch: instant rollbacks and config changes.

Yeah, so just in summary: the service owners are very happy, everyone at Tinder is very happy. We have this implemented, and we have a holistic platform that integrates with the monitoring stack, the Kubernetes stack and the CI/CD stack. I think the biggest win is just immensely less code to maintain across the org, from the network automation, observability, et cetera; we moved all of that from the application layer to the Envoy layer. Some of the future work we're going to be doing: ADS, as I just talked about, delta xDS, Envoy Mobile for iOS and Android, investigating some custom Lua and Wasm filters, and then some request tagging. And that's it. Thanks for watching. Yeah, do we have time for questions? Yes, any questions? Please raise your hand. Where are you? So he asked why we built our own control plane versus using Istio. The answer is that we started building this before Istio went 1.0; that was really the main thing. And once we got up to velocity, it was just too painful to switch over once we had our system flowing pretty much where we wanted it. But yeah, Istio is great. If you're just starting out, I fully recommend it. Anyone else? Okay. Cool. We'll be on the Slack if anyone has any questions for us there. Thank you.