Hi everybody, and welcome to "How Niantic Switched Pokémon GO to Envoy." My name is Lanna Nyakobi and I'm a server core infrastructure tech lead at Niantic.

Who are we? Niantic is a leading AR technology company focused on exploring new applications for advanced hardware, wearable devices in particular, that merge the digital world and the physical world. Our mission is based on three key product principles: exploration and discovery of new places, exercise, and real-world social interactions with other people. This mission has been our compass. It paved the way for our consumer AR games, Ingress, Pokémon GO, and Harry Potter: Wizards Unite. That's what we are arguably best known for, but we're a whole lot more than that. Underpinning these games is a powerful technology platform, the Niantic Real World Platform, which we've been investing in and evolving from the very beginning. The Niantic Real World Platform also supports social features, mapping, and advanced AR, all the components needed to develop and publish a real-world AR game. As our platform grew to offer additional components and new games were published, we needed to adjust our infrastructure to support the new workloads and complexity.

Let's review our journey. It took two parallel routes: one for unifying our service-to-service connectivity, and the other to simplify and support our edge proxy. Let me start with the simpler one, how we unified our service-to-service connectivity and secured our backend services.

In the beginning, our configuration was simple: NGINX deployed as our edge proxy, routing traffic to the game servers. But as new services were developed and deployed, different routing configurations were created and defined by the dev teams. Some of those connections were passing through the front door, using regional IPs to avoid hitting the public internet. Others were using external services or infrastructure like VPC peering, which still required a gateway to be defined in the cluster. And we even had services sending data to Dataflow to update the games' data. This created several problems that we needed to address: the need for developers to add security and instrumentation in code, our inability to configure routes dynamically, a lack of observability and alerting if instrumentation wasn't added to the code or on whether the services were indeed connected, and lastly, the need for new SREs joining the team to learn all of these technologies while having very little control over them, resulting in a maintenance burden.

As a result, we focused our efforts on finding a solution that would isolate the connectivity configuration from the application and enable us to address the rest of the issues. We decided to adopt Envoy plus SPIRE for our cross-GCP communication. The flexibility provided by this pair enabled us to support our different communication models, from duplex mode to either pull or push, and unified security with federation for games and services. By creating a pool of gateways, we were also able to shift all services from using the edge proxy to using their own dedicated, more secure endpoints. We named this new project Service Gateway, mainly to distinguish it from "service mesh," which had become a synonym for Istio in the company and caused a lot of confusion. What we loved about this approach was that there was no need for application code changes; we only had to do cleanup. In addition, the solution was cloud native and ensured high availability. We used Gradle scripts with Helm charts to deploy and configure Envoy and SPIRE.
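To give a flavor of what this pairing looks like in practice, here is a minimal, hypothetical sketch of an Envoy upstream cluster that fetches its mTLS identity from the local SPIRE agent over the SDS API. All names, the trust domain, and the socket path are illustrative rather than our actual configuration, and it is written against today's v3 API rather than the version we started with:

```yaml
static_resources:
  clusters:
  - name: backend_service                      # hypothetical upstream service
    type: STRICT_DNS
    connect_timeout: 1s
    load_assignment:
      cluster_name: backend_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: backend.internal.example, port_value: 8443 }
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          # Client certificate (SVID) is served by the SPIRE agent over SDS.
          tls_certificate_sds_secret_configs:
          - name: "spiffe://example.org/frontend"
            sds_config:
              resource_api_version: V3
              api_config_source:
                api_type: GRPC
                transport_api_version: V3
                grpc_services:
                - envoy_grpc: { cluster_name: spire_agent }
          # Trust bundle used to validate the server side, also from SPIRE.
          validation_context_sds_secret_config:
            name: "spiffe://example.org"
            sds_config:
              resource_api_version: V3
              api_config_source:
                api_type: GRPC
                transport_api_version: V3
                grpc_services:
                - envoy_grpc: { cluster_name: spire_agent }
  - name: spire_agent                          # local SPIRE agent exposing SDS on a Unix socket
    type: STATIC
    connect_timeout: 1s
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: spire_agent
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              pipe: { path: /run/spire/sockets/agent.sock }
```

The nice property is that certificate issuance and rotation stay entirely outside the application: the proxy asks the SPIRE agent for its identity and trust bundle, and the services themselves never touch key material.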
Our measured performance was one millisecond of added latency, with a maximum measured load of 20,000 requests per second per virtual CPU, which was pretty good.

One question you're probably wondering about is why we didn't adopt a service mesh implementation for the orchestration and management of the mesh. In one word: maturity. Looking at the CNCF landscape map under service mesh, these are the graduated projects, and these are the incubating projects, but most of the projects are still in the sandbox stage or not yet added to the map. When we started our journey, we investigated using Istio but decided to hold off for now due to its complex configuration and breaking changes between versions. Looking at our next steps, with a future of increasing numbers of backend services and games, it is obvious to us that we will need a solution for orchestration and management at scale. What we liked about shifting to Envoy was that most of the tools under development were using Envoy as the proxy, so migrating to them should be an incremental step, and we still had the option of using xDS in the meantime as our management tool.

Let's review now our more interesting journey, with Envoy as an edge proxy. I mentioned before that our game cluster deployment is simple, but while the deployment is simple, there were several things we wanted to address. Improving the user experience when scaling was one of the goals, and I will elaborate on what this covers and why in a minute. But besides this issue, we also wanted to improve our monitoring and alerts, unify our tooling and data plane as much as possible, and lastly, support the company's goal of enabling external developers to create new games without exposing our IP.

I would like to focus for a moment on the goal of improving the user experience when scaling, mainly because this part of our architecture caused us some trouble. Here, the big question of course is why scaling created a bad user experience. The problem lies in the proxy configuration and ConfigMap limitations. Our proxy configuration is tied very closely to our IP. For every incoming request, the proxy checks whether the request URL has a prefix of an opaque ID. If one is found, the proxy matches that ID to a game server, removes the prefix, and routes the request accordingly. Requests without an ID are round-robined between the available game servers; we refer to this specific part of the configuration as the catch-all. In addition, our protocol supports both HTTP and WebSocket requests. As a result, our configuration file can't fit into a Kubernetes ConfigMap and is deployed inside the proxy image. This is true for both NGINX and Envoy.

Looking at the process of scaling up, we first increase the number of game server replicas. When all instances are up and healthy, we update the configuration of the routes, leaving the catch-all the same, create a new image, and update the deployment. This step can be disruptive to clients that are currently connected through WebSockets or pending a response from the server. When all pods are updated and healthy, we update the backend with the number of available game servers and update the proxy configuration again, creating a second wave of disruptions. The scaling-down process is very similar, only in reverse order.
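To make the routing scheme I just described a bit more concrete, here is a minimal, hypothetical sketch of what such routes might look like in Envoy terms. The opaque IDs and cluster names are invented, and our real file also carries WebSocket upgrade settings and much more:

```yaml
route_config:
  virtual_hosts:
  - name: game
    domains: ["*"]
    routes:
    # One block like this per game server: an opaque ID prefix pins the
    # request to a specific game server, with the prefix stripped first.
    - match: { prefix: "/a1b2c3/" }
      route:
        cluster: game-server-7
        prefix_rewrite: "/"
    # ...one such block per game server, which is why the file outgrows a
    # ConfigMap and ends up baked into the proxy image.
    # The catch-all: requests without an ID are round-robined across all
    # available game servers.
    - match: { prefix: "/" }
      route:
        cluster: game-servers-all
```

Because every change to this mapping means building a new image and rolling the proxies, any scale event turns into a configuration redeploy, which is exactly the disruption I just described.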
Well, xDS to the rescue. Using a custom xDS operator, we were able to scale the clusters up and down without the need to restart or redeploy them. The operator receives the requested cluster size through a Kubernetes custom resource, updates the configuration, and passes it to Envoy. We are currently in the process of deploying this new operator. Our next step, after verifying that the operator is stable and working as expected, will be to integrate it with the game servers' existing drain feature for a seamless scaling experience. This is an overview of our new deployment.

After verifying we could achieve our goals using Envoy, we started to plan and execute the migration from NGINX to Envoy. Our high-level plan included running load testing to verify the performance and provisioning of Envoy; testing the migration scenario, runbooks, and rollback in the load-test environment; deploying the new Envoy instances; and shifting traffic from NGINX to Envoy. Sounds easy, right?

We started with load testing, using a Pokémon GO load-testing setup and runners. We used the latest Envoy version, which was 1.12.2, and started with a basic environment of X game servers, X caching services, one proxy, and a load of hundreds of thousands of requests per second. The results looked great, except for some pesky 503s that surfaced from time to time.

We had to make a couple of changes to our production environment to be able to shift the traffic from NGINX to Envoy. The first change was to separate the public traffic from the backend services I mentioned before. Since we hadn't onboarded the service gateway yet, we took the approach of simply duplicating the NGINX configuration and creating a dedicated pool for them. We then had to replace the existing external LoadBalancer service with a NodePort service, with externalTrafficPolicy set to Local instead of Cluster. This allowed us to control the traffic and monitor both proxy pools while serving on the same port, which is a GCLB limitation. We also switched from the old ReplicationController to a Deployment object, which gave us an opportunity to do a simplified run of the traffic shift. Lastly, we switched from utilization-based load balancing to count-based load balancing, due to the difference in resources between NGINX and Envoy. We copied the already prepared Grafana dashboards and created a new pool for the Envoy fleet. One of the features we couldn't find a replacement for, by the way, is the IP denylist, and we had to replace it with Google Cloud Armor. If there is an existing Envoy filter that provides this functionality, we would love to hear about it, but if not, we will probably write a custom filter to provide it; we're becoming very good at writing those.
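Going back to the Service change for a second, here is a rough sketch of the shape of it. Labels, names, and ports are illustrative, not our actual manifests:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: edge-proxy-envoy          # a second, parallel Service fronts the NGINX pool
spec:
  type: NodePort                  # the GCLB targets the node port instead of a
                                  # Kubernetes-managed LoadBalancer Service
  externalTrafficPolicy: Local    # was Cluster; Local keeps traffic on the node
                                  # that received it instead of hopping across nodes
  selector:
    app: envoy-edge               # hypothetical pod label for the Envoy pool
  ports:
  - name: https
    port: 443
    targetPort: 8443
    nodePort: 30443               # both pools sit behind the same external port
```

The practical effect is that nodes without a local proxy pod stop receiving traffic for that pool, which is what let us control and monitor the NGINX and Envoy pools independently while they shared the same port.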
Well, the big day arrived and we started shifting the traffic from NGINX to Envoy. Well, we started deploying. The traffic-shift plan was to first increase the number of Envoy replicas in the pool and, when reaching full capacity, decrease the number of NGINX replicas until hitting zero. We started by deploying a single Envoy instance. It had been there for a couple of minutes when we started seeing an increase in 503 errors, and then it crashed. We immediately scaled down to zero.

This is just the short version, but after several days of investigation and digging through the Envoy and GCLB documentation, we found the following. The main reason for the Envoy crash was that we were hitting it with 20 times the expected load. On top of that, it was deployed on a pod with limited resources, which might have been a contributing factor, but fortunately, after correcting the traffic distribution to 2,000 requests per second, we never encountered this problem again. The 503 errors were easier to understand but harder to find, as they were the result of a discrepancy in the timeout for idle connections, causing GCLB to return 503s for connections that had already been terminated by Envoy. Fixing this decreased the 503 errors to a normal level.
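For anyone curious what that kind of fix looks like, here is a hypothetical sketch rather than our exact values, assuming the mismatch was the downstream connection idle timeout on the Envoy side. Google's guidance is that backends behind its HTTP(S) load balancer should keep idle connections open longer than the load balancer does (600 seconds by default), so the load balancer never tries to reuse a connection the backend has already closed:

```yaml
# Inside the HttpConnectionManager configuration of the edge listener
# (illustrative value, not our production setting):
common_http_protocol_options:
  idle_timeout: 620s    # longer than GCLB's 600-second idle timeout
```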
As it usually happens, we were ready for a second try only two weeks later. We fixed the configuration, scaled up to one instance, and started monitoring the service. It was happily running for a day with no problems, so we decided we were ready for the next step, scaled up to two Envoy instances, and entered a new era of multiple Envoy instances in production. Over the next two weeks, we slowly, slowly scaled up the Envoy fleet, from two to five, from five to 50 percent, and so on, until we hit 100 percent. When we hit 100 percent on the Envoy deployment, and after verifying all was okay, or at least that's what we thought, we started scaling down the NGINX fleet.

In parallel to this migration, new features and services were also being added to the environment, creating problems of their own, so when problems started hitting us, we didn't recognize their source. Everything was looking good, except for occasional 503 errors. We also noticed an increase in NXDOMAIN responses from kube-dns that started when we deployed Envoy. We had no success tuning kube-dns, and had little information to understand the source of the problem. Like any other team, we were busy, and with little information or access, this turned out to be the status quo for four months, until GO Fest arrived.

If you're not aware, GO Fest is our biggest event of the year. When we started to scale the number of game servers in preparation for the event, we started receiving reports of an increase in 503 errors. This happened immediately after deploying the Envoy images with the new configuration; if you remember, that is step two in the scale-up scenario, before we had even tried to scale Envoy itself. But hoping this might be related to load, we increased the Envoy fleet size. We didn't know it at the time, but this actually made things worse. We also started noticing an increase in player complaints on Downdetector. We investigated the configuration, wrong health checks, everything, but found no clue except an interesting decrease in kube-dns errors. That only confused us more, but then we found a hint. Just a reminder: our deployed Envoy version was 1.12.2, and this was fixed only in 1.14. Given the timeline and the importance of GO Fest to the company, NGINX was deployed to replace Envoy, and GO Fest passed with no issues from the edge proxy. We went back to our load-test environment to try to figure out how to solve this problem.

Back in our load-test environment, we started by reconfiguring it to match production. We scaled it up to match one-tenth of production and configured it to have ten times the services pointing to the game servers, to help recreate the conditions that caused the problem in production. It was easy: we got DNS errors already when hitting one-thirtieth of the production environment. At that point, I already knew what the problem was, and maybe some of you are already guessing it: it was our configuration that caused the load of DNS queries. Our stateful backend configuration was causing the proxies to query kube-dns four times for each game server. Now multiply that by the number of proxies, add to that the number of game servers, and the number becomes far too high.
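To illustrate the mechanics, here is a hypothetical sketch, with made-up names and values and not the fix we ended up testing, of the kind of cluster definition that drives this. With a DNS-resolved cluster per game server, every proxy keeps re-resolving every game server's name on its own schedule, so the query volume grows with the number of proxies times the number of game servers:

```yaml
clusters:
- name: game-server-7                 # imagine one cluster like this per game server
  type: STRICT_DNS                    # the proxy itself keeps re-resolving this name
  connect_timeout: 1s
  lb_policy: ROUND_ROBIN
  dns_lookup_family: V4_ONLY          # AUTO can add AAAA lookups on top of A lookups
  dns_refresh_rate: 30s               # default is 5s; every proxy repeats this per cluster
  respect_dns_ttl: true               # honor the record's TTL instead of a fixed rate
  load_assignment:
    cluster_name: game-server-7
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: game-server-7.game.svc.cluster.local, port_value: 8080 }
```

None of these values are what we run; the point is simply that each proxy multiplies the resolution work for every DNS-typed cluster it knows about.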
The solution we tested was Google Cloud's new DNS caching add-on, which made the errors disappear completely, but it will require an update to our GKE version.

So, my personal learnings: stop ignoring intermittent errors; test in a production-like environment rather than a load-test environment; and continue evolving the traffic-shift scenario to support future software and infrastructure upgrades. Just a reminder, we will need to upgrade our GKE version.

One last part of Envoy I would like to talk about is our work on extending Envoy, mainly to provide external developers access to our platform without exposing our IP. Unfortunately, I can't share a lot about this work. I can tell you that it includes request validation for HTTP and WebSocket requests, shifting the responsibility for blocking this traffic into the proxy itself. It also manipulates the incoming traffic based on the received message and provides us with additional metrics related to the games. Part of this includes a custom WebSocket filter that we wonder whether it shouldn't be part of the platform itself, even just for the purpose of upgrading WebSocket traffic to gRPC at the proxy. And this is how our deployment truly looks.

I would like to summarize this talk with a "why Envoy" slide. Well, we chose Envoy for the better observability and monitoring, the service-to-service proxying, the configuration, the better scaling experience with xDS, and finally, the extensibility. Thank you for listening, and I would like to thank everyone at Niantic who helped me with this work. For those of you who find this interesting and would like to learn more, we're hiring; feel free to reach out to me and check out our careers website. Thank you, and I will be very happy to answer any questions you might have now.

Hi, everybody. I hope you can hear me and see me, and I will be very happy to answer any of your questions from the chat. So, first question: did we observe any issues when moving from NGINX to Envoy in terms of how the downstream connections are maintained by NGINX versus Envoy? And there is actually a follow-up question here asking whether it was something specific: as Srinidhi mentioned, they have seen an issue where Envoy retains high-throughput connections constantly, whereas NGINX releases them proactively, and did we see such an issue during our migration? Well, we didn't see this as an issue; this was part of the behavior we already knew about in Envoy. As I mentioned, the main difference that we saw between Envoy and NGINX was actually around DNS and DNS querying. NGINX does not query the DNS as frequently, which had actually made us build some workarounds into our environments; with Envoy, we can now remove them. But the only issue that we had was with the number of DNS queries.

Second question: can you elaborate on switching from utilization-based load balancing to count-based load balancing? Does count-based load balancing refer to the number of incoming requests? I forwarded this question to our SREs, the part of our team that is also working on this. I believe this is a GCLB property, but I might be wrong here. And yes, we were mainly just looking at the incoming requests, to ensure that they are being correctly load balanced across the different Envoy instances in the backend service.

And thank you for the IP denylist suggestions, we will definitely look into that. I think we looked at it in the past, but as you noticed from the presentation, we started our journey a year ago, when it was Envoy 1.12, and Envoy has evolved a lot since then; there are a lot of new things that we could be using. So let me get back to the other questions. I'm getting a lot of feedback here on what we should look at for the IP denylist. Thank you, I'll definitely look into them, it's really interesting. I heard that RBAC can help with that.

Well, Rubens, if you'll tell me which slide, maybe I can share the presentation, but I think everything will be online soon. And you are always welcome to contact me through Slack, I'm on the Envoy Slack channels, or through LinkedIn or my email. And thank you, everybody.

Somebody asked about the control plane and I missed that question. So, we are using a control plane; that was one of our moves to make scaling better. We are actually now going to start deploying it everywhere we have Envoy, and it's definitely going to improve our scaling experience. And by the way, for the control plane we're using, of course, LDS and CDS, because our listeners, path matchings, and clusters are really coupled one-to-one.

Well, we are no longer using NGINX itself in most of our places, and where we still are, I believe we're using the open-source one. Well, I'm very happy to hear that and to help, and we are very happy to share from our experience. One of the things, as I mentioned, by the way: we did have to write a custom filter for WebSockets, which was actually a very interesting experience, and we will be publishing it in a couple of months. We need to pass a legal process for that, and then everybody will be able to use it.

As for the number of RPS per core on our Envoys, I do have it somewhere, but I don't really remember, and I don't want to just throw out numbers. So, Mikhail, if you can just contact me, it's Rostak, I will be very happy to answer that. And yeah, thank you, everybody. If there are any more questions, we have a couple more minutes, but if not, then you have six minutes for refreshments. Thank you.

NGINX features that we could not replace with Envoy: the IP denylist, that was mainly it. We actually gained a lot of things moving to Envoy, and we really enjoyed making this move. So, no, actually it was the other way around, except for the denylist switch, and I believe we will find a solution for that very fast. Thank you, bye.