Hello everybody. I'm Yuri, Principal Engineer at Absa Group, and Zach is here together with me today. We are going to talk about our experience of building, maintaining and operating a cloud-native global load balancer for Kubernetes in production. Yeah, hello everyone, it's Zach Anderson. I run the Kubernetes service for Absa. Nice to see you all here, excited. And I'll give it back to Yuri so we can carry on. Cool, thanks, Zach.

So, let's jump straight into the presentation. Where and how did it all start? Somewhere in 2019, Absa Group figured out that the organization needed an open-source global service load balancing function, some cloud-native solution to steer traffic in a smart way over geographically dispersed Kubernetes clusters. And we needed the solution to be aware of the internal workload state inside those clusters, avoiding standard external HTTP end-to-end checks, so as to be truly cloud native. We didn't find any suitable proprietary vendor solution, and there were also no appropriate open-source alternatives. That's why we created k8gb. The project started in December 2019. It stands for Kubernetes Global Balancer, and it was developed in a public GitHub repository from day zero, in a totally open manner, as an open-source project from the very beginning. Slightly more than one year fast forward, we managed to attract a small community, build a mature enough project, and reach CNCF Sandbox acceptance, which we're really proud of. And we have been using k8gb in production for more than one year, maybe already a year and a half.

So, let's talk about some k8gb concepts that are very important from an architecture perspective. It is cloud-native global service load balancing. It is built on top of the operator pattern, meaning that it is a controller which resides on the Kubernetes clusters, backed by an associated custom resource definition. We don't have any single point of failure: there is no control cluster, and k8gb is deployed next to the GSLB-enabled workloads. k8gb utilizes standard Kubernetes primitives for its own operation: ingresses, services, and the associated liveness and readiness checks. The overall operation is based on the battle-tested DNS protocol, which enables us to be highly reliable and to operate on a global scale. DNS is used both for traffic steering and for cross-cluster state exchange. We try to be as environment agnostic as possible, meaning that we automatically configure only the zone delegation on the environment's DNS provider, like Route 53 in public cloud or NS1, and we do not create any other resource records to steer the traffic. k8gb responds to DNS requests from its own embedded CoreDNS process.

So, specifically on the components: k8gb itself is an operator, a controller for the CRD. We used the Operator Framework to bootstrap the project. It was very useful in the very beginning and gave us a good structure to create a powerful operator. CoreDNS is a very important part, as I briefly mentioned before. CoreDNS works in cooperation with the k8gb controller, watching for specific DNSEndpoint resources and providing dynamically constructed DNS responses to steer the traffic according to the global service load balancing strategy. The external-dns project, which is pretty well known in the Kubernetes community, is also used, to configure the zone delegation in the DNS provider of your environment, be it cloud or on-prem.
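To make that CoreDNS piece a bit more concrete, here is a minimal sketch of the kind of DNSEndpoint record the controller maintains and the CoreDNS plugin answers from. It uses the external-dns DNSEndpoint CRD; the hostname, namespace and IP addresses below are purely illustrative, not taken from our setup.

```yaml
# Illustrative DNSEndpoint (external-dns CRD) of the sort the k8gb controller
# keeps up to date and the CoreDNS plugin serves answers from.
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: demo-gslb            # hypothetical name
  namespace: demo            # hypothetical namespace
spec:
  endpoints:
    - dnsName: app.cloud.example.com   # the GSLB hostname clients resolve
      recordType: A
      recordTTL: 30                    # kept low so failover propagates quickly
      targets:                         # only IPs of currently healthy clusters
        - 203.0.113.10
        - 203.0.113.11
```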
So, in our case, when we are operating in AWS, that means configuring zone delegation in Route 53; when we are operating on-prem, it is currently configuration of Infoblox, with a further migration to NS1. And the overall GSLB function, the GSLB strategy, is controlled by a single custom resource called Gslb. That's how application teams get power over global traffic steering and can rely on the liveness and readiness checks of their own application, which directly affect the traffic steering behavior.

So, this diagram is taken from our official k8gb.io website, depicting the multi-cluster scenario. It is a very simple flow. One important thing to highlight again is that k8gb is deployed right into the clusters where the application that needs to be globally enabled is running, the same clusters as the workload. Whenever a Gslb resource is created by the application team or its associated pipelines, the corresponding ingress resource is created and enables the HTTP traffic steering. And whenever the actual end user makes a DNS request, from the browser or by any other means, this request is going to end up on our k8gb and CoreDNS pods. k8gb will return the DNS response according to the global strategy configured in the Gslb CRD, and the requester will get an appropriate IP address, hit the ingress controller, and be matched to the application pod.

So, you can imagine that we can build several load balancing strategies on top of that, and that's exactly what we did. One is round robin: we simply spread the traffic in a random manner over the GSLB-enabled clusters. Another is failover, which we also use frequently and will be using for the demo today. It is a strategy where we can pin one of the data centers, one of the clusters, to be primary. As long as the workload in this primary cluster is healthy, the traffic will be served from that DC1 only; whenever the workload is treated as unhealthy or dead, it will automatically fail over to the secondary cluster. It's important to understand that this failover is cross-regional. In our on-prem case it is distinct data centers, and in the AWS case it is different regions; for us, it's Cape Town and Ireland. So it's really cross-regional load balancing. That's the high-level design of k8gb, and now we can talk about internal adoption by Absa Group. Zach, over to you.

Yeah, thanks, Yuri. That was an awesome overview of the components of the solution. It's good to see the diagrams again. Okay, so from an Absa internal perspective, I'm going to talk a little bit about what we do and why we do it. We've got about 12 tenant teams that are running this. From a cluster perspective, we've got 122 Kubernetes clusters on-prem today, and 36 of them are enabled for k8gb. That's mainly our core API payments engines, our Markets and FX systems. There are also some other services sitting there for authentication; we're using Keycloak, for example. We've got an active setup sitting in those production services. Services are enabled quite easily: like Yuri explained, it's an annotation that gets enabled, and that's why we can roll out these GSLB-enabled services quite quickly. Once it's installed into the clusters, they become globally aware, so it doesn't matter whether it's in one data center or the other, those two clusters then start working together.
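As a hedged sketch of that annotation-based enablement, an existing Ingress can be marked up roughly like this; the annotation keys follow the k8gb documentation, and the host and service names are illustrative, so double-check against the release you run.

```yaml
# Existing Ingress made GSLB-enabled by adding k8gb annotations (illustrative values)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-api
  annotations:
    k8gb.io/strategy: "failover"         # or "roundRobin"
    k8gb.io/primary-geotag: "eu-west-1"  # primary cluster for the failover strategy
spec:
  rules:
    - host: payments.cloud.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payments-frontend
                port:
                  number: 8080
```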
And that's a simple annotation. That's why we currently have 360 core banking services sitting there. So, obviously, today Yuri mentioned that we're using Infoblox. We built this because we could not get a solution to integrate properly into a DNS, or edge DNS, provider. The one that we are definitely going to switch to is NS1, and that's a seamless migration. The nice thing about k8gb is that if we're adding the next edge DNS, it's simply a switch to the next provider. And then our last piece is Route 53, which is going into the cloud, which Yuri has already mentioned. We're going to af-south-1 and eu-west-1, and that enables us to move workloads between on-prem and the cloud quite easily and quite quickly.

So, the biggest question about the whole solution was: how would we onboard new tenants and services? Basically, we drive failover automation as far as possible. We are trying to integrate automated pipelines and make them aware, so that the teams or the developers can enable their own exposed services without going to the network team, or the DNS team, or any other team to fail over their stuff in a failure scenario. And that was the beauty of k8gb: it gave us the functionality to lose a whole data center, one or the other. And if we had cloud and on-prem, we would effectively have four data centers running, so we could run workloads in all four data centers. The nice thing we did, because we are a templatized environment, is that this was simply a module we added into our CI/CD, and we templatized the actual annotations for the teams. All they had to do was put in about three variables, and the annotations would then flow into their pipelines, from dev, SIT, UAT and prod, and enable those services quite quickly. They didn't have to log changes or engage third-party people to help them, and things like that.

So, the key reason why we built this is that we wanted to make sure DR failover was automated. People usually say they don't know what went wrong and where it failed over to; now we have a way to automate it and make sure our services run 24-7. And the cool thing about this, which is quite awesome and I love it, is that the teams themselves are in control of it. So you start pushing teams to become high-performing teams that don't have to go to multiple people. So, Yuri, I'm going to give it back to you, but that's an overview of the internal adoption. I'm excited to see the demo with you, and I'm going to ask a few questions. So, let's go.

Thanks a lot, Zach, for the amazing overview and very rewarding feedback on k8gb. Thank you so much. All right, so we can pretty much jump straight to the demo. The demo setup looks as follows. We have two Kubernetes EKS clusters in AWS; I'll demonstrate the environment with my CLI. One of the clusters is in eu-west-1 and the other in af-south-1, exactly the couple of regions we mentioned before that we operate in. And in each of them we already have k8gb installed, so it's all up and running. Actually, we can quickly go through the components: the main k8gb controller; CoreDNS, which responds to the DNS requests and uses a special plugin to coordinate with the dedicated CRDs; and external-dns, which takes care of the zone delegation. So, what has actually been prepared for this setup?
We deployed a sample workload and created an associated Gslb resource. The Gslb resource looks very simple: apiVersion, kind, namespace, standard metadata. And as you can see, this part of the Gslb spec maps one to one to a standard ingress resource; it is a standard ingress spec, but embedded into the Gslb CRD. And we compose this ingress spec with the GSLB strategy. In this specific case it's failover, and we are pinning the primary to be eu-west-1.

And maybe the point here, right: from an ingress perspective, if teams know how to use ingress, they will be able to use k8gb, because it's almost the same, like you said, right? Yeah, basically. If an ingress already exists in a Helm chart or somewhere, there are two ways: we can extend the Helm chart with a Gslb resource, or we can annotate the existing ingress with a special k8gb annotation and k8gb will react accordingly. That's cool, right? We are using this trick for internal adoption as well.

So, yeah, we have a GSLB test namespace where we have the very popular podinfo application deployed, and we are using it to test the GSLB function. We already have a Gslb applied here with the failover strategy and eu-west-1 as the primary geo tag, exactly the one I showed before. It's already running, so we can look into its runtime status and, basically, its YAML spec. The status depicts the current geo tag and an already healthy record: it populates the associated publicly available IP addresses into the DNSEndpoint because the current backend pods are all healthy. And it does this health check in a very standard, transitive way. We have an ingress, we have a host, a DNS FQDN to respond to, and we have a backend service, the podinfo frontend. Basically, once this service has more than zero endpoints, it is treated as healthy. And that's exactly how it gets populated internally in Kubernetes, according to the pod health check status I described. I'll show it better: this is the endpoints array, and if the endpoints array is not populated, k8gb will automatically treat this workload as unhealthy.

So, that's how it works in the terminal. We can run a very simple script which will test this specific FQDN, the failover-test FQDN. That's exactly the DNS name we are testing. Currently it's resolvable, and it returns exactly those IPs that are treated as healthy and returned from the European data center. And, just in case, they are equal to the IP addresses of the network load balancer exposed in the AWS setup, which is associated with the ingress controller in this specific reference setup of ours. So, we can run the test script. What it does is actually curl over this endpoint. So, that's like a user connecting to that endpoint all the time? Yeah, pretty much. I'll show the whole chunk; it's a standard curl. In the sample output we intentionally populate a custom message with the geo tag according to the pod location, so whenever we reach af-south-1 it will be immediately visible in the output. But currently, again, we are using the failover strategy, it is pinned to Europe, and the workload in Europe is healthy. So the setup is set. And before simulating a failure, a disaster in Europe, we can look around and see that we have a fully symmetric setup in Africa: exactly the same workload, and k8gb is obviously also installed there. Yeah, just double-checking that it is Africa.
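For reference, the failover Gslb spec that both clusters carry looks roughly like this; it follows the upstream k8gb examples, with the host and podinfo backend names as stand-ins rather than our exact manifest.

```yaml
apiVersion: k8gb.absa.oss/v1beta1
kind: Gslb
metadata:
  name: failover-podinfo        # stand-in name
  namespace: test-gslb          # stand-in namespace
spec:
  ingress:                      # embedded one-to-one copy of a standard Ingress spec
    rules:
      - host: failover.cloud.example.com
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: frontend-podinfo   # health follows this service's endpoints
                  port:
                    number: 80
  strategy:
    type: failover              # serve from the primary while it is healthy
    primaryGeoTag: eu-west-1    # the pinned primary cluster/region
```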
So, now it's af-south-1, all good. And absolutely the same Gslb spec: same strategy, no special configuration required, exactly the same spec as was applied in Europe. Africa is also aware that the main primary cluster is eu-west-1; that's why it is returning consistent DNS responses, with the European IP addresses as well, even when... Yeah, maybe to stay here for a second: if we set it to round robin, we'd obviously see af-south-1, eu-west-1, af-south-1, eu-west-1. Yeah, you would see a mixed response. But here we are sticking to one or the other. So, yeah, we have a symmetric, consistent configuration on both of the clusters and everything is working as expected. Everything is fine in Europe.

So, let's emulate an outage in Europe with the standard Kubernetes approach of scaling replicas down to zero. I'm just scaling down the testing workload; let's give it one or two reconciliation loops and see how it behaves. Let's wait just a little bit. Yeah, and you can see that it's already returning 503 in the testing loop. That's important to discuss. Here we can already see that k8gb has reacted, detected the malfunction of the application, and is returning the African IP set. But for roughly 30 seconds we were still hitting the standard ingress 503 while the application was down in Europe. Why did it happen, why did we have this small window? It is because at this global load balancing scale we are operating with DNS, and DNS has its advantages and its limitations. One of the limitations is operating within a TTL, a time-to-live value. We are trying to keep it as low as possible; for us it's 30 seconds, which is already pretty aggressive. From an operational standpoint, Zach, please keep me honest, for us it worked pretty nicely. Yeah, that's a very important point. This is failing a major region over, so from that aspect the TTL was quite acceptable. All the other backend stuff, like databases and things, would also have failed over. So the app would have come up, the databases would have failed over, and in that scenario we would have been running within a minute. That's more than acceptable for what we have inside our bank. Perfect, thank you, Zach.

So, as you can see, it has now done the successful failover end to end: a request from a customer is hitting the pods in Africa. We can reach the African cluster and look around at what is happening there. I think, Yuri, the other key thing about what's being shown with k8gb is that if you deploy it to another region, it's just going to pop up as another geo tag, potentially a primary, that you can either switch to or configure. So if you literally did have both regions failing and you had to go to your DR region, which could be the US South or US East, you could be running within five minutes, potentially, if your pipelines have been set up for that case. From the perspective of recovering a service, the k8gb installation is such that you can fail it over, even reproduce the installation and get going, if your pipeline is done the way you wanted to set it up. You can do it manually, you can do it quickly. And I think the key thing for me was the easy, quick way we could get k8gb up and running when we lost, say, a data center. And those chaos tests that we do actually show that, right? We could lose everything and we'll be up before other teams are up, because we have this tool. Good point, thanks a lot, Zach. And yeah, very cool moment: we can pretty much move the primary geo tag to another data center and repeat, if you like, right?
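In spec terms, "moving the tag" and the 30-second window discussed above both come down to the strategy block. This is a hedged sketch, assuming the primaryGeoTag and dnsTtlSeconds fields as described in the k8gb docs, with the geo tag values themselves illustrative.

```yaml
strategy:
  type: failover
  primaryGeoTag: af-south-1   # repointing the primary; clusters converge on the new tag
  dnsTtlSeconds: 30           # resolver cache bound; failover visible within roughly one TTL
```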
So, back to our scenario. Again, a consistent response: the primary is still eu-west-1. We didn't re-pin anything; it's just the secondary cluster responding, and the secondary cluster is aware that whenever the workload in the primary one is healthy again, it should fail back. So, let's do that. First of all, we need to switch back to Europe, and second, we need to scale the workload back up. It should be healthy soon. Let's look at the Gslb status: it has already picked up the healthiness in Europe, obviously faster given its proximity to the workload. And what about Africa? Not yet. It requires an additional reconciliation loop and also some cross-cluster synchronization, somewhere around 30 seconds. I think this was the cool thing for me: even when the data center came back, we were still running, and then it would fail back to the primary data center without us even doing anything. And that was really, really cool to see. Great feedback. And yeah, so it's internally updated, and after the DNS TTL plus a few seconds it's already available end to end. The end-to-end test also shows it has fully failed back to Europe.

So, I guess we can conclude the demo. That's cool, Yuri, thanks, man. Thank you, Zach, for the support. And yeah, looks like we're out of time. So, thanks a lot for visiting our presentation, and please visit us at k8gb.io. We are very open to any feedback, contributions and suggestions. Thank you so much. Thanks a lot, guys, visit the site. Awesome to see you. Enjoy, have a great rest of the conference. Cheers. Bye-bye. Cheers.