All right, thanks a lot for joining. Let's chat about surviving regional failures. First slide: let me introduce an amazing end user of the project we're going to talk about, Nuno, and the creator of k8gb, Yuri. Thank you. So, show of hands, who has heard about k8gb, the Kubernetes Global Balancer, before? Amazing, many more people than I expected. I appreciate it.

So what is k8gb? Kubernetes Global Balancer, an open source CNCF Sandbox project. It is designed to solve the global server load balancing (GSLB) problem in a cloud native way. It runs on top of Kubernetes and is designed to be as simple to use as possible: it just reconciles a single CRD of the Gslb type. It doesn't require any management cluster, which is a strong point of the solution, and hence there is no single point of failure. It is designed on top of the well battle-tested DNS protocol and supports a pretty good set of global load balancing strategies.

A little bit of project history and context on why it was even created. It originated at Absa, an amazing South African bank I had the honor to work for. They are running, I believe still, a huge Kubernetes fleet, more than 100 clusters, and they required some form of GSLB solution that is cloud native, one that is aware of Kubernetes primitives down to the pod health level. Back at Absa, we needed a solution that would steer traffic to geographically disparate clusters, where geography could mean different data centers, different regions in the AWS cloud, or a mix of them. We tried to find a proprietary vendor solution, but nothing was cloud native enough. Most global load balancing solutions rely on simple, straightforward HTTP checks, and that's just not fast enough for the cloud native world. So we decided to build a solution from scratch, and it was developed in the open from day zero. It wasn't something born internally and then open sourced; we started it as an OSS project from the very first day, and I think that very positively affected the overall design of the project. We never took any shortcuts.

In a nutshell, how it works is very simple. You have at least two clusters that are geographically dispersed. In this specific example on the diagram, one cluster is in a European region or data center, and a secondary one is in the United States. We run the k8gb controller on both clusters, so k8gb sits next to your application workload. Again: no control cluster, no separate management layer, everything is distributed. And we integrate with some environment DNS. On this diagram it is Route 53, but that's just one of the examples; you can integrate with virtually any kind of DNS server depending on your environment. Now assume the workload dies for some reason on the primary cluster and cannot recover in the standard Kubernetes manner. k8gb is smart enough to steer the traffic to the secondary cluster, and that makes your application globally highly available. So that's roughly the thing.

It's worth mentioning the core principles that were imprinted into the project from the very first day, and thanks to Donovan from Absa for actually putting them on paper even before we started to code it up. It should be Kubernetes native, meaning it relies on Kubernetes primitives as much as possible.
So we are not reinventing the wheel. We are not creating additional health checks; we derive the pod health status from the standard Kubernetes Service primitive. No service endpoints available means the pods are dead and the application is not functioning, right? We have as simple an interface as possible: a single CRD that is tightly coupled with the very standard Ingress Kubernetes object. Again, no control cluster, which is very powerful: we distribute the solution next to the workloads, and that makes the solution itself highly redundant. It is based on DNS, and yes, DNS runs the internet. We obviously also inherited some DNS limitations, which we'll talk about later, but overall it's very reliable stuff. And it is environment agnostic: we try to be as independent from the environment as possible. To integrate k8gb into an environment, you need very simple stuff: a zone delegation, meaning just a couple of records on your environment DNS, like the Route 53 from the previous diagram. It's an NS-type record plus a glue record, and that's it. And it is usually done automatically by k8gb.

As to the user interface, it's really just one single CRD. As you can notice, it has an embedded standard Ingress in its spec, and a strategy section dedicated to the actual global load balancing strategy. The only mandatory field is the type, and in the case of failover you also need to assign a primary geotag; the TTL and timeout fields are just there to override the defaults. That's all it takes to enable global load balancing: you just drop in this custom resource, and assuming k8gb is installed on at least two clusters, it will create a real global load balancing setup for you and steer the traffic. You instantiate the Gslb kind on both clusters so they can discover each other. A sketch of such a resource follows below.

If your development teams already have some form of Helm charts, a standard application package including the Ingress, and you maybe don't even want to teach them yet another CRD, even a single one, there is the possibility to just put k8gb.io annotations on the Ingress, and the controller will pick it up and create the CRD automatically in the background. So your development teams, the people dealing with the Helm charts, don't even need to learn that stuff; they just declare the strategy and the associated geotags and it will just work. Very simple.

And you're doing Gateway API support next, right? Yeah, Gateway API support is on the roadmap. The Gateway API is also known as kind of Ingress v2, and we are trying to follow the latest versions of the cloud native APIs, so it's definitely on the roadmap right now.

Load balancing strategies, so actually the core functionality. We started initially with a simple round robin and a failover, and I think they are still the most popular strategies people use. With round robin, you have two clusters, you put Gslb CRs on both of them, and after that the DNS responses are provided in a random manner from both clusters, non-predictable by design. Failover is probably the most popular and straightforward one; it's the one we showed in the diagram. You have a primary cluster and a secondary cluster. If the workload is dead on the primary, it steers the traffic to the secondary; when the workload is healthy again, it steers it back. Super straightforward.
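To make that concrete, here is a minimal Gslb resource of the kind just described. It's a sketch modeled on the upstream k8gb examples (API group `k8gb.absa.oss/v1beta1`); the hostname, service name, and geotag are made up for illustration:

```yaml
apiVersion: k8gb.absa.oss/v1beta1
kind: Gslb
metadata:
  name: demo-gslb
  namespace: demo
spec:
  ingress:                          # embedded, standard Ingress spec
    rules:
      - host: demo.cloud.example.com
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: demo-app    # your ordinary application Service
                  port:
                    name: http
  strategy:
    type: failover                  # the only mandatory field, plus...
    primaryGeoTag: eu               # ...the primary geotag for failover
    dnsTtlSeconds: 30               # optional override of the default TTL
```

And the annotation-driven alternative for teams that already ship a standard Ingress would look roughly like this; the annotation keys `k8gb.io/strategy` and `k8gb.io/primary-geotag` are per the k8gb docs, everything else is illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-app
  namespace: demo
  annotations:
    k8gb.io/strategy: failover      # controller generates the Gslb CR for you
    k8gb.io/primary-geotag: eu
spec:
  rules:
    - host: demo.cloud.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-app
                port:
                  name: http
```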
Then, after the project matured, we got some more advanced load balancing strategies, like weighted round robin. And thanks, Mikhail, if you're watching, for implementing this. There are actually quite advanced statistical algorithms behind it: we can apply weights to the associated cluster geotags and in this way predictably balance the traffic between multiple locations. And the GeoIP strategy, thanks, Dinar, for implementing it. It is more classic in terms of finding the closest geographical cluster, from a global traffic management perspective. But it's tricky to implement because it requires a crafted GeoIP-compatible database, and it also has a strong environment requirement: your DNS resolver has to support the EDNS0 client subnet extension, which propagates the client subnet so the DNS-based load balancer knows where the client is coming from. So it is doable, it is working, but it has many more dependencies than the other strategies.

Supported integrations. As mentioned, k8gb can work on any CNCF-conformant Kubernetes cluster, no restrictions, and with pretty much any ingress controller. We heavily tested it on the most popular ones, ingress-nginx and Traefik, but there is no technical limitation that would keep it from working with any ingress controller compatible with the standard APIs. As for the edge DNS, meaning the integration into the providers and environments you run in: the project started back at Absa, as I mentioned, and we used to have Infoblox, so that's the first integration we implemented. Then we went to public cloud support with AWS Route 53. Then we had an amazing experience with our partners at NS1, and then we developed RFC 2136 support: basically the ability to work with any classic DNS, like BIND or Windows DNS. You can hook into the DNS server and create the zone delegation automatically, so you can integrate into virtually any environment. What's coming next is Azure public DNS support. It will be released really soon; we already have the code working, and you will see it today.

From the observability standpoint, we were obviously thinking about day two operations. We have pretty rich metrics around the reconciliation loop of the Gslb type, and associated traces: OpenTelemetry-compatible events get propagated through the OTel Collector. And thanks to the contributor who implemented it, that is an amazing contribution.

Usually at this point, when I deliver k8gb-related talks, I show the live demo. But today I have an amazing user with me, so I can just relax. The stage is yours.

Thank you very much. So for those of you who don't know Millennium BCP, we're the largest privately owned bank in Portugal. So after South Africa, the second biggest user is in Portugal, banking as well. Our goal at Millennium is to use five geographical regions in Europe, covering Azure, GCP, and on-prem regions as well, and to basically do the same thing throughout all of those regions. And by "the same thing" I mean infrastructure as code and GitOps. What that means is that, for instance, our developers don't particularly care where things run; that's on the infrastructure teams to ensure. What our developers ensure on their side is that their output gets converted to an OAM file, an Open Application Model definition file, roughly like the sketch below. And then, infra picks up that file and does things.
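For a sense of what "that file" can look like: OAM applications are commonly expressed as a KubeVela Application. This is a rough sketch under that assumption; the component name, type, and image are invented placeholders:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: hello-kubernetes
  namespace: demo
spec:
  components:
    - name: hello-kubernetes
      type: webservice                        # KubeVela built-in component type
      properties:
        image: paulbouwer/hello-kubernetes:1.10   # demo image, guessed
        ports:
          - port: 8080
            expose: true
```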
Some of those things include using GitOps to pull that file into a management cluster and having KubeVela and Crossplane spin up everything that needs to be spun up. And that includes geographical distribution.

Yeah, maybe if I can jump in on the amazing combination of Crossplane and k8gb: I couldn't not mention it, I still need my hotel to be paid. Well said, well said.

And even though k8gb is simple to integrate, honestly, our users don't care about it. Their thing, their application, just runs. What we typically use in terms of run environments is this sort of pattern where there's a management cluster somewhere that runs, well, management planes, things that have state that needs to be backed up, for instance Crossplane, KubeVela, and other planes. And the only thing those management clusters do is spin off run clusters. For us, a run cluster is a pair of Kubernetes clusters; that's how we ensure those 99.95 SLOs. And they are not only multi-zone, they are multi-region. For instance, in this pair, one cluster is in Azure North Europe and the other is in Azure West Europe, and k8gb does its thing. Let's see how. The URL is live, you can hit it now if you want, and the code is in the URL down here. Feel free to browse if and when you want to.

So maybe while you're firing it up, it's worth mentioning that it's a very similar case to what we had at Absa: at Millennium BCP, applications get deployed on top of at least two clusters just by default. And it is totally transparent to the users, right? They are just globally available by default, and this global availability is driven by k8gb. Yeah.

So, a quick overview of the code you can find in these repos. You'll find a repo called IAC, and here we are just spinning up the resources for the demo: network, Key Vault, management cluster, and a very long, or not, piece of Terraform, specifically a call to the high-availability AKS module. Without going into too much detail, what we have here are two instances of Terraform that basically spin up a cluster, get Flux into it, get external-dns going, get the external-secrets operator in there, and also generate the initial files for GitOps, which live in the other repo next to this one. So here in cluster config, these folders that you see, Run North and Run West, were generated by Terraform. And in the end, what we have here in the Azure portal are two run clusters, one in North Europe, the other in West Europe, and we'll be targeting those two clusters.

Since all of this is automated and no one actually ran kubectl against those clusters, we also needed to take care of deploying k8gb there. So what we do is, as you can imagine, we have a folder in the Git repo for k8gb with the namespace stuff and a generalized HelmRelease definition for Flux to pick up, which basically has variables for the cluster geotag and the pair geotag for that cluster. Kustomize puts the right values there, and, behold, you have two clusters connected to each other. Some control plane ran the Terraform module and we're up and running. A sketch of that HelmRelease follows below. And by the way, this is how we do service continuity and disaster recovery: we just spin up two new clusters if either of those two fails. No worries in recovery, so to speak.
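A minimal sketch of that generalized HelmRelease, assuming Flux's `helm.toolkit.fluxcd.io/v2beta1` API and the k8gb chart's documented value names (`dnsZone`, `edgeDNSZone`, `clusterGeoTag`, `extGslbClustersGeoTags`); the zones are invented and the `${...}` variables are placeholders that Kustomize fills in per cluster:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: k8gb
  namespace: k8gb
spec:
  interval: 5m
  chart:
    spec:
      chart: k8gb
      sourceRef:
        kind: HelmRepository
        name: k8gb                 # points at the upstream k8gb chart repo
        namespace: flux-system
  values:
    k8gb:
      dnsZone: demo.cloud.example.com        # zone k8gb serves answers for
      edgeDNSZone: cloud.example.com         # parent zone holding the delegation
      clusterGeoTag: ${CLUSTER_GEO_TAG}      # e.g. eu-north for Run North
      extGslbClustersGeoTags: ${PAIR_GEO_TAG}  # e.g. eu-west, its pair
```

The same definition deploys to both clusters; only the two geotag variables differ, which is what lets the pair discover each other.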
This means that when we deploy an application using GitOps to this pair of clusters, and for this demo we're using the hello-kubernetes Docker image, besides doing just the standard HelmRelease for that app, well, the only thing here worth noticing is that this app will just say welcome from Kubernetes, and there's a variable there for the cluster name. We are also deploying that CRD, the same thing that was on the previous slide, just saying the primary geotag is North Europe and the type is failover. That's that. And getting here, we could do it like this, putting the CRD into Git, or we could do a kustomize patch if we had the ingresses already declared in the repo. That's up to you. The thing is, it's one file, either the Gslb CR or a kustomize patch.

With this in place... I knew this was kind of going to break, so welcome to my curl test. Is this stuff accessible? No. Just in time. What is it? All right, is it blocked by the conference organizers? No, that's the cool access point, isn't it? We have the event Wi-Fi. No, it's not Crossplane. Double-check, it shouldn't be running Crossplane, I believe. It's on Upbound. OK, try to reconnect. Sorry, we are doing a manual Wi-Fi failover. No, it's on HTTP; can you access HTTPS? Yeah, yeah, thank you, appreciate it. So, plan B. OK, plan B. Today? Last attempt. Last attempt, and then we have a last-resort solution on my local setup I can show you. Come on. Yeah, go for it. No, I've got it. I've got it. OK. Nice. You know what it was? The network cable. Oh. Up and running.

So as I was saying, we have the North Europe cluster replying currently. Scenario one: you're going to be doing planned maintenance, whatever, on this cluster, and you want to shift everything to West Europe. You could do something as simple as: here's my Gslb custom resource, I'm going to say this is going to West Europe, and it's a git commit away. There you go. The one-field change is sketched below. And just to save us a bit of time, I'm just going to run flux reconcile on both clusters, so we don't have to wait for Flux to sync. And with this in place, we should see the reply move from North to West. Let me... and... now: West. You see?

Do you want to do a quick explanation of what happened behind the scenes? Yeah, so basically k8gb picked up the new primary geotag on both clusters, and thanks to the zone delegation that is configured in the environment DNS, we are able to dynamically craft the DNS responses according to the strategy from our internal CoreDNS that is part of k8gb, and return the appropriate DNS response. And the reason, well, one of the reasons why this is fast is because if you look at the public DNS zone that's behind this, the Azure public DNS zone we are plugged into, nothing changed. We're not updating records in a lot of places. That's a great point, right? We are not using external-dns to populate it centrally; we serve the DNS requests from our own CoreDNS, and only the zone delegation is configured. So this is an NS record for the GSLB DNS, and a glue record that points to the set of k8gb-enabled cluster nodes. So we're replying straight out of the Kubernetes clusters where the workload is. Exactly.

That was planned maintenance. Cool. What about unplanned maintenance? So we are replying from West. Let me get into that cluster and just remove the replicas, which is exactly what it sounds like: killing the app. Oh, now we're replying from North. No service endpoints in the West Europe cluster, so k8gb will just serve DNS that says: go there, North Europe, that's the one. Just like that. And honestly, we are shifting pretty fast now, now that we're out of network cables. But worst case scenario, with all the DNS TTLs in the middle and all of that, at most this takes less than a minute.
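For reference, that scenario-one flip is literally a single field in the Gslb resource in Git. A sketch, with the demo's geotags guessed as eu-north and eu-west and the rest invented:

```yaml
# demo-gslb.yaml -- the only edit for planned maintenance is primaryGeoTag,
# then a git commit and Flux syncs it to both clusters
apiVersion: k8gb.absa.oss/v1beta1
kind: Gslb
metadata:
  name: hello-kubernetes
  namespace: demo
spec:
  ingress:
    rules:
      - host: demo.cloud.example.com
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: hello-kubernetes
                  port:
                    number: 8080
  strategy:
    type: failover
    primaryGeoTag: eu-west   # was eu-north; DNS now prefers West while healthy
```

One commit, one reconcile, and both CoreDNS instances start answering with the West Europe ingress while it stays healthy.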
Right, Yuri? Yeah, basically, to that point: we are based on DNS, which is a great, stable protocol, but we also inherit some of its limitations. So we operate with a very short time-to-live (TTL) value. For demo purposes it's something like 30 seconds, to enable a fast switch; in production, I've noticed people tend to raise it a bit so it's less aggressive. But in general, it is one reconciliation loop of the Kubernetes controller overlapped with the TTL; that's how fast you can expect the failover to happen.

Do you want to talk a bit about internals? Yeah, we have five minutes, so maybe I'll try to quickly explain how that stuff actually works. We have the k8gb controller, which encapsulates the core logic of the solution. It continuously watches for Gslb CRs, the ones we showed before, with the tightly coupled standard Ingress embedded. The Ingress is associated with a very standard Kubernetes application Service and your application pods. As I mentioned before, it's just the service endpoints: we need no additional health check, no additional logic. If the service endpoints array is empty, the application is considered failed.

An integral part of the solution lives in the k8gb system namespace. We use the DNSEndpoint CRD and custom resource instances that we reuse from the external-dns project, and we populate their entries dynamically according to the global server load balancing strategy. We also run a CoreDNS, a special build. It's not the Kubernetes system CoreDNS; it's a standalone deployment that is an integral part of k8gb as a solution. It has a special plugin that is capable of reading data from a Kubernetes custom resource, so it effectively reads dynamically from the Kubernetes API and serves the dynamically crafted responses according to the strategy. And for the integration part that we discussed several times, an NS record plus glue records configure the zone delegation on the edge DNS. Again, in this case it's Route 53, but as we discussed before, it can be anything. We also use external-dns as part of the project, so it is able to configure that automatically for you in your environment.

So that's what's happening in the scope of a single cluster, right? What about the multi-cluster coordination stuff? Where is the state? The main part, as was already mentioned, is the zone delegation configuration. In the first rounds of the reconciliation loop, the clusters figure out the health of their local application, and in parallel they configure the zone delegation. Then, to actually craft a proper DNS response according to the desired GSLB strategy, we need some form of coordination, and in this case we use the same DNS. We use the environment DNS for the clusters to discover each other, and then each cluster cross-polls the peer's CoreDNS on a special FQDN for its local targets, so both clusters are aware of each other's healthy endpoints. On this diagram it's actually round robin in action: they merge the two arrays of healthy endpoints together and return the DNS response in a consistent way. The most important part is that both CoreDNS instances responsible for the zone always return a consistent response, no matter which cluster is hit by the DNS client; the records involved are sketched below.
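As an illustration of that coordination, here is roughly what a generated DNSEndpoint can look like for round robin. This assumes external-dns's `DNSEndpoint` API and k8gb's `localtargets-` naming convention for per-cluster records; the IPs are invented:

```yaml
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: demo-gslb
  namespace: k8gb
spec:
  endpoints:
    # the public answer: healthy ingress IPs merged from BOTH clusters,
    # so either cluster's CoreDNS returns the same consistent response
    - dnsName: demo.cloud.example.com
      recordType: A
      recordTTL: 30
      targets:
        - 203.0.113.10     # this cluster's ingress
        - 198.51.100.20    # peer cluster's ingress, learned via cross-polling
    # the per-cluster answer the peer polls to learn OUR healthy endpoints
    - dnsName: localtargets-demo.cloud.example.com
      recordType: A
      recordTTL: 30
      targets:
        - 203.0.113.10
```

The `localtargets-` record is what the peer cluster polls; the merged record is what real clients get, identically from either cluster.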
Yeah, a little bit of CNCF landscape stats: it's a CNCF Sandbox project with pretty good numbers and reasonably active maintenance. We were also part of the CNCF Security Slam North America at the end of last year, and it affected the project very nicely. We now have SLSA enabled, we have cosign, SBOMs, all the fanciest release pipeline stuff. We have a diverse group of maintainers: three from the original Absa, one maintainer from Transform, and one from Upbound, me. We got a recommendation from the CNCF Technical Oversight Committee a couple of years ago to make the group of maintainers more diverse, meaning not from a single company, and it happened this way. Well, effectively it happened because a couple of Absa folks left, including myself, but that's an implementation detail.

And the roadmap, at a very high level: the Azure integration you've already seen in the demo; it works, it's coming in the next release, we just need to polish the Helm chart integration and UX a little. GCP, to cover all of the big three. Gateway API support, to support the newest, next version of Ingress. And the full roadmap is publicly available.

And please, if you like the stuff, join the project. Try it out. Join us in the CNCF Slack; we have a dedicated channel and we're usually very responsive. Please star us, it really helps the project's adoption. If you ever use k8gb and you are confident enough to say it publicly, please send a pull request to ADOPTERS.md. And any issues and PRs are welcome. So thank you so much.