Hello, and thank you very much for coming to this talk today. If you're watching this as a recording, then unfortunately I was unable to be with you in person, but thank you for coming all the same. Today we're going to be talking about rolling out the red carpet for production Kubernetes clusters with kube-vip. My name is Daniel Finneran, also known as thebsdbox on GitHub, Twitter, and elsewhere, and I'm currently leading the engineering efforts in developer relations at Equinix Metal.

To kick things off, a bit of background on the kube-vip project. Many, many roles ago I was predominantly focused on helping customers and end users roll out bare-metal Kubernetes servers and clusters. Out of that I spawned a bare-metal provisioning project to alleviate a lot of the issues I was finding when getting these clusters rolled out. It started as quite a simple project to automate the deployment of bare-metal servers: it would stand up the operating system, and once that was automated it developed into a provisioning engine that would stand up not only the operating system but also the Kubernetes cluster that sat on top of it. I then took it to the next level by building a CAPI (Cluster API) provider to automate the entire end-to-end platform. This is where I started to hit a number of problems, mainly around lifecycle and around turning these clusters into something a bit more production ready.

The typical architecture of a Kubernetes cluster is a control plane, which should be made up of more than one control plane node, and then as many workers as required to run applications and workloads. Being highly available and production ready typically means we really need to protect the control plane, because without the control plane we no longer have access to do anything with that Kubernetes cluster. Your workloads will more than likely carry on running, but we can no longer change things, and we can no longer inspect the state of what's actually running inside the cluster. So we need to make the control plane highly available.

For a highly available Kubernetes cluster we obviously need more than one control plane node, but we also need a number of other components to provide highly available access and redundancy in the case of failure. Previously, in a lot of examples, you'll find people putting additional nodes in front of their Kubernetes clusters to provide things like load balancing and highly available IP addresses. So we need to add capacity to cater for that, and those nodes then provide access to the control plane nodes that sit beneath them.

So what's the bill of materials required to provide that highly available access? If we drill down into one of these nodes that sits in front of our Kubernetes cluster and look under the covers, we typically need a clustering technology that ensures one of these front-facing nodes is elected the leader, and, in the event of that leadership changing, is able to reflect that change to the network so that traffic goes to whichever node has now been elected leader.
We also need the capability to load balance across the control plane nodes that sit beneath it.

If we look at all of this, and these are the issues I was facing, it incurs a lot of operational overhead. From an automation perspective there are a lot of additional pieces of software required to provide that functionality: we need to automate the load balancing, and we need to automate the clustering technology and the tooling that provides the virtual IP addresses. That requires operational knowledge of all of that tooling in order to design it and implement it in the infrastructure. And from a cluster lifecycle perspective, each of these bits of tooling has its own configuration and its own lifecycle, all of which makes it quite hard to automate. That's where the project spawned from.

Then I realised we could take it a little further. We have our Kubernetes cluster up and running, but pods inside a Kubernetes cluster typically can't be accessed from outside the cluster, and previously we would need additional technologies to expose them to the outside world. A service of type LoadBalancer is typically used to expose a collection of pods to the outside world through an external IP address. I realised I had already implemented a lot of the technology needed to do that, and that's where kube-vip went from providing just highly available, production-ready Kubernetes clusters to also providing load balancer functionality that allows external access to pods inside your cluster.

So this is what I'm going to be talking about today. Kube-vip's background, which I've just covered. We'll look at the architecture of kube-vip and the protocols it uses to expose things to the outside world. We'll look at what highly available Kubernetes clusters actually look like with kube-vip. We'll discuss load balancer services, what they look like and how they work. Then a little bit of roadmap in terms of what we're working on next, and hopefully time for some questions.

In the initial design, kube-vip was designed to sit outside of the Kubernetes cluster, and for it to work we needed a way of holding a leader election. We originally opted for Raft, which already runs inside Kubernetes through etcd. Raft requires an odd number of members for the election process to work, and it holds leadership elections as needed: if one of the nodes becomes unhealthy or doesn't respond in time, a new election takes place and one of the other nodes becomes the leader of the cluster. This worked for standing up Kubernetes clusters: a node would become leader and handle the highly available side of things. Unfortunately, during upgrades, and in some cases node failures, we would end up in a position where Raft was unable to elect a leader, and without a leader there was no longer a functioning cluster. So we decided to look at alternative ways of holding leader elections, and it turned out there was an easy way of doing this.
The Kubernetes API actually provides functionality called leader election, so we can make use of that to provide ours. Using the Kubernetes Go SDK, we can have some code that connects to the API and says it wants to participate in a leader election. Kube-vip has this code within it: a number of kube-vip instances connect to the Kubernetes API and say, "I want to hold this lease, I want to be the leader." The Kubernetes API then decides which one can hold the lease, and at any point, whichever instance holds the lease is the leader of that election. A node can relinquish the lease when it shuts down, or, if it becomes unresponsive, a timeout occurs and the process restarts: all of the other participants ask for the lease again and one of them gets it. This gives us exactly what we need: one participant becomes the leader, and in the event of failure, upgrades, or anything lifecycle related, if something times out or fails, one of the other nodes wins the leader election and becomes the leader of the cluster.
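As a rough sketch, and assuming in-cluster credentials, participating in that election with client-go (the Kubernetes Go SDK) looks something like this. This isn't kube-vip's actual code; the lease name and the timings are illustrative:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Running in-cluster, the pod's service account provides API access.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each instance identifies itself, typically by hostname.
	id, _ := os.Hostname()

	// The Lease object every instance competes for.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-vip-lock", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,              // relinquish the lease on clean shutdown
		LeaseDuration:   15 * time.Second,  // how long a lease is valid
		RenewDeadline:   10 * time.Second,  // the leader must renew before this
		RetryPeriod:     2 * time.Second,   // followers retry at this rate
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// We hold the lease: bring up the VIP and advertise it.
				log.Println("elected leader; advertising the virtual IP")
			},
			OnStoppedLeading: func() {
				// Lost or relinquished the lease: stop advertising.
				log.Println("lost leadership; releasing the virtual IP")
			},
		},
	})
}
```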
When a node becomes the leader for the first time, it needs to inform the wider network that traffic should now come to it. There are two technologies we rely on within kube-vip for this.

The first is ARP, which is a layer 2 protocol. ARP allows a node to broadcast an update so that the network knows where to send traffic destined for a particular IP address. In the quick example on the left we have two nodes, and when one node comes up for the first time it wants to tell the network that its IP address is linked to a particular MAC address, a MAC address being the hardware address built into the network card. It can broadcast to the network: "to get to this IP address, send your layer 2 traffic to this MAC address." That is effectively how layer 2 traffic works: when we want to send traffic to an IP address, we look up in an internal table which piece of hardware we should actually send the traffic to. So whenever there is a leader election, we send that layer 2 update and the network knows where traffic should now go. (I'll show a rough sketch of what that broadcast looks like at the end of this section.)

BGP is a layer 3 protocol. That means devices can publish routes to a networking device, so that when traffic is routed through it, it holds the knowledge of where that traffic should go as a next hop. In this example we have a number of servers, a top-of-rack router, and a client device, in this case a laptop. All of the devices that want to share routes need to participate in what's called peering: they connect to the router and advertise that, to get to a particular address or range, traffic should be routed to them. Here we can see the second server, the .21 server, also has an additional IP address, 10.0.2.5, and it is peering with the router and advertising that all traffic for that 10.x address should be sent to its .21 address. That allows the router to know which node to send traffic to in order to reach that next-hop IP address. So when the client wants to get to 10.0.2.5, it connects through the router, which could be its default gateway, and the router knows to send the traffic to the .21 host, which then lets it reach the 10.0.2.5 address. One additional benefit of BGP is that it typically offers load balancing out of the box via the router: we can have multiple nodes all advertising that 10.0.2.5 address, and a client going to that address will be sent to one of the nodes participating in the peering.

A quick overview of the pros and cons between the two. ARP is a standard protocol that has existed in network equipment for many years and doesn't require anything special. BGP, however, does require layer 3 routing, so either a top-of-rack switch or a router that supports it is required, and some hardware vendors may require additional licenses for it to work. ARP poisoning can disrupt the network: this is where a malicious actor on the network starts advertising false IP-address-to-MAC-address mappings, which effectively lets them black-hole traffic. For instance, I could tell everybody that to get to a particular IP address they need to go to a fake MAC address, at which point traffic starts failing because their ARP tables have been poisoned. BGP, on the other hand, can mandate authentication, and you can have ACLs and rules on the routers that stop malicious routes being advertised. Some switches, both virtual and physical, can also restrict ARP updates: if things change too rapidly, they will start dropping those broadcasts rather than letting them propagate further into the network. BGP, because it's layer 3, requires firewall access to the router on TCP port 179. ARP is fantastic for small networks and doesn't require expensive or clever hardware, so it's really good for edge and smallish network segments, whereas BGP ultimately powers the internet, so we can see how well BGP scales.
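As promised, here's roughly what that layer 2 announcement looks like on the wire: a gratuitous ARP, an unsolicited broadcast saying "this IP is at this MAC". The sketch below just builds the frame bytes; the MAC and IP are made up, and actually transmitting it requires a raw AF_PACKET socket bound to the interface, which I've left out:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// buildGratuitousARP builds the raw bytes of a gratuitous ARP reply: an
// unsolicited broadcast announcing "ip is at mac", which causes switches
// and hosts on the segment to update their ARP tables.
func buildGratuitousARP(mac net.HardwareAddr, ip net.IP) []byte {
	frame := make([]byte, 42) // 14-byte Ethernet header + 28-byte ARP payload
	broadcast := net.HardwareAddr{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}

	// Ethernet header: broadcast destination, our source, EtherType 0x0806 (ARP).
	copy(frame[0:6], broadcast)
	copy(frame[6:12], mac)
	binary.BigEndian.PutUint16(frame[12:14], 0x0806)

	// ARP payload.
	binary.BigEndian.PutUint16(frame[14:16], 1)      // hardware type: Ethernet
	binary.BigEndian.PutUint16(frame[16:18], 0x0800) // protocol type: IPv4
	frame[18] = 6                                    // MAC address length
	frame[19] = 4                                    // IPv4 address length
	binary.BigEndian.PutUint16(frame[20:22], 2)      // operation: reply (unsolicited)
	copy(frame[22:28], mac)                          // sender MAC: the new leader
	copy(frame[28:32], ip.To4())                     // sender IP: the VIP
	copy(frame[32:38], broadcast)                    // target MAC: everyone
	copy(frame[38:42], ip.To4())                     // target IP == sender IP: gratuitous

	return frame
}

func main() {
	mac, _ := net.ParseMAC("de:ad:be:ef:00:21")
	frame := buildGratuitousARP(mac, net.ParseIP("10.0.2.5"))
	fmt.Printf("% x\n", frame)
}
```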
So we've discussed the clustering technology that powers kube-vip, and the networking protocols we use to update the network and allow access in. Now let's discuss how you actually get kube-vip deployed. Using the leader election that's part of the Kubernetes API means we now run kube-vip inside Kubernetes; Raft support has largely been deprecated because it proved unstable. Kube-vip now runs as a pod inside Kubernetes, which connects up to the API server, participates in the leader election, and so on. There are two common methods, either a static pod or a DaemonSet, and both come with their own unique quirks.

Static pods first. Originally with kube-vip I was using kubeadm, and I still am, but that comes with a chicken-and-egg scenario: kubeadm init has a check that tries to connect to the virtual IP, this floating IP address, as part of the installation procedure. What that means is that kube-vip needs to be stood up and advertising the virtual IP address before kubeadm init runs, otherwise that check fails. But how can I deploy a pod to a cluster before the cluster is installed? This is where the static pod mechanism comes in.

What we can do is piggyback on the behaviour of kubeadm init: kubeadm init populates a number of manifests in the Kubernetes manifests folder, the kubelet is then started, and the kubelet starts all of the static pods whose manifests are in that folder. So we can add our own static manifest for kube-vip to the manifests folder before we run kubeadm init. (There's a small sketch of this step at the end of this section.) This allows kube-vip to sit alongside all of the control plane components: the API server will start, the scheduler will start, etcd will start, and kube-vip will start at the same time, which means it can connect to the API server and expose that control plane virtual IP address to the outside world.

A DaemonSet is much simpler. In this example we're using K3s from Rancher, and we've given it one additional piece of information, the --tls-san flag, which means K3s will add an additional control plane IP address to the certificates for the API server. We can stand the cluster up, and once it's actually up and running, we can do a kubectl apply of the kube-vip manifest as a DaemonSet, bring up the additional control plane nodes, and it will scale across them, come up, and start advertising that 10.0.2.5 address as a highly available virtual IP. So it really depends on the Kubernetes distribution you're deploying, or the installation mechanism you're looking at going down. Some of the Cluster API providers using kube-vip today rely on kubeadm as part of the bootstrap process, so they typically deploy that kube-vip manifest into the manifests folder as part of the provider functionality.

As I mentioned, with these technologies we now have all of the components required to provide high availability. We have the clustering algorithm that ensures that on node failure or upgrade another node will take leadership and continue to receive traffic, and the network will be updated should there be any changes. So what does it all actually look like? In this example we have our three control plane nodes. We no longer require additional nodes in front to provide high availability, so we've already reduced the number of nodes required. The three nodes are all running the control plane components, and they all have the static pod manifest, which means kube-vip is running on them. We can also see the first node has been elected leader, which means it carries the virtual IP address. In the event that node one is removed, or as part of an upgrade procedure, its kube-vip pod is terminated, at which point the other nodes in the cluster run that leader election again and one of them is given the lease. Once it has the lease, it starts advertising: "to get to the control plane, I have the VIP, send traffic to me." From an end-user perspective you may see one or two lost pings during the failover, but it is very quick, in some cases almost instantaneous: the leader election happens as soon as the old node goes away, and the ARP broadcasts update the network immediately.
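Coming back to the static pod trick, in code the whole mechanism boils down to writing one file before kubeadm init runs. This is a deliberately minimal sketch: the manifest contents, image tag, and argument are illustrative assumptions rather than the real generated manifest, which kube-vip produces itself with its full configuration (interface, address, ARP versus BGP, leader election settings):

```go
package main

import (
	"log"
	"os"
)

// A placeholder static pod manifest for kube-vip. The fields here are
// illustrative only; the real manifest carries the full configuration.
const manifest = `apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-vip
    image: ghcr.io/kube-vip/kube-vip:v0.6.4
    args: ["manager"]
`

func main() {
	// The kubelet watches this directory and runs anything placed in it,
	// which is exactly how kubeadm starts the API server, scheduler, and etcd.
	if err := os.MkdirAll("/etc/kubernetes/manifests", 0o755); err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("/etc/kubernetes/manifests/kube-vip.yaml", []byte(manifest), 0o600); err != nil {
		log.Fatal(err)
	}
	// From here, `kubeadm init` can run: its connectivity check against the
	// virtual IP succeeds because kube-vip comes up with the control plane.
}
```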
From a BGP perspective, we have either a DaemonSet or, as here, static pods. All three nodes carry the VIP, but they carry it on an internal adapter, which means that 10.0.2.5 address isn't directly exposed on the outside network; otherwise we would end up with conflicting IP addresses. That's something to be aware of, but it's not something that will trouble you, due to this architecture. What it means is that a client that wants to connect to the Kubernetes control plane on 10.0.2.5 connects through the router, and the router sends the traffic to any one of the nodes that is peering, so we get load balancing from that router automatically. In the event that one of the nodes dies, and we can see here node one has become inaccessible, that node stops advertising and stops being part of the peering for the routes to that IP address, at which point the router no longer has that route and no longer sends traffic to that node. As soon as the node disappears from the peering, traffic is no longer sent to it, which gives us near-instantaneous failover in the event of upgrades or failures.

So we've covered the highly available part of kube-vip and how it typically works: it sits alongside the control plane components, participates in the leader election, advertises a control plane virtual IP address to the outside world, and updates the network in the event of failure. Taking it to the next step was effectively using those same technologies for Kubernetes services.

There are two components typically required to provide that functionality. One is a CCM, a cloud controller manager, and the second is something to provide the networking magic, which in this example is kube-vip. So what is a cloud controller? A cloud controller is effectively the secret sauce when you want to run a Kubernetes cluster on your own, or other people's, infrastructure. In most cloud providers it provides the translation layer between Kubernetes objects and the infrastructure where the cluster is actually running. What does that mean? It means that when I want to do something in a Kubernetes cluster, the CCM can translate that into the infrastructure underneath. For instance, on AWS, Google Cloud, and other cloud providers, the cloud controller lets us speak directly to the provider's infrastructure: if I require an external IP address, an EIP and things like that, the cloud controller can speak to the provider's APIs and get that for us, specific to that infrastructure. A CCM for your own infrastructure is slightly different. Everybody, especially on-prem, has different topologies, different network ranges, different architectures. So your own CCM needs to be very flexible across a large number of different infrastructures, it needs to be configurable for different networks and network ranges, and ideally it should be capable of plugging into things like existing IPAM or other infrastructure management tooling.
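To make that concrete, the core of an on-prem CCM's job for LoadBalancer services is small: hand out an address from somewhere and write it onto the Service so the networking layer can act on it. Here's a deliberately tiny sketch of that idea, assuming a flat address pool; the real kube-vip cloud provider, or any serious CCM, tracks allocations properly and would defer to proper IPAM:

```go
package main

import (
	"context"
	"log"
	"net/netip"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// needsAddress reports whether a Service is of type LoadBalancer and has
// not yet been given an external address.
func needsAddress(svc *corev1.Service) bool {
	return svc.Spec.Type == corev1.ServiceTypeLoadBalancer && svc.Spec.LoadBalancerIP == ""
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// An illustrative pool start; a real controller persists allocations.
	next := netip.MustParseAddr("192.168.0.220")

	ctx := context.Background()
	svcs, err := client.CoreV1().Services(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for i := range svcs.Items {
		svc := &svcs.Items[i]
		if !needsAddress(svc) {
			continue
		}
		// The "translation": turn a plain LoadBalancer Service into one
		// carrying a concrete address from our infrastructure.
		svc.Spec.LoadBalancerIP = next.String()
		if _, err := client.CoreV1().Services(svc.Namespace).Update(ctx, svc, metav1.UpdateOptions{}); err != nil {
			log.Print(err)
			continue
		}
		log.Printf("assigned %s to %s/%s", svc.Spec.LoadBalancerIP, svc.Namespace, svc.Name)
		next = next.Next()
	}
}
```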
There is work taking place at the moment in the Cluster API project on an IPAM controller, which would typically let you give it knowledge of your network ranges, and it will provide IPAM functionality.

So what does a CCM do from a Kubernetes services perspective? Here I'm doing a kubectl expose of NGINX, creating a LoadBalancer service. If we look at what has been created, we can get the service and describe it, and we can see that we have a service but it has not been given a load balancer IP address. As I was saying with the EIP from AWS, the CCM's role is effectively to link the two together. So what happens here? We have kube-vip deployed on our workers; in this example it can be deployed as a DaemonSet, or as a ReplicaSet tied to a specific set of nodes, for instance. We have done the expose, and we have a CCM running. Kube-vip has its own CCM, but kube-vip is deliberately designed in a way that doesn't tie it to any particular CCM, and I'll show why in a second. This allows end users to create their own CCMs, and a number of people have done that: Harvester has its own CCM that kube-vip works with, Equinix Metal has a CCM that kube-vip works with, and kube-vip has its own CCM as well. As mentioned, the CCM's role is really to be that link to the infrastructure, and when we're talking about load balancer services, its role here is just to update the spec and give the service its load balancer IP address.

So how does kube-vip actually work with that? Kube-vip has what's known as a watcher inside it, which is code that speaks to the Kubernetes API and watches for changes to Kubernetes objects. When the CCM populates the service with a load balancer IP address from the network, kube-vip sees that the address has been added to the service, and it then uses those same technologies, either ARP or BGP, to advertise it to the outside world. That means any traffic to that load balancer IP address is sent to the kube-vip pod, hits the services network, and is sent on to the pods that are part of that service.
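Here's a minimal sketch of that watcher pattern using a raw client-go watch. A production controller would use informers with re-list and retry, and the advertise function below is a hypothetical stand-in for the ARP/BGP side:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// advertise is a hypothetical stand-in for the ARP/BGP machinery: this is
// where the address would start being announced to the network.
func advertise(ip string) { log.Println("advertising", ip) }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch Services in all namespaces for changes.
	w, err := client.CoreV1().Services(metav1.NamespaceAll).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for event := range w.ResultChan() {
		svc, ok := event.Object.(*corev1.Service)
		if !ok || svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
			continue
		}
		switch event.Type {
		case watch.Added, watch.Modified:
			// The CCM has filled in an address: start announcing it.
			if ip := svc.Spec.LoadBalancerIP; ip != "" {
				advertise(ip)
			}
		case watch.Deleted:
			log.Println("service removed:", svc.Name)
		}
	}
}
```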
Additionally, kube-vip was then updated to work in a hybrid mode, mainly for edge deployments. This allows kube-vip to do both HA and services from the control plane. The control plane nodes all participate in the leader election for the highly available VIP, so traffic to the control plane goes to the kube-vip pods; but when traffic comes in for a Kubernetes service, it also hits those pods and is pushed onto the services network, where it is sent on to the relevant pods. Effectively, this lets us do not only HA but also Kubernetes services on the control plane, for hybrid and small deployments.

One additional feature we added was DHCP load balancers. Again, this is useful for edge deployments. In the example there, we've specified a load balancer IP address of 0.0.0.0, which the Kubernetes API will happily accept as an address for a service even though it isn't a usable one. What actually happens is that when we specify a service with that address, the kube-vip pod makes a DHCP request to the network and is given an IP address by a DHCP server on that network. In this example we have something like a home router, which typically provides DHCP inside your house or an office, and it has given kube-vip an address ending .123. Kube-vip then updates the service with that IP address and uses it to expose the service to the outside world. So we effectively get the topology information for free from the DHCP server. This works great in small networks and edge environments, where we can leave all of the IPAM knowledge to whatever is providing DHCP on that network.

So that's some of the additional functionality around kube-vip. We've covered a bit of the backstory, we've covered how high availability works, and we've covered how CCMs work and how kube-vip can use that information to expose access into the services network for services of type LoadBalancer. That's where kube-vip is at the moment.

From a roadmap perspective: a few months ago we submitted kube-vip to the CNCF sandbox, and it has now been accepted as a sandbox project, which is fantastic and something I'm very proud of. On the new features side, we're looking at improved control plane load balancing. At the moment, control plane load balancing is not enabled by default; we're looking at either IPVS or Maglev to load balance across the control plane nodes. ARP doesn't provide load balancing across multiple nodes, which effectively means that whichever node is the leader receives all the traffic, so we're looking at methods to distribute load even when using ARP, and work has already happened there. Enhancements to BGP: we've already started work on observability and monitoring. OSPF is an option for routing traffic. External DNS updates for services: work has already begun here, where we can do a kubectl expose and kube-vip can then update external DNS providers with both the VIP and the service name, subdomain, and so on; we've already done proofs of concept for BIND and Cloudflare. Then there are additional improvements around IPv4 and IPv6, and a lot of work around documentation updates.

All of the documentation is at kube-vip.io, and all of the code is in the kube-vip repositories. With that, thank you very much. If you are using kube-vip, I hope you enjoy it and that it's doing what it's meant to be doing for you. If not, please raise issues and let me know. Thank you very much, and enjoy load balancing with kube-vip. Thank you.