So, today I'm going to talk about MetalLB. MetalLB is a CNCF sandbox project that allows us to have services of type LoadBalancer on bare metal. Some quick words about me: I'm Federico, I'm part of the networking team in charge of making the OpenShift platform suitable for telco workloads. During my time there I contributed to a variety of Kubernetes-related and network-related projects, and I've been maintaining MetalLB for a couple of years.

In order to understand how MetalLB works, I'm going to introduce, or reintroduce, the concept of services. Let's say you have your application running in multiple replicas inside the cluster, and you want to load balance the traffic directed to those endpoints. The construct that Kubernetes gives us is a Service: you get an internally accessible virtual IP, the cluster IP; a given pod can access the cluster IP, and Kubernetes networking will load balance the traffic among all the different endpoints. It's a bit more complex than this, with various fields like IP families, protocols, ports and so on, but that's the gist of it.

But what if your application, still running on the cluster in multiple replicas, needs to be advertised outside of the cluster? One of the constructs we have is a service of type LoadBalancer. This is taken from the Kubernetes documentation: a service of type LoadBalancer "exposes the Service externally using a cloud provider's load balancer. NodePort and ClusterIP Services, to which the external load balancer routes, are automatically created." The emphasis here is on the fact that it leverages a cloud provider's load balancer. What does this load balancer give us? First, an external IP: something accessible from the external network, the internet, that we can give out to reach our application. The second thing the cloud provider gives us is the load balancing itself: somebody tries to access your virtual IP from outside, and the cloud provider's network infrastructure multiplexes the traffic reaching that virtual IP among all the nodes of the cluster. Something I want to emphasize here: once the traffic reaches a node, all the rest is handled by the cluster's CNI. The role of the load balancer ends when the traffic reaches one of the nodes.

So a service of type LoadBalancer gives us a couple of things: a stable IP to reach our application from outside, which we can pin a DNS entry to, and the load balancing across the different nodes.

Now let's have a look at what happens on bare metal. On bare metal, we don't have anyone giving us an external IP, so the first thing that happens is that our external IP stays in pending. We don't have an IP to give out; and even if we had one, the other missing piece is the network infrastructure that redirects the traffic for the virtual IP to the nodes of the cluster, so that the CNI can do its part. These two problems, assigning the IP and routing the traffic to the nodes, are the problems MetalLB tries to address. MetalLB is a load balancer implementation for bare-metal Kubernetes clusters, using standard routing protocols.
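To make this concrete, here is a minimal Service of type LoadBalancer (the names and ports are placeholders). On a bare-metal cluster without MetalLB or anything similar, its external IP will stay in pending forever:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app        # load balance across the ready pods with this label
  ports:
    - port: 80         # port exposed on the external IP
      targetPort: 8080 # port the application pods listen on
      protocol: TCP
```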
The first thing I want to mention about MetalLB is that, sadly, or at least it was weird to me to discover, MetalLB is not a network load balancer, meaning that it doesn't implement the part that takes the connection and redirects it to the nodes of the cluster. But this doesn't mean it's useless. So let's have a look at those two issues and see how MetalLB solves them by leveraging the external network infrastructure in, I think, a quite elegant way.

The first part is the address assignment. It's probably the most boring and least networking-heavy part. We have a regular Kubernetes controller: it listens for the configuration and the services, gives each service an IP, and reclaims the IP when the service is deleted. But which IPs are we talking about? Here we are not in a cloud environment; the only one in control of the infrastructure is the cluster administrator. So MetalLB must be instructed about which IPs are available. This is done via the IPAddressPool CRD: you can have a CIDR, a set of IPs, a range; you can have IPv4 and IPv6 addresses.

You can pin a service to a specific IP. This is done in two ways: the spec.loadBalancerIP field, which is now deprecated but still working, or an annotation. This is how you say: hey, I want my service to have this particular IP. Or you can ask for an IP from a given pool, still in the service definition. From the cluster administrator's point of view, if you implement some kind of multitenancy, you can say that a particular address pool is reserved for a subset of the namespaces, or go even further and say that a particular set of IPs is reserved for a particular set of services. A couple of things to note: no selector means the IPAddressPool can be applied to all the services, and there is a priority field that needs to be taken care of. Also, if a service asks for a specific IP but the pool configuration says it can't have that IP, the service IP will stay pending.
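As a sketch of what this looks like, here is an IPAddressPool restricted to one tenant namespace, plus a Service pinned to a specific IP through the annotation (the pool name, addresses and namespace are made up for illustration):

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: tenant-a-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.0/24             # a CIDR...
    - 192.168.20.1-192.168.20.10  # ...or an explicit range
  serviceAllocation:              # optional multitenancy restrictions
    priority: 50                  # used to pick among multiple matching pools
    namespaces:
      - tenant-a
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: tenant-a
  annotations:
    # replaces the deprecated spec.loadBalancerIP field; the
    # metallb.universe.tf/address-pool annotation asks for a pool instead
    metallb.universe.tf/loadBalancerIPs: 192.168.10.100
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```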
Then let's talk about the other part, the advertisement, which is the more network-heavy part. Again, we have somebody trying to access the virtual IP from outside, we need a way to attract that traffic to our nodes, and there is something in the middle. Another thing MetalLB must do is honor the local external traffic policy (externalTrafficPolicy: Local), meaning that if the pods are running only on a subset of the nodes, MetalLB must be smart enough to attract the traffic only to the nodes where an endpoint is running.

There are two ways to attract the traffic, two advertisement modes. One is L2: it's quite simple, but it requires some gymnastics in the network, or having the client and the cluster on the same subnet. The other is BGP, which is more complex and more powerful, and requires interaction with a BGP-enabled router.

Let's start with L2 mode. Again, the easiest setup is to have the client on the same subnet as the cluster. The way it works is more or less similar to what keepalived does. The client tries to find out who owns this particular IP and sends out an ARP request. The ARP request gets to all the nodes of the cluster. MetalLB, for a given service, elects a leader: the node that is going to reply to that ARP request. This is on a per-service basis; different services might have different leaders. The ARP reply gets to the client, and the client is able to reach the service.

One thing to note is that MetalLB replies with the MAC address of the interface it received the request from, so if the node has multiple interfaces, it still works. What happens when a node fails? Failover: a new leader is elected, it sends out a gratuitous ARP that reaches the client, and the client is able to reach the service again. The traffic gets to the new node, and all the rest is done by the CNI.

A couple of notes: the speaker listens on all the interfaces unless you set an interface selector, and we don't assign the IP to any interface on the host. Only one node is active, so it's not really load balancing; it's more a LoadBalancer service implementation. When a failover happens, there is a new election and a gratuitous ARP.

The configuration is pretty easy: address pools and a simple L2Advertisement. An empty one says: I want all the services to be advertised via L2. You can instead say: I want only the services whose IP comes from this set of IP address pools to be advertised via L2. We can also have a node selector, so if only a subset of the nodes belongs to a given subnet, MetalLB will elect the leader to reply to the ARP requests among the nodes matching the node selector. And you can specify a set of interfaces, so if you have a fancy combination of interfaces that might cause trouble, this is how you instruct MetalLB to reply only from a subset of them.

A couple of notes here too. The interface selector doesn't influence the way the leader is elected, so if you specify an interface selector matching no existing interfaces, the service won't be announced. If you don't specify any address pool selector, the L2Advertisement is applied to all the IP address pools. If multiple L2Advertisements match a given IPAddressPool, they are merged together: the interfaces are the union of the selected interfaces, and the nodes are the union of the selected nodes. So again, with L2 you don't get real load balancing; it's somewhat similar to keepalived, but it works.
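Putting the L2 knobs together, a minimal advertisement and a more constrained one might look like this (the pool name, node label and interface name are illustrative):

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: advertise-everything
  namespace: metallb-system
spec: {}  # empty: advertise every pool, from every node, on all interfaces
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: advertise-tenant-a
  namespace: metallb-system
spec:
  ipAddressPools:
    - tenant-a-pool   # only IPs coming from this pool
  nodeSelectors:      # elect the leader only among matching nodes
    - matchLabels:
        network-zone: front
  interfaces:         # reply to ARP requests only from this interface
    - eth1
```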
Now, BGP mode. This is taken from the RFC: "The primary function of a BGP speaking system is to exchange network reachability information with other BGP systems." This is exactly what we need: we need to exchange the reachability of our service IP through our nodes. In BGP mode, each node acts as a mini router: it establishes a BGP session upfront with an external router, and MetalLB must be instructed about the presence of that router. When we want to announce a service, BGP messages are sent to the router saying: if you want to reach the service IP, the next hop is the IP of this node, and the same from each of the other nodes. The router's routing table will then look something like this: to reach the service, you have this set of next hops. When a client tries to reach the service, we now get real load balancing, provided by the router: the routes are ECMP (equal cost), it's active-active, so the traffic will sometimes go to one node and sometimes to another. Once the traffic gets to the node, all the rest is done by the CNI. And of course you can have more complex scenarios: multiple routers, spine-and-leaf topologies, and all the fancy stuff that BGP allows us to do.

How do we do the configuration? We still need an IPAddressPool to give the IP to the service. Then we need to instruct MetalLB about the presence of a given peer, with the address, the autonomous system numbers of MetalLB and of the peer, and some other details about the BGP session. And then you need to tell MetalLB to advertise the services via BGP. The simplest way, as with L2, is an empty BGPAdvertisement, which tells MetalLB to announce all the services via BGP.

But BGP offers a lot of extra configuration. We have a node selector for the BGP session, in case your router is reachable only from a subset of the nodes and you don't want to waste resources trying to connect to a router that is not reachable: with the node selector, the BGP session will be established only by that subset of the nodes. We have communities and local preference, which allow fancy BGP configurations, like advertising the fine-grained /32 route to the local router and a coarse-grained aggregate to the external one. We support BFD: the fastest failure detection you can get from plain BGP timers is around three seconds, and BFD is much faster than that, so by specifying a BFD profile the BGP session will be backed by a BFD session to provide faster failover and broken-link detection. As with L2, we can tell MetalLB to advertise via BGP only the IPs coming from a given set of pools. We have node selectors on the advertisement side too; this is orthogonal to the BGP session node selector: if we want to advertise some IPs only from a subset of the nodes, there is a node selector in the BGPAdvertisement. And we can say: I want to advertise this set of IPs only to a subset of the peers, instead of all of them.

Again, some quick notes: no IP address pool selector means the BGPAdvertisement is applied to all the services, and when multiple BGPAdvertisements match the same IPAddressPool, they are combined and applied together.

A quick recap of BGP mode. You now get active-active load balancing; it's handled by the external routers, not by MetalLB, but it's still active-active load balancing. It needs some extra configuration. We support BFD. One thing to note, as of today (more on this later), is that MetalLB is not meant to accept incoming routes, so it rejects them by default. We have a ton of configuration knobs and node selectors, and we support iBGP and eBGP, single and multi-hop.

Of course, all of this can be mixed together. MetalLB is meant to be simple to configure when you stick to the defaults: you just configure an IPAddressPool and an empty L2 or BGP advertisement. But complex setups can be achieved with all the knobs we just spoke about. Those extra knobs came out of long discussions with the community and with MetalLB users, and they evolved over time. For example, with the current configuration we can express some nice things: announcing a given service both via L2 and via BGP; announcing via L2 but only from a subset of the nodes and only from a given interface; announcing only to a given BGP peer and only from the nodes with a given label; announcing all the services to all the peers but only some services to a subset of the peers; announcing all the services of a given tenant only via a specific BGP session; and so on. This is to say that MetalLB now has a very comprehensive configuration surface that should cover the majority of needs, and if not, just come and file an issue.
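A sketch covering several of these knobs: a BGPPeer (note the v1beta2 API version), an optional BFDProfile backing the session, and a BGPAdvertisement restricted to one pool and one peer. The ASNs, addresses, labels and names are all made up:

```yaml
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-router
  namespace: metallb-system
spec:
  myASN: 64512              # the AS number MetalLB speaks from
  peerASN: 64513            # the router's AS number
  peerAddress: 172.30.0.3
  bfdProfile: fast-failover # back the BGP session with BFD
  nodeSelectors:            # only these nodes open the session
    - matchLabels:
        rack: rack-1
---
apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: fast-failover
  namespace: metallb-system
spec:
  receiveInterval: 300  # milliseconds
  transmitInterval: 300
  detectMultiplier: 3   # declare the peer down after 3 missed packets
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: advertise-tenant-a
  namespace: metallb-system
spec:
  ipAddressPools:
    - tenant-a-pool  # only IPs from this pool...
  peers:
    - tor-router     # ...and only towards this peer
  communities:
    - 64512:1234     # attach a BGP community to the advertised routes
```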
Now I'm going to talk a bit about the architecture; let's have a look under the hood of MetalLB. We have two categories of pods. One is the controller. The controller is the boring one: it listens for the services and the IP pools, allocates the IPs to the services, and takes an IP back when its service dies so it can be given to another service. The speaker, on the other hand, is a DaemonSet, one instance per node. It's a host-network pod, since it has to mess with the host network, and it handles the IP announcement.

In L2 mode, we have an ARP responder: the speaker listens for services, performs the leader election for each service, and creates an ARP responder for each service it leads.

For BGP, we have several implementations. There is the native one, the original: a subset of the BGP protocol implemented in Go. It listens for services plus the MetalLB configuration and drives the BGP session accordingly, which is what we described in the previous slides. Then there is FRR mode, something we introduced a couple of summers ago. FRR is a real software router implementation; it's a very popular project, rock solid and super stable, and it implements a variety of networking protocols, including BGP and BFD among others. FRR mode is the BGP implementation we are currently investing our efforts into. In FRR mode, the speaker listens for the services and the MetalLB configuration, generates an FRR configuration, and reloads it inside FRR, so FRR can interact with the router and implement all the protocols for us. With this switch, all it takes to implement something new that FRR already supports, and FRR supports basically everything, is to find the right FRR configuration and translate the Kubernetes API into it. This allowed us to implement BFD and IPv6 support quite quickly, for example. The native Go implementation is now more in maintenance mode; FRR is where we are putting the majority of our efforts.

And this is more or less what MetalLB is about as of today, and how it evolved based on user requests. But now I want to talk about what's next. Of course I can't foresee the long-distance future, but I can talk briefly about what we've been working on over the last six months, which again came out of community requests. This request comes up pretty often: users saying, hey, now that you are running FRR, and FRR is a full-fledged router, can you do this and this and this with MetalLB? Can you accept routes into your nodes with MetalLB? I tend to push back, because MetalLB's purpose is only to announce services. But given that users kept requesting it, and that the alternative would be running multiple FRR instances on the same node, we started thinking about ways to share the same FRR instance between MetalLB and other consumers. And this is efficient, because you have a single FRR instance and a single BGP session with the router that can do multiple things.

This is now available as an experimental mode in MetalLB. The way it works: you have the usual MetalLB APIs, plus a new FRRConfiguration CRD, which is read by an FRR daemon that can be deployed standalone or as part of MetalLB. This daemon reads the FRRConfiguration, and FRR is configured to interact with the router. At the same time, a user can come and apply their own FRRConfiguration, and as long as it is compatible with the one generated by MetalLB, it gets merged in to do more stuff. And this is FRR-K8s.
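A sketch of the kind of user-supplied FRRConfiguration this enables, in this case telling the FRR daemon to accept every route from a peer, and only on one node. The field names follow the FRR-K8s API as I understand it, and the namespace, ASNs and address are assumptions:

```yaml
apiVersion: frrk8s.metallb.io/v1beta1
kind: FRRConfiguration
metadata:
  name: accept-incoming
  namespace: frr-k8s-system  # wherever the FRR-K8s daemon is watching
spec:
  bgp:
    routers:
      - asn: 64512
        neighbors:
          - address: 172.30.0.3
            asn: 64513
            toReceive:
              allowed:
                mode: all  # accept every route this peer advertises
  nodeSelector:            # apply this snippet only on matching nodes
    matchLabels:
      kubernetes.io/hostname: kind-worker
```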
FRR-K8s is a spin-off of MetalLB. It is deployed directly by MetalLB if you choose the experimental FRR-K8s mode, and it can also be deployed as a standalone component.

Now I'm going to show a very quick demo. First, let me describe the development environment. It's a kind cluster: each node is a Docker container connected via the Docker network, and there is an extra Docker container running FRR that mimics an external router. The way I spin this up is the inv dev-env command in the MetalLB repo, which does all the setup for you if you want to tinker with MetalLB; you can specify the backend (native, FRR or FRR-K8s) and the protocol you want to test, BGP or L2. With BGP, it also spins up the external FRR container.

We can have a look at the external router's configuration: the external container is meant to be peered with the three nodes, these are the IPs of the three nodes, and it is trying to advertise a couple of prefixes, which MetalLB doesn't want to receive. If I look at the BGP summary, the status is Active, meaning the external container is trying to connect and failing, because we don't have a configuration yet.

Let's have a look at the configuration: a BGPPeer with the IP address of the external container, an IPAddressPool, and an empty BGPAdvertisement, which is the bare minimum configuration we can have. I can also look at the services: we have an NGINX instance with the external IP pending, again because there is no configuration yet. Now I create the configuration, and our external IP has a value. MetalLB connects, the session comes up, and the virtual IP is reachable from all three nodes.

Now I want to quickly show the new FRRConfiguration. This one tells the FRR daemon to accept all the routes, which is something MetalLB doesn't allow by default; it is going to be merged with the configuration generated by MetalLB, and it has a node selector matching only kind-worker. Let's look at the routes: the external router was still trying to inject those prefixes, and once I create this extra configuration, those routes become available on the node. This is it for the demo. You can run it locally; again, the development environment is pretty easy to spin up.
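For reference, the bare-minimum configuration applied in the demo boils down to something like this (the peer address stands in for the external FRR container on the Docker network; all values are illustrative):

```yaml
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: external-router
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64513
  peerAddress: 172.18.0.5  # the external FRR container
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: demo-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.0/24
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: demo-adv
  namespace: metallb-system
spec: {}  # empty: announce every service via BGP
```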
The last thing I want to mention is that we try to maintain MetalLB as a contributor-friendly project. There is a community and contributing section on the official website. We are active on the metallb-dev and metallb Slack channels on the Kubernetes Slack. We try to keep the list of issues well groomed with "help wanted" and "good first issue" labels, so if you want to work on something, or you think a feature is missing, just file an issue and we'll try to start a discussion. MetalLB is not a huge project, but you can have an impact on a good number of users, from home labs to big enterprise environments. So it's a nice way to get into the Kubernetes ecosystem and contribute to something Kubernetes-related without being too intimidated. And I'm on the Kubernetes Slack: feel free to reach out, I'd be happy to guide you through your first contribution.

And with that, I guess we can wrap it up. MetalLB addresses two major tasks: the IP allocation and the IP advertisement. With L2 you don't get real load balancing, only failover, so you get high availability but not traffic distribution. BGP is more complex and more powerful, but it requires interaction with a router. We have upstream documentation, which we try to keep up to date, a troubleshooting guide, and a few tools. And if you want to learn more about FRR, their website is very well documented. With this, I'm done. If you want to reach me, my Twitter handle is fedepaul, fedepaul is my Gmail handle, and I'm fedepaul on the Kubernetes Slack, LinkedIn, whatever. If you have questions about MetalLB, I'd be happy to answer.