That's a very graceful voice. See, the clicker works. Can you guys see the shared screen, just making sure? Yep. All right, awesome, thanks. So let me introduce the folks that put this together. I have Maciej, who together with me is a solutions architect working for Red Hat. We work a lot with customers, we've done hundreds of POCs, and Maciej is a really strong networking guy; when we asked him to join the team, we knew he was very strong in that area. And we also have about the best expert you can get on the phone, Emilien Macchi, a senior principal software engineer. He's going to be presenting part of this: the two of us are going to focus on the OpenStack piece, and then Emilien is going to focus on the Kubernetes part. Obviously this session is presented by Red Hat, but we're going to try to make it as open source as possible, since this is a meetup. There are going to be some pieces that are being developed from the product perspective, so just keep that in mind. Anything else for the introduction? And again, please ask us questions. This is a meetup, so it should be as interactive as possible. I wish you were here, you could throw bananas at us; we have some tomatoes in the back. All right. So I think we're going to go pretty deep technically, but I want to start with a level set, and I don't know how many people are super technical on this call, so I'm going to go over this quickly. Think of this as the evolution of the data center. How it all started: we had just a bare metal box, we would put it in the rack, we would put some Cisco or Juniper switches on top, then the operating system would land on it, Windows or Linux, and then your database or web server or whatever it is. Over time all of that became abstracted. I think IBM was the first company who invented virtualization, with the IBM LPARs, so they were able to slice a single piece of silicon, if you will, into multiple virtual slices. But really, VMware is the company who took it to market, and pretty much even to this day they own the majority of the hypervisor market. So this changed things: there was still the physical compute box with some traditional networking, and then there was an influx of traditional storage, like SANs or Fibre Channel, and then you would slice that physical server into multiple pieces, put the OS and then the application on top. And then the next step in the evolution was the cloud. AWS came out and they said, hey, we'll give you everything as a service, everything is software defined, everything is as code. They came up with this really cool stuff and everyone liked it, and then there was this company called Rackspace, together with NASA, and they said, hey, we want that, but in the private cloud. So they took that concept and detached all of that proprietary software from the proprietary hardware, so all the networking, storage and compute became software defined.
And then Kubernetes came in and said, hey, let's abstract away all of this OS stuff, we don't care about any of it; let's just orchestrate the application in containers. So that's the very right side of the slide. But AWS always did it better, right? Because they always had the concept of availability zones, the ability to split this architecture into multiple failure domains. So it was hard, at least from our perspective, to attract customers or users into really large deployments, because we were always layer 2, everything over layer 2. One part of your infrastructure fails and the whole thing dies. So what do we do from the private cloud perspective? With OpenStack, and then Kubernetes, which we're going to talk about, we take these boxes and pretty much distribute them across availability zones, across a layer 3 network domain. All right, so quickly, in case there's someone on the call who's not super familiar with AWS or public cloud, these are some of the core services you can consume today in AWS. On the compute side you have EC2, EKS, Lambda. On the networking side you have VPC, ELB, plus services like KMS. There's a bunch of these services that Amazon came up with, and then the open source community... Hello. Hey, how's it going? Hey, good, how are you? Good. Please help yourself to the food. We just started, so you haven't missed much. Yeah, so Amazon came up with all these services and they keep adding them; today, if you go to the AWS website, there are probably hundreds of different services. The open source community has been trying to match as many of them as it could. Obviously it's impossible to match all of them, but between OpenStack and Kubernetes we were able to map a bunch of these services to open source projects. Let's say Heat is your equivalent of CloudFormation, Cinder is your equivalent of EBS, Octavia is your equivalent of ELB, et cetera. I can share these slides if you're interested, and you can see how I mapped everything together. But this is the ecosystem that allows you to cover the majority of the data center as-a-service pieces. I'm going to pause. Anyone have any questions, or so far so good? Does that make sense? Yeah, we're starting at a pretty high level, but we're going to go much deeper. So, OpenStack and Kubernetes: they're a match made in heaven. They aim at solving different use cases and different problems, although there's a lot of overlap between them. We're going to start with distributed OpenStack deployment topologies, kind of an evolution again. Version number one, and this is something that, from the Red Hat product perspective, we've been able to do since OSP 13, I think, which was Train? No, Train is 16. Queens. Okay, thank you.
So what we did here is we put the control plane in one layer 2 segment, and then split the compute resources, with their storage, and put them in additional layer 2 segments. So we were able to distribute the OpenStack architecture over layer 3, and this is an excellent architecture for geographically distributed use cases. The disadvantage is that if your central location, this AZ0, dies, the workloads in AZ1, AZ2 and AZ3 are still up and they can function and communicate, but you can no longer spawn new VMs or new volumes or new networks. So it has some issues. The advantage is that it's pretty resilient from the latency perspective: Red Hat supports pushing these AZ1, AZ2, AZ3 sites out to 100 milliseconds of round-trip latency. The team that Maciej and I work on does these hackfests every year, every new release, where we try to push our software to the limits. We were actually able to inject 250 milliseconds of latency between the sites and it could still operate pretty well. But obviously, if you're running production, you want to be in a supported architecture. Still a very cool architecture, and we actually have a lot of customers running this today; we talked to a customer today that is looking to build 100 DCN locations off a single control plane, replicating this architecture with, say, three-node DCN clusters across the U.S. All right, and then version two of that, and this is also something we've done for a customer. This customer was a financial institution and they were trying to get as many nines in their SLAs as possible; they actually came in and said, we need, I think it was six nines, for the applications. It was super important for them. So we came up with this architecture that looks similar to the one before; the difference is we stretched the control plane across multiple failure domains. So it was a layer 2 stretched control plane, but physically across multiple locations within a single data center, with different power delivery, so there was a good amount of resiliency built in there. And in this kind of architecture you can take Kubernetes, and Emilien is going to talk a little bit more about that, and also stretch it across multiple AZs. So you get the extra resiliency both for your VMs running on OpenStack and your containers running on top of Kubernetes. Again, the storage is local per AZ, so there's no storage traffic traversing the DC core, and it actually works really, really well. There's still a disadvantage though: this stretched L2 control plane, in this architecture for both OpenStack and Kubernetes, is not something everyone likes to do. It adds complexity to the life cycle of your cloud, because you have to ensure the L2 is working properly across these AZs. But it solved the use case, and now we're going to talk about version 3, which is the core stuff. I'm going to turn it over to Maciej.
Okay, so as Chris mentioned, in version 2 the control plane was stretched across sites but it was still using L2. In version 3, where we're going is we want to introduce layer 3 routing to both the control plane and the data plane. In version 2 we had distributed compute locations, so the data plane was already using layer 3, but the control plane was not there yet; we had to stretch that control plane using layer 2 networking, and that was the only way to make it work. In this version 3 of the distributed data center we're going pure layer 3. What does that mean? It means that both control plane and data plane networks are going to be routed across a dedicated set of routers, we call those spine routers, and each of the locations or sites or availability zones is considered a leaf. In such an architecture, east-west traffic always has to traverse the spine: if leaf 1 wants to send any control plane traffic to leaf 2, this traffic always traverses the spine, and the same concept applies to our data plane traffic, so we're using the same idea for distributing both control plane and data plane networks. There are multiple advantages to that. We are separating our failure domains; we have discrete broadcast domains at each location, which makes for easier troubleshooting; and the core technology we want to highlight, which we're using to make that happen, is dynamic routing using BGP. Now let's talk a little bit about how the layer 3 control plane is implemented. The controllers here are represented by those purple bubbles, and we're no longer limited by network boundaries: layer 2 networking isn't imposing limitations anymore, so we can scale much more easily. The way this is done is that Pacemaker, which is responsible for installing the VIP addresses on the active controller node, now installs those addresses on loopback interfaces. In the old deployment, OpenStack Train aka Red Hat OpenStack Platform 16, those VIP addresses were installed on the provider bridge responsible for the control plane network. This time we install the VIPs on a loopback interface, and an open source routing daemon redistributes those addresses into the BGP-enabled network where all the controllers, all the network nodes and all the compute nodes are listening; with respect to the control plane we only care about the controllers. So we have these VIPs advertised and readily available in our BGP-enabled network, and Pacemaker takes care of selecting which node becomes active by installing that VIP address on a loopback interface of the controller that is active.
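To make that mechanism concrete, here is a minimal sketch of the idea, not the actual Pacemaker resource agent or product tooling: the VIP and device name are made up, and plain `ip` commands stand in for whatever really installs the address. The point is only that putting or removing a /32 on a loopback is what drives FRR (assumed here to redistribute connected/local routes into BGP) to advertise or withdraw the route.

```python
import subprocess

VIP = "172.16.0.10/32"   # hypothetical internal API VIP
DEVICE = "lo"            # the VIP lives on a loopback, not on a provider bridge

def set_vip(active: bool) -> None:
    """Install or remove the VIP on the loopback of this controller."""
    action = "add" if active else "del"
    # Pacemaker decides which controller is active; on that node the /32 is
    # added to the loopback. FRR, redistributing connected/local routes into
    # BGP, then announces the /32 to the leaf/spine routers; removing it
    # withdraws the advertisement. check=False ignores "already exists" errors.
    subprocess.run(["ip", "addr", action, VIP, "dev", DEVICE], check=False)

if __name__ == "__main__":
    set_vip(active=True)
```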
How does the layer 3 data plane work? Here we're getting into a slightly more complicated scenario, and there's a helper to make it happen: this is where I wanted to introduce the OVN BGP agent. The OVN BGP agent has a very important function, and I'll talk in detail about exactly what it does in order to make the data plane networks advertisable over BGP. The main advantages, of course, are that we're distributing our data plane, we're not limited by network boundaries, and we can scale this out. So how does traffic processing work in this model? The data plane networks are the ones hosting your workloads, so your containers and your VMs attach to these networks. The provider bridges which are attached to these networks in OVN are not assigned any NICs. In practice that means traffic doesn't know how to reach that provider bridge, so we have to do some tricks using Linux networking. The first is to enable proxy ARP. Proxy ARP makes sure that ARP requests which would normally go unanswered, because there's no longer a shared layer 2 network to answer them on, are now answered by the kernel; there are no core changes in Neutron or OVN needed to make that happen. So if we turn proxy ARP on for that bridge, those ARP requests will always be answered. The same applies to the Neighbor Discovery Protocol for IPv6: if we're using dual stack, we enable the NDP proxy, and that makes sure those requests don't go unanswered either. This ensures that any request coming in on that bridge will be answered and handled appropriately. What this means in turn, and I'm going to talk about this a little more in detail, is that we have FRR running on our nodes, our controllers and our computes. FRR is an open source routing daemon taking care of BGP and a few other protocols that I'll explain in detail later. Ingress traffic processing: what happens when a new VM boots up? This is the heavy lifting that the OVN BGP agent does. The OVN BGP agent listens on the OVN southbound database, and each time a new VM boots and gets a floating IP address assigned, or just boots on a provider network and registers with its new IP address, the OVN BGP agent picks up on that change. The first thing it does is add a host route for that VM's IP address in a VRF that is attached to the provider bridge. That instructs the kernel to forward any traffic destined to that particular IP address to that bridge. So all the inbound traffic, for example HTTP requests coming in, destined for an IP address of interest belonging to a virtual machine or a container, now knows it has to get forwarded to that bridge. The next thing that happens, and this is analogous to what happens on our control plane, is that FRR, our BGP router, redistributes that host route into BGP and advertises it to all its peers. So now, in a layer 3 enabled data center with BGP as our routing protocol, all the requests destined to that particular VM or container get forwarded to the compute node that is hosting that workload. How does egress traffic processing work in this new layer 3 scenario? Again, the OVN BGP agent is our helper here, and the example I wanted to give is a simple HTTP request coming out of that VM. Since there is no NIC attached to that bridge, we have to do some interesting things. What the OVN BGP agent does is rewrite the destination MAC address of that packet: it installs a flow that rewrites the destination MAC address to that of the bridge, so that the outgoing traffic hits the bridge. When that outgoing traffic hits the bridge, rewriting the destination MAC address ensures that it becomes subject to the host's route table, and in our case we're using FRR to advertise default routes into the node, so that traffic can get out to its destination using those routes. If we didn't rewrite that MAC address, that traffic would get lost.
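As a rough sketch of what the ingress side of this looks like on a compute node, assuming a hypothetical bridge name and routing table number (this mimics the effect of the OVN BGP agent rather than quoting its code): enable proxy ARP/NDP on the NIC-less bridge, then point a /32 host route at it inside the bridge's VRF table so FRR can pick it up and advertise it.

```python
import subprocess

BRIDGE = "br-ex"     # hypothetical provider bridge (no NIC attached)
VRF_TABLE = "1000"   # illustrative routing table id associated with the bridge VRF

def run(cmd):
    subprocess.run(cmd, check=True)

def prepare_bridge():
    # Let the kernel answer ARP/NDP for addresses we know how to reach,
    # since there is no layer 2 segment behind this bridge anymore.
    run(["sysctl", "-w", f"net.ipv4.conf.{BRIDGE}.proxy_arp=1"])
    run(["sysctl", "-w", f"net.ipv6.conf.{BRIDGE}.proxy_ndp=1"])

def vm_started(vm_ip: str):
    # On a port event from the OVN southbound DB, add a /32 host route toward
    # the bridge in its VRF table; FRR then advertises it to its BGP peers.
    run(["ip", "route", "add", f"{vm_ip}/32", "dev", BRIDGE, "table", VRF_TABLE])

def vm_stopped(vm_ip: str):
    # Removing the route withdraws the BGP advertisement for that workload.
    run(["ip", "route", "del", f"{vm_ip}/32", "dev", BRIDGE, "table", VRF_TABLE])

if __name__ == "__main__":
    prepare_bridge()
    vm_started("172.24.4.50")   # example floating IP
```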
Let's talk a little bit about the components that make this solution possible. Free Range Routing, FRR. I've mentioned it a few times already, so I wanted to give you a bit more detail on what Free Range Routing is. FRR is a continuation, actually a fork, of an older open source routing project called Quagga, and you might ask: what do zebras have to do with routing? Well, not much, but in networking there is actually some synergy here. Quagga existed for a good number of years; I remember using it as long as 15 years ago, when I was learning BGP and OSPF. It's an open source routing daemon that's been around for a long time, except that it didn't see much development in the past few years. There were some bug fixes released but no new features were implemented; it was more in maintenance mode and it was losing interest. Then, about two years ago, someone decided that Quagga was a great code base that should live on, and Quagga still lives on in its own shape, but once it was forked and renamed to Free Range Routing, that's the analogy to chickens, that's when the project really picked up again. New features were implemented in it, for example BFD, ECMP, and IPv6 support for BGP. Quagga in its original form didn't have the rich feature set of routing protocols that FRR supports today. FRR works natively on Linux, where it performs best, but it also works on BSD with some caveats: not all the routing protocols work on BSD systems. In our case, in the context of OpenStack Wallaby, which is Red Hat OpenStack Platform 17, we will be relying on three core pieces of FRR: first is BGP, second is ECMP, and the third one is BFD. Before that, I wanted to mention that FRR is deployed in containers and it runs on all of the nodes in the OpenStack deployment, so it will run on compute nodes, controllers, and network nodes if you're using network nodes. For anyone who knew Quagga before: I learned networking on Quagga, it's a great tool to learn with, building home labs, I always enjoyed using it. So when I saw FRR, I saw it as a reincarnation of Quagga, and it warms my heart that there are developers again interested in pushing it forward, making bug fixes, and that the project lives on. It used to be, and continues to be, my favorite learning tool for networking and experiments. So, what routing protocols do we use in OpenStack Wallaby aka Red Hat OpenStack Platform 17 to make distributed layer 3 control and data plane networks happen? The first one is BGP. BGP stands for Border Gateway Protocol. It's an exterior routing protocol and, de facto, it is the routing protocol of the Internet; it has been adopted as such. It scales really well. It can handle route tables of, I'm not going to say infinite size, but I believe the global Internet routing table in IPv4 must be approaching a million routes; it's probably there right now, since the last time I looked at the full feed. BGP is a distance vector routing protocol. What does that mean? It means that the selections done by the routing protocol, as far as next hop selection goes, are based on the shortest path. And what does a path represent in BGP? In BGP, every router identifies itself with an autonomous system number.
And when BGP looks at a path to traverse to some destination, it looks at the list of autonomous systems that the traffic has to pass through, and of course the shortest path is preferred. This is the default, most basic selection of the best path in BGP, and there are many other mechanisms that can influence it, but on a basic level that's how BGP works. BGP was also adopted as the default protocol for spine-and-leaf data center network deployments. The reasoning here is pretty simple: it's because it scales well. When we want to scale an enterprise data center, we want to pick a protocol that will let us scale without further consideration, and BGP is that protocol; it is de facto the most relevant protocol of the whole Internet, and I would say the top choice for the hyperscale providers as well. BGP made its way from outside the data center to inside the data center. It is an exterior protocol by design, but because it scales so well, the demands of modern hyperscale data centers are such that BGP is the answer there as well. Now, BGP is a good routing protocol, but BFD is a very interesting piece that can improve BGP in many aspects. BGP has some shortcomings; one is the default hold time for routes. For example, let's assume a scenario where a BGP session between two routers goes stale: it drops, there's a network issue. In BGP there's a property called hold time, which refers to the amount of time that a router will keep the routes from a peer in its route table even though that peer has become unreachable. Those timers are only tweakable to a certain degree, and you can imagine that if you have multiple paths available, you want to switch to another path to a destination as soon as your primary path goes down. That's where BFD comes in. BFD is great at very aggressive, constant monitoring. When integrated with BGP, it will monitor every BGP peer that is enabled on a router, and in the case of a failure it will immediately remove the routes from the route table, allowing the secondary or tertiary path to take over. I like to call BFD BGP's little helper. That's not to say that BFD only works with BGP; BFD actually works with all sorts of routing protocols, including protocols which are not distance vector routing protocols, link-state protocols like OSPF, and EIGRP, you name it. BFD has been implemented into pretty much every dynamic routing protocol with one goal: to improve your time to recover. When there is a failure, you want that traffic to switch over to your standby path right away, and BFD is great at that. The next thing I wanted to mention that FRR brings to OpenStack is ECMP. ECMP stands for Equal Cost Multi-Path routing, and that's a pretty simple concept: it allows the kernel to install multiple routes to the same destination. That ties in with what we already talked about: BFD helps you recover from failures quicker, and ECMP is the mechanism that allows you to have two routes to the same destination installed simultaneously. There are some caveats, of course, about these routes: they have to share some common properties. One of them is, of course, the destination, but they also have to have the same metric; they have to be pretty much identical routes, except for the next hop.
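As a rough illustration of what ECMP looks like on a Linux node, with made-up addresses and interface names (this mirrors what a routing daemon like FRR would program into the kernel, not anything OpenStack-specific):

```python
import subprocess

# Two equal-cost uplinks toward the spine; purely illustrative values.
DEST = "0.0.0.0/0"
NEXTHOPS = [("192.0.2.1", "eth1"), ("198.51.100.1", "eth2")]

def install_ecmp_default_route():
    # Equivalent of:
    #   ip route replace default \
    #     nexthop via 192.0.2.1 dev eth1 nexthop via 198.51.100.1 dev eth2
    cmd = ["ip", "route", "replace", DEST]
    for gateway, dev in NEXTHOPS:
        cmd += ["nexthop", "via", gateway, "dev", dev]
    subprocess.run(cmd, check=True)
    # With BFD watching both BGP peers, a dead next hop is withdrawn quickly
    # and traffic keeps flowing over the remaining path.

if __name__ == "__main__":
    install_ecmp_default_route()
```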
Next, we wanted to compare the traditional three-layer architecture, not to be confused with Layer 3, to spine and leaf. Spine and leaf is the architecture we're targeting with our V3 version of a layer 3 distributed data center. The traditional three-layer architecture is really built with an emphasis on optimizing north-south traffic: we're trying to get traffic in and out of the data center, not so much east to west, where some traffic actually stays within the data center. It has higher latency, it creates traffic bottlenecks, and that last limitation is mainly due to the widespread use of layer 2: there's no way to get out of that data center other than through your core layer, which is what actually does the layer 3 routing. Now we're moving to a spine-and-leaf architecture, where we put emphasis not only on north-south traffic, which you get almost for free from your top-of-rack switches to your spines, but we can also efficiently route east-west traffic within the data center. You can distribute your data plane and efficiently route traffic between those availability zones, also called leafs here. The latency in such a deployment is lower and much more predictable. We already mentioned that it's much easier to scale this kind of deployment: you just add more blocks, making sure that those new blocks, called leafs here, are connected to all your spine routers. And a very important feature here is that the failure domains are isolated to the leaf. Why is that? Your failures usually happen within broadcast domains, and broadcast domains are tightly coupled with, almost synonymous with, layer 2. Now that we have multiple leafs, each of these leafs can actually reuse duplicate VLANs; no one cares anymore. All you care about are the layer 3 networks, meaning the IPv4 or IPv6 subnets associated with them, and when you're troubleshooting, you always contain the troubleshooting to that broadcast domain. Now a quick conversation about overlay networking versus provider networking; maybe I'll give you a short introduction into what provider networking is versus overlay networking in the context of OpenStack. Provider networks in OpenStack always have a one-to-one relationship with a network that exists outside of OpenStack, a network that's been provisioned by the operator; it always involves someone outside of the tenant to provision those networks. A good example of that would be a VLAN: that is something that exists on a switch, not something that exists inside OpenStack, unless you create a provider network which links that VLAN to an OpenStack network. Overlay networks, on the other hand, are managed by the tenant: the tenant can create those networks without involvement of an operator or administrator, and thanks to that they're virtually unlimited. For example, with VXLAN, I forget the exact maximum, but the identifier space is on the order of 16 million, and compare that to VLANs, which is 4094. So virtually unlimited; it will run out at some point, but not as quickly as 4,000 VLANs. Now, what makes overlay networks more attractive for cloud use cases? The main benefit we're looking at here is self-service: we're dealing with infrastructure as a service, platform as a service, and we want our users to be able to request those themselves. We don't want our users to file tickets with IT for provisioning a new VLAN, so we're giving our users the power to create those inside our infrastructure.
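To make the provider-versus-overlay distinction concrete, here is a small sketch using the openstacksdk Python client. The cloud name, physical network label and VLAN ID are placeholder assumptions; the point is only that a provider network maps to something pre-existing on the switches and needs admin involvement, while a tenant overlay network is self-service.

```python
import openstack

# Assumes a clouds.yaml entry named "mycloud"; all values below are illustrative.
conn = openstack.connect(cloud="mycloud")

# Provider network: an admin maps an existing VLAN (already provisioned on the
# switches) into OpenStack. Requires admin rights and out-of-band coordination.
provider_net = conn.network.create_network(
    name="datacenter-vlan-100",
    provider_network_type="vlan",
    provider_physical_network="datacentre",
    provider_segmentation_id=100,
)

# Overlay (tenant) network: any project can create one on demand, no switch
# changes needed; Neutron picks a tunnel segmentation ID by itself.
tenant_net = conn.network.create_network(name="my-app-net")
conn.network.create_subnet(
    network_id=tenant_net.id,
    ip_version=4,
    cidr="10.10.0.0/24",
)
```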
There are some downsides to overlay networking, especially when used in conjunction with Kubernetes. Kubernetes uses its own networking model, and there's a good chance that we take a performance hit because we double-encapsulate our traffic: if we're using VXLAN networks in OpenStack, and that OpenStack is also hosting a Kubernetes deployment on top, we will be double encapsulating. And from the standpoint of administrators, who really like to have insight into what happens in their networks, they don't really have insight into those overlays; those exist strictly inside your compute environment or your Kubernetes environment. Provider networks, on the other hand: admins have perfect insight into them, and they enjoy the full performance of the networking gear they live on. Segmentation per site is also an interesting notion here, especially when we're dealing with this kind of distributed networking, but it's less cloudy, and that's an interesting concept, being less cloudy. Cloud is all about self-service, so if you use provider networks and try to build clouds on top of that, you run into roadblocks trying to provision your networking. You want to deploy that application right away and self-provision the networking for it, and that would be your choice with overlay networks. Now, we have Emilien on the call. He's going to walk us through how this works in Kubernetes and give us a cool demo at the end. Emilien, are you still there? Yeah, I was waiting for the beer, but I'm sitting here, yeah. All right, Emilien, I'll stop sharing and you can take over the screen. Okay, thanks. So, first of all, yes, this is interactive, so even if I'm on the phone, feel free to interact; just speak up if there's any question. It seems like I have less than 15 minutes; I might go over, if that's fine. Yeah. So my name is Emilien, I'm in Canada. I worked a lot on the OpenStack project over the last 10 years, and now my focus is more on the integration between Kubernetes and OpenStack, with a strong focus on use cases like edge, telco and large-scale deployments where we have Kubernetes deployed on top of OpenStack. Oh, I should share my screen, right? Yeah, if you could, sounds good. We could keep looking at you, you're being gorgeous on that big screen. I'm not, I'm actually tired. All right, let me... okay, let me share my screen. All right, thanks. So, yeah, just a small note about the following discussion: it's about work in progress. I'm going to present something under discussion in the community, which doesn't mean it won't exist; it just means that what you see right now is a very early stage of what we are working on, and you should feel very lucky to see it because, for example, the upstream spec was only proposed this week, so it's very early. And what I'm presenting today is a good follow-up on what was presented just before, because it talks about exactly the same concepts.
I'm not going to repeat what was said, but I want to start with a little bit of history. As was mentioned before, when you wanted to scale the data center, which could be OpenStack or something else, you would create different sites, and some workloads would have to be within the same layer 2 network for various reasons. That was the case for the Kubernetes control plane, for example, where the API and ingress traffic have to be within the same network, and tools like keepalived and HAProxy would be managing this kind of traffic, at least that's the case in OpenShift, but there are many ways of managing this networking in the control plane, and we are going to go through one of the improvements we are working on. Basically, what you need to know, and this was mentioned before, is that when you run one giant layer 2 network domain, the biggest issues we have seen are traffic bottlenecks, and it's also very difficult to maintain for the network administrators, especially when you have to extend to a new availability zone for OpenStack and then, on top of that, you want to install a Kubernetes cluster that has to communicate with another zone; it's very complex to implement at the moment, and that's what we are trying to solve. Again, what we are going to propose is exactly what has been done in the OpenStack community, where we are bringing more BGP into Kubernetes, and not just for the workloads but for the control plane as well. I like to put a slide in my presentations where I explain a use case, and let's say that today the use case is: you have a very large deployment of OpenStack, and of course of Kubernetes, you have critical applications, and you have very high SLA requirements. Which means that if in your data center you have a power failure on a rack or in a room, the application that was running in that location has to be able to fail over very quickly to another area. This is a big challenge, especially with all the constraints you have with infrastructure. OpenStack has an architecture named distributed compute nodes, which can also be seen as the reference architecture for Red Hat when it comes to deploying OpenStack at the edge, but you can also use this architecture to deploy OpenStack within the same data center. Instead of remote edge sites, the locations, within the same data center, would just be a room or a rack of servers that have independent power and network resources. The idea is to be able to do the same for Kubernetes: have separate resources and stretch the cluster across those resources. And just one note: BGP is a very popular protocol and has been used not just on the Internet but in many data centers already, which is why it's kind of the obvious protocol we want to use for the needs that we have. This next slide is very similar to a previous one, but more Kubernetes oriented. On the left side you have the traditional architecture, where you would have, let's say, two sites, which we call failure domains; those can be two data centers, or again they can be two rooms or two racks, but they have independent power and network resources, as you can see. We would basically deploy the Kubernetes cluster on those two domains, but for the control plane to be able to work, you would have to stretch the layer 2 network, which again can be very problematic. And the reason you would have to stretch it is
because most of the tools that are used for the Kubernetes control plane today only work within the same network. For example keepalived, which is used when you deploy OpenShift on top of bare metal, VMware, OpenStack and some other on-prem platforms, uses the VRRP protocol, which requires the same layer 2 network between the master nodes. So it has limited SLA and scalability limits, like we said; I'm not going to repeat that. On the right side you have the modern architecture, also named spine-and-leaf, and the big difference here is that you have three independent networks. You can think of them as three independent subnets that are connected to their leaf routers, which in turn are connected to the spine, and what we are doing here is bringing BGP down to the Kubernetes control plane. That's the idea. It's a very big picture, and I know it's late for most of us, at this time it's difficult to process even for me, but I'm going to show a demo after this and I can go back and forth with the picture to show you exactly what it means. When you bring BGP down to the Kubernetes control plane, what you can do is advertise the VIPs, the Kubernetes control plane virtual IPs for API and Ingress; you can also think about egress. We don't talk about the workloads today, because there are already tools in the Kubernetes community for using BGP load balancers for the workloads: you might know about MetalLB, a very famous and popular project which has FRR as a backend for BGP, so you can create a load balancer in Kubernetes that advertises the routes for reaching the workloads to the BGP peers. That's really for the workloads, whereas today we're talking about the control plane. What you see on the screen is basically three sites, three availability zones, which we also call three failure domains, and each of them has a Kubernetes master node, so in total you have three master nodes and as many workers as you want; here we have only three, but you can have as many as you want. The thing I want to highlight here is that there is a container for FRR that manages the route peering to the leaf, and then to the spine, and it advertises the routes to reach the VIPs. We are working on making it so the VIPs are just created by simple ip commands on the host, and FRR watches for local routes to see if the VIP exists; if yes, it communicates this through the BGP protocol to the leaf, and the leaf communicates it to the spine. So I'm going to start a demo. It's a very quick and, I would say, simple demo, and then I'm happy to take questions or go back to something if needed. What I want to show you is three masters with Ingress; I want to show you that Ingress quickly fails over and is load balanced between the masters, and I also want to show you the BGP logs on the hosts, even though I'm sure it's not very useful to show you all the logs, but at least you can see a little bit of how it's wired. So I think I can stop sharing my slides and I will start sharing my other screen. You should see my screen, okay, not too small? Okay, so let's start with the basics. Oops. So this is my lab; I don't have three sites right now, I only have one, but I have a VM that replicates a spine router, I have a VM for the leaf router, and then I have three Kubernetes master nodes. Let me start with one of the master nodes.
The master node has the VIPs on the loopback device: if you look at the loopback device you will see 192.168.100.250 and .240. Those two IPs are the VIPs; one is for the API and the other one is for Ingress. What we have here is a static pod running on the host, monitoring the API and Ingress controllers, and if the API and Ingress are up and running, it creates the VIP. As soon as it creates the VIP, FRR, which also runs in a static pod on this host, tells the leaf router over its BGP peering: hey, I have this IP address here, and if you want to know how to reach it, this is the route. That's what it's all about. I'm going to show you the FRR container; these are just the logs of the FRR static pod on my Kubernetes control plane node. I can show you the configuration: if you're familiar with BGP and FRR it's a pretty basic configuration, but it's basically peering with the uplink router, which is the leaf router, the one that will later peer with the spine, and I have access lists to advertise the two VIPs that I want to route. We can do something fun after that, actually, I will demo it. So on the right side you have the three masters. I'm going to run tcpdump on the Ingress IP address, and on the left side I'm going to curl an application that I deployed in my cluster. It looks like the Ingress traffic is going to master number two, so this is great, master two seems up and running. Now, what happens if I change the health checks of Ingress on master two and force them to fail? I don't want to destroy my cluster, because I want to keep the demo going, so I'm just going to change the health check so you can see what happens: instead of doing just its regular curl, I'm going to make the script exit 1. I'm going to show you what happens in real time, give me one second. Okay, so on master two I'm just watching for the VIP on the local host to make sure that it's created, and on the right side I'm going to edit the health check script so that it returns an error. What we should see is the VIP being removed, and then I can even show you the logs, to show you that the route gets changed. Actually, I could even go on the leaf router, let's grep for the VIP, and see what happens. Okay, so the route was removed, sorry, the VIP was removed on the leaf. What happened on the leaf is that it received a BGP update telling it how to reach that VIP now, so now I only have one host where I can reach the VIP. If I go back and revert my change, I have two routes again. So let's do it again: now I do my curl, and what I should see is tcpdump on the top right side, which is my host that remains reachable and still has the VIP, and now I can see the traffic on the other side. I hope it's useful for wiring up in your head how we are using BGP. Yeah, how long does it take to switch over? Well, there are multiple things to take into account here. There is of course BFD, and BFD is really fast, if you know about it; it's a matter of a few milliseconds. It's also a matter of how you implement the static pods in Kubernetes that manage the VIPs; right now it's a loop that watches the health check, but it's really fast. What we are aiming for is one second of convergence for the API and three seconds for Ingress; that's what we are aiming for right now, but we could go even faster, it's just a matter of how much CPU it will consume, so it's something that you have to choose at some point; it's a balance.
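As a rough sketch of the kind of loop that static pod runs (the health endpoint, interval and commands are made-up placeholders; the real implementation is still under design upstream): probe the local API/Ingress, then add or remove the /32 from the loopback accordingly, and let FRR's watch on local routes do the BGP advertisement or withdrawal.

```python
import subprocess
import time
import urllib.request

VIP = "192.168.100.240/32"                    # Ingress VIP from the demo
HEALTH_URL = "http://127.0.0.1:1936/healthz"  # hypothetical local health endpoint
INTERVAL = 1.0                                # seconds between probes

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=1) as resp:
            return resp.status == 200
    except OSError:
        return False

def set_vip(present: bool) -> None:
    # Adding the address makes FRR advertise the /32; deleting it withdraws the
    # route, and BGP converges to another master that still holds the VIP.
    # check=False ignores "already exists" / "cannot find" errors on repeat runs.
    action = "add" if present else "del"
    subprocess.run(["ip", "addr", action, VIP, "dev", "lo"], check=False)

if __name__ == "__main__":
    while True:
        set_vip(healthy())
        time.sleep(INTERVAL)
```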
Do you find that BFD consumes a lot of CPU when it's checking that aggressively? No, but the health checks on Kubernetes, yes, those are expensive. Well, individually they are not, but when you have multiple VIPs, multiple scripts and multiple masters, it becomes expensive, so you need to be careful. But what you see right here is an implementation that is not too expensive in terms of CPU, and as you could see, it's almost real time. So what's the tie-in between the health check and the VIP route being removed, how does that work? So, once the health check returns a failure... I can, do you want me to go into the code? No? Okay, we basically remove the IP address from the loopback device for as long as the health check reports a failure, which means that as soon as the VIP is removed, FRR sees it immediately, and then, as you saw, the BGP route is updated, so it's really fast. So yeah, what's next? Like I said, it's under design, it's a POC, but we've made pretty good progress on it and I hope we'll ship it as tech preview in the next cycle. What's very interesting in this architecture, and what I like the most compared to the traditional architecture with the stretched layer 2, which is not really flexible because the services, the API and Ingress, have to share the same resources, is that with BGP you kind of open the door to many possible configurations. In BGP, an important part of the configuration is done on the leaf and spine networks: you can improve convergence by configuring BGP in a certain way on your spine, or if you want to do API load balancing, you can configure BGP in a certain way on your leaf. So there are a lot of things that we won't control, and that's for the best: you give some control over infrastructure networking back to the network admins, and I think it's very important, especially for Kubernetes, not to have control over everything. This new architecture, where we just advertise what's happening on the cluster, I think is going to be great. So yeah, the next steps: from a technical point of view, we are working on the API for Kubernetes failure domains in the cloud provider, cloud-provider-openstack. There is an upstream spec being worked on right now in Kubernetes that will allow us to deploy clusters across multiple failure domains. Right now the demo, as you see it, is in one failure domain, but we are working on stretching the cluster across multiple domains; it's kind of a parallel effort to the BGP work. Are there any questions? I'm very happy to answer, anyone still awake? So this is very good. A couple more things we want to look at in the future: performance and scale, as we scale the number of BGP sessions, the mesh and the number of nodes in the mesh, the number of routes and all those things, and what impact that has on forwarding performance as well. So that's something to look at. Again, right now we are just talking about potentially two IP addresses that we want to route, which are the Ingress and the API, but we need to think about collaboration with the workloads, and I'm talking about MetalLB. This is a very early discussion right now, but we are thinking about whether or not we need to make the control plane BGP instance collaborate with the BGP instance of MetalLB,
kind of sharing the same sessions, because as you know, you cannot easily have two BGP sessions on the same network between the same two nodes. Cisco has something they implement for that, but in FRR you don't have such a thing, although we are investigating how we could do it. The idea would be that in the end we would share BGP configs and BGP sessions between the workloads and the control plane, potentially. All of this is kind of the roadmap for the next 12 months. Using it as a route reflector on the control node would be another option. Exactly. So, a question: I appreciate the demo and the talk, guys. I understand the networking side and why we would want to do that; I'm curious how you are handling the database synchronization among the multi-site control planes, which is MySQL for the most part, and obviously something like that... I can take that one. Since the controllers are now distributed on separate layer 3 networks, they all have layer 3 connectivity, so in that aspect the database replication mechanism doesn't change; it is just using a different transport, we are using layer 3 networking to do the replication. But are you splitting sites into separate physical locations, a controller in each physical location for example, and with that, are your compute nodes then talking to that local controller? There is a VIP address that is being presented, and whenever a compute node needs to talk back to the control plane services, it uses that VIP, and only one controller is of course active. So you have compute nodes still hopping across sites, theoretically, instead of talking to the local one? Yep. And the database, I don't think that is such heavy traffic. From that perspective nothing has changed from the previous architecture, it is pretty much the same. Again, the limitation we are trying to remove is putting everything in the same L2 and getting a broadcast storm that kills the entire control plane, but for the rest there are really not many changes in how the other components work; we are just reworking the networking flow, and all of the other components work exactly the same as they did before. As far as etcd, it is pretty much the same: etcd recommends running the etcd cluster within the same data center, and if you want to deploy an etcd node into another failure domain, they recommend a very close data center. But again, the use case of these large deployments that require a very high SLA for the applications is that they are deployed within the same data center, using different racks, power and networks, but the latency should stay really low. So basically a rack is a failure zone, but it's all within the same segment? If you're looking for geographically distributed controllers, I don't think that model would fit here; it would be too much latency to geographically distribute them. You still need to be conscious of keeping those latencies low across all of the AZs. Yep. All right, there are no more questions. We appreciate everyone who stayed with us. Emilien, that was an awesome demo; I hadn't seen it before, it was the first time for me. Me neither. We are ready to go, that was great. I hadn't seen this demo before either, because it only works since yesterday. Very cool, I'm glad you like it. Yeah, thanks so much everyone for joining. This session was recorded and it's going to be posted on the OpenInfra
Foundation site too, if you want to go back and check some of the details, and ping us if you need any more information. Thanks guys, thank you. Thank you guys.