Okay, so the last presentation of the OpenInfra Days here today is going to be about open-source networking in a modern data center. What I want to talk about today is how you can leverage open-source routing to implement a spine and leaf architecture in a modern data center, trying to avoid layer two networking and use as much layer three networking as possible, overcoming the shortcomings of layer two with layer three.

What's on the agenda? First, I want to talk a little bit about the OpenStack DCN deployment architecture. DCN stands for distributed compute node. It's an architecture where you deploy OpenStack with compute nodes spread across different data centers. So imagine a deployment where your compute nodes are running in Chicago, in New York, and perhaps Miami, and your control plane runs in Chicago. That's a classical DCN deployment architecture where you try to get your workloads as close as possible to where the requests are coming in. Then I want to briefly touch on the Border Gateway Protocol, BGP; that's what we're going to use to implement dynamic routing in the data center. And then two other really useful protocols: BFD, Bidirectional Forwarding Detection, a very useful helper for BGP in this case, helping to quickly remove routes when there are link failures. And then we have equal-cost multi-path, ECMP, a very useful capability in the Linux kernel that allows you to install multiple routes in the route table with the same destination but a different next hop. And thanks to that, we can do things like load balancing across those next hops.

First, let's talk about how you implement a control plane in a distributed fashion, and what the road to a truly distributed control plane looks like in an OpenStack environment. We start with a layer two control plane: all of our controller machines in OpenStack, which run the control plane services, are collocated in one layer two network. So you can't really spread those around, you can't distribute them, right? They're bound by that layer two network. A bit of an evolution from that is to distribute the layer two control plane network with the help of tunneling protocols. So you could potentially stretch that layer two control plane network and start distributing the control plane geographically, or within your region, and benefit from no longer having a single point of failure in that one layer two domain. And then going from there, once we switch to layer three, we open up more possibilities for distributing that control plane, and we'll talk about how the routing protocols I just mentioned aid in that.

So this is the typical DCN environment today on OpenStack. On the left-hand side here, you have the first rack with the purple squares, where we have all the controllers sitting on the same layer two network. The rack is the single point of failure, right? If you lose this rack, you've just lost your control plane. Your workloads may be running on rack two and rack three, the middle one and the one on the right-hand side. They might be unaffected, but when you lose your control plane, you lose the ability to create new workloads, and when I say workloads, those are typically virtual machines, to add new storage, to add new virtual networks. Your ability to control your data center is gone. So: single point of failure, and that's mainly due to that layer two boundary imposed by our control plane sitting on that one broadcast domain.
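Since ECMP is probably the least familiar of those three, here's a minimal sketch of what it means in practice before we get further into the architecture: one destination, several next hops, installed into the Linux route table. This isn't taken from the deployment itself; the addresses, interface names, and prefix are made-up values just for illustration, and later on FRR will be the piece installing routes like this for us.

```python
# Minimal ECMP sketch: one destination, several next hops, so the kernel
# can balance traffic across independent layer 3 uplinks.
# All addresses, interfaces, and the prefix below are made up for illustration.
import subprocess

def add_ecmp_route(destination: str, nexthops: list[tuple[str, str]]) -> None:
    """Install (or replace) a multipath route: one destination, many next hops."""
    cmd = ["ip", "route", "replace", destination]
    for via, dev in nexthops:
        cmd += ["nexthop", "via", via, "dev", dev, "weight", "1"]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Two equal-cost paths to the same network over two different uplinks.
    add_ecmp_route("203.0.113.0/24",
                   [("192.0.2.1", "eth0"), ("198.51.100.1", "eth1")])
    # 'ip route show 203.0.113.0/24' should now list both next hops.
```

With that picture in mind, back to the architecture.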
You can still distribute your workloads, right? We have the other two racks; this is our data plane with the blue squares. So we can still do that, that's okay. But again, we're focusing on the control plane being the single point of failure in this architecture.

Now take it a step further. Let's try and distribute that control plane, right? What does it take to distribute it, since the control plane is really bound by layer two? Well, let's stretch that layer two across three racks. So now you have that pink bubble present in each of the three racks; there's a controller sitting in each of the three racks. But the way to make that happen, since the implementation relies on layer two, is that there are networking protocols in there that rely on broadcast transmission. So we have to stretch that layer two across those three racks. We get the benefit of actually distributing the control plane across the three racks, but there's additional technical complexity involved here, in that some third-party technology has to implement layer two tunneling to spread that layer two broadcast domain across the three racks. The data plane remains the same. You can still place your workloads across the three racks, so we still have a distributed data plane. Slightly better, right? Complex, but it gives you a distributed control plane.

Now, the target version, right? This is where I think everyone should go in the data center. We've distributed the control plane, and we stop relying on things like layer two tunneling. Instead, we're just ditching layer two completely in the control plane. And in order to do that, you have to use some helpers. The helper here is a dynamic routing protocol, BGP, which is used to advertise the availability of the active control plane machines into your upstream network. In this case, the upstream network is the spine one and spine two switches, into which all three racks are wired. Same thing with the data plane, we're still benefiting from distributed workloads on the data plane. But there's one extra thing here, right? Now that we have BGP, which is able to advertise the availability of the control plane machines, we can also use BGP to advertise the workloads that are running in your data plane. So imagine you have a new virtual machine coming up, and there's a software router running on the compute node where that virtual machine is running. As soon as that virtual machine is up, its IP address is advertised as a host route up to your switching fabric, to the spine, and then redistributed from there either to your corporate network or straight up to the internet if that is a public network.

So let's talk about the distributed control plane and how it works. We no longer have the boundary of the layer two network. Each of these racks, rack number one, rack number two, rack number three, is its own layer two domain. There is no layer two communication between those racks. We're reducing the size of potential broadcast storms, and we're reducing the attack surface as well, right? We're containing that traffic to that one particular rack. And this is especially important when it comes to the control plane, where we are running our APIs, both the public APIs and the internal APIs in OpenStack. There's a little bit of a change in how this works compared to the previous version of that architecture.
We are advertising virtual IP addresses, much like previously we used VRRP to advertise an IP address. VRRP is a layer two protocol that relies on broadcast. Now, instead of doing that with VRRP, we're assigning those VIP addresses to a loopback interface on the controller. From there, there's a routing process that picks up that IP address once it's available, redistributes it, and advertises it out to the rest of the network. Control plane, same thing. I'm sorry, data plane, same thing: instead of virtual IP addresses which advertise the availability of APIs, we have virtual machines being booted up as part of daily operations, spin up virtual machines, bring them down, and as soon as those get booted up, the availability of the IP addresses used by those VMs is advertised using BGP to the rest of the network.

Now, let's talk about the pieces that actually make it happen in a little more detail. So what happens in detail on your layer three data plane? If you're familiar with how OpenStack works, there are provider bridges which handle VM networking. Typically those provider bridges have physical NICs assigned to them. So you have a bridge, you have one NIC, and when the VMs get booted, they get attached with an interface to that bridge, and that's how the traffic makes it in and out. It's kind of simple, right? We have a little bit of a different scenario here, because the bridges where the VMs get booted up and attached don't have any physical NICs. The reason we're doing that is that we want to handle the traffic to and from these VMs with routing, and make those VMs reachable through BGP.

So the first thing that has to happen for the traffic processing, in order for that bridge to respond to any kind of traffic, is that you have to enable proxy ARP. What is proxy ARP? For IPv4 traffic, proxy ARP is a simple piece of technology that just says: respond to ARP requests on that given interface. It's a similar idea for IPv6 traffic, where we don't have ARP anymore, we have NDP, which stands for Neighbor Discovery Protocol. You enable proxy NDP on that given bridge and the Linux kernel can now answer NDP requests, much like it answers ARP requests.

What happens when a new VM boots up? Credit goes to this gentleman, Darren, for this drawing, I love it. A new virtual machine comes up, right? There's a piece of software called the OVN-BGP agent. The OVN-BGP agent interacts with the Linux kernel. It also interacts with the OVN southbound database, where it listens for events. And it also interacts with a routing process called FRR; FRR is a software router, which I'll describe in a bit more detail. So what happens first? The virtual machine comes up, and the OVN-BGP agent is listening for events in the southbound OVN database. What is the OVN-BGP agent going to find out from that database when the VM boots up? The most important piece is what IP address that VM uses. So now that the OVN-BGP agent knows there's a new VM and a new IP, it needs to point it to a bridge. And the way to do that is with kernel networking: using IP rules, we're directing traffic that is destined to that VM to that given bridge. And that is specifically for ingress traffic, right? You create an IP rule that says: from any, to that VM, 10.0.20.30, go to that bridge.
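To make that kernel-side plumbing a bit more concrete, here's a rough sketch of the two steps just described: enabling proxy ARP/NDP on the provider bridge, and steering ingress traffic for the VM's address towards that bridge with an IP rule. This is not the OVN-BGP agent's actual code; the bridge name, routing table number, and helper function are made-up examples, and only the VM address 10.0.20.30 comes from the example above.

```python
# Sketch of the kernel-side plumbing described above (illustrative only):
# enable proxy ARP / proxy NDP on the provider bridge and steer ingress
# traffic destined to the VM's IP into a routing table that points at it.
import subprocess

BRIDGE = "br-ex"        # example provider bridge with no physical NIC attached
TABLE = "200"           # example routing table number used for that bridge
VM_IP = "10.0.20.30"    # IP address learned from the OVN southbound database

def sh(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# 1. Answer ARP (IPv4) and NDP (IPv6) requests arriving on the bridge.
sh("sysctl", "-w", f"net.ipv4.conf.{BRIDGE}.proxy_arp=1")
sh("sysctl", "-w", f"net.ipv6.conf.{BRIDGE}.proxy_ndp=1")

# 2. "From any, to 10.0.20.30, go to that bridge": an IP rule selects the
#    per-bridge table, and a route in that table sends traffic out the bridge.
sh("ip", "rule", "add", "to", f"{VM_IP}/32", "table", TABLE)
sh("ip", "route", "replace", f"{VM_IP}/32", "dev", BRIDGE, "table", TABLE)
```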
Then, if that traffic arrives over one of the BGP links, which is not tied to that bridge, we're effectively redirecting that traffic to the bridge. Then what happens, right? That VM comes up and we want to advertise the availability of that VM's IP address in our BGP-enabled network. So the last piece in that transaction is that the OVN-BGP agent will redistribute that route into BGP by configuring that IP address on a loopback-type interface, which then gets redistributed into the routing process.

Egress traffic is a little bit different, right? We have the same scenario: we have a bridge where we want to direct that traffic, and that bridge has to push the traffic out for that particular VM. So when that VM boots, the OVN-BGP agent will also pick up the MAC address of that VM, it will look up the flow in the OVN flow table, and it will rewrite the destination MAC of packets arriving from that given VM, for that particular flow, to the MAC address of the bridge. And guess what we did just in the previous step? We enabled proxy ARP (and proxy NDP) on that bridge. So now when that VM is trying to find its gateway, right, it's just trying to send its traffic out, it sends an ARP request for the default gateway address, trying to look up which MAC address belongs to its gateway. And that bridge, thanks to proxy ARP, and thanks to the destination MAC being rewritten for that flow by the OVN-BGP agent, will respond to it. How is that useful, right? Now that the traffic has actually reached the bridge, and the bridge is able to respond to an ARP request, or a Neighbor Discovery Protocol request for IPv6, we can use the kernel's routing table to forward that traffic out. So now we've got that traffic on that bridge, and there are routes being advertised over BGP and installed in the routing table. And from there, that traffic can be routed, using the routes in the routing table, to its final destination. So that's how egress traffic processing works when you're using BGP with OpenStack.

What makes this all possible, right? There's one core piece here. Of course, OpenStack is at the core, but the addition to OpenStack that makes it all possible is an open-source routing suite called FRR. FRR stands for Free Range Routing. It's a very interesting project. I learned most of the dynamic routing protocols using its predecessor, which was called Quagga, and you will find references to Zebra and Quagga still in FRR if you dig into it. For example, the daemon that takes care of installing the routes in the routing table is still called zebra in FRR. So about three years ago, maybe four years ago, a few smart people decided to resurrect Quagga. Quagga existed for many, many years, more than 20 years or so. They forked it and actively continued the development that was started many years ago. Quagga, I'm sorry, FRRouting supports many different routing protocols, including OSPF, including RIP, even version one and version two. It supports OSPFv3 as well, so it can do OSPF over IPv6. It's a really good learning tool for routing protocols. The three of interest here for this deployment of OpenStack are BGP, which we already talked about, ECMP, and BFD. All three are implemented in FRRouting, and we can benefit from using them to implement a spine and leaf architecture in our data center.
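Since FRR just came up: here's a rough sketch of the advertisement step from a couple of paragraphs back, where the agent exposes the VM's address on a loopback-type interface so that a routing daemon configured to redistribute connected routes will announce it over BGP. The dummy interface name is a made-up example, and the "redistribute connected" assumption is mine, not a statement about how the agent is actually wired up.

```python
# Illustrative sketch only: expose a VM's /32 on a dummy (loopback-type)
# interface so a routing daemon configured with "redistribute connected"
# will advertise it over BGP as a host route.
import subprocess

DUMMY = "bgp-adv0"       # hypothetical dummy interface watched by FRR
VM_IP = "10.0.20.30"     # same example VM address as above

def sh(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Create the dummy interface once (ignore the error if it already exists).
subprocess.run(["ip", "link", "add", DUMMY, "type", "dummy"], check=False)
sh("ip", "link", "set", DUMMY, "up")

# Adding the VM address here makes it a connected route on this host;
# with "redistribute connected" in the BGP configuration, it gets announced
# up to the spine as 10.0.20.30/32.
sh("ip", "addr", "add", f"{VM_IP}/32", "dev", DUMMY)
```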
The way that FRR is implemented on OpenStack these days is that it's supported by TripleO. So you can deploy router instances, created as containers, on your compute nodes, on your controller nodes, and also on your network nodes. So anywhere a workload, a data plane workload like a VM, or an API from the control plane needs to be advertised, that's where you can run an FRR container. It's also worth mentioning that FRR will run on pretty much any Linux flavor, and on the majority of the BSD systems out there.

Let's talk about the acronyms that we used in this presentation. BGP is the Border Gateway Protocol, an exterior gateway protocol. It is a path vector routing protocol, where routing decisions are made not based on the number of hops. Like when you do a traceroute, sometimes you see like 20 hops, right? BGP doesn't look at the number of hops you would take to reach a destination. It actually considers how many autonomous systems, which are usually large-scale networks, imagine Comcast, imagine Red Hat, imagine AT&T, how many of these networks you have to traverse to get to that destination. This is called the AS path, the autonomous system path. Every BGP route is assigned to a given autonomous system. So in this case, we're looking at how many autonomous systems we have to traverse to get to the destination of choice. BGP has been the routing protocol of choice in spine and leaf topologies, and the major reason is really its scale, right? If we want to implement an infinitely scalable data center, we should pick a routing protocol that can scale to the whole planet, and that is BGP today. BGP is used to route traffic between all the continents, all the countries, and all the internet service providers within a country. That is BGP. The route tables that BGP handles today are, I believe, over a million routes. So BGP has enough power in it; it is architected to scale to that level.

BFD, interesting protocol. BGP has been around for about 30 or 40 years now. Since the inception of the internet, well, maybe not quite, it came right after that. But BGP brings some legacy baggage with it, and one piece of that is something called the hold time. The hold time is basically the time after which a BGP router will drop the routes learned from a peer when it stops hearing back from it. So it sends a keepalive, and let's say you've set the hold time to three minutes, right? It will wait up to three minutes from a missed keepalive packet before removing the routes. So if that peer is really down, things could get pretty dodgy, right? You're sending that traffic into an absolute black hole. And the lowest you can configure the BGP hold time is three seconds, which doesn't seem like a lot, but on the modern internet, three seconds is a massive loss. You can lose a lot of packets in three seconds. So BFD comes in. BFD does much, much more aggressive monitoring of the link. It can do it on a millisecond time interval. And instead of waiting for that hold time to expire, it will actually take the session down right away and remove the routes from the route table. Now, where that becomes really useful is when it is integrated with the next protocol that I'm going to talk about. Because what good is it to take all the routes out of the route table if there isn't another route to get to the destination you want to go to? And that's where ECMP comes in, right? We now have multiple routes in the route table.
They're destined for the same network, the same destination, but they use different next hops. And with the help of BFD, we can instantly withdraw the routes which came from a peer that just went down. ECMP allows you to take over right away, right? There's another route in the route table: the old routes are withdrawn, and the remaining routes are used right away. Moreover, in a scenario where you're using ECMP, you can also balance traffic round-robin across your existing routes, right? So if we're looking at implementing a layer three data center without the need for things like bonding with static LAGs, maybe LACP, or multi-chassis LAGs, we can replace a lot of that functionality that is implemented on the switches with ECMP and specifically BFD, right? So we're implementing link redundancy and link aggregation all in software, provided by FRR and the protocols that enable that for you. I think that's pretty impressive, because it allows you to use pretty cheap switches, right? You don't have to invest in multi-chassis LAGs, and I've had nothing but bad luck with multi-chassis LAGs in my life, replicating ARP state to the other switch is just asking for trouble. So by adding more layer three links to your server, you can actually get around this and implement high availability and link fault tolerance in your data center. A few links here for references. Chris is going to ask a question right now.

So is there like a performance difference between ECMP and LACP, from like the... because you're involved with this, or...

There is a hit. So you're asking the right question, right? BFD, since it allows these sub-second recoveries from link failures, is a little bit resource-intensive, right? It will actually chew up some CPU by sending those keepalives every few milliseconds instead of every three seconds like BGP would do it, right? So there is an overhead that needs to be accommodated when you're speccing out the hardware for your data center, because that overhead will be spent on the compute node, on the controller, on the network node where FRR runs with BFD. I wouldn't say there's a performance impact, but there is an additional consideration for the resources that will be consumed by that software router where BFD runs. And if we're aiming for something like... LACP is pretty good, actually it's pretty great at recovering when a link goes down, you don't even see a packet blip if you're running a ping, right? And since we're aiming to do the same thing with BFD, those timers in BFD have to be clocked pretty low, that interval should be as low as possible, and all of that translates into CPU usage. With the advent of modern hardware though, it's easier and easier to have 128 cores on a server, so that really helps. Does that answer your question, Chris? Thank you.

Where's the demo? The demo is in the lab. And we're not in the lab. Some great pointers here in the references. The first one is a blog that Luis Tomas Bolivar writes. He's an excellent software engineer, the brain behind the OVN-BGP agent. I really enjoy reading his entries, and I highly recommend you do that too. The second article is an introduction to the features that I talked about, with BGP in Red Hat OpenStack 17. And the last one is a link to the deployment guide for installing this, so you can pick that up with Red Hat OpenStack Platform version 17.1. Install that in the lab, play around with it.
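And since we've now covered BGP, BFD, and ECMP, here's roughly what an FRR configuration combining the three could look like, pushed through vtysh. Treat this as a sketch under assumptions of my own: the AS numbers, peer address, timers, and intervals are made-up examples, not values taken from the deployment guide linked above.

```python
# Rough sketch (made-up ASNs, peer address, and timers) of pushing an FRR
# configuration that combines BGP, BFD, and ECMP through vtysh.
import subprocess

FRR_CONFIG = [
    "configure terminal",
    # BGP: peer with the upstream switch and announce our connected /32s.
    "router bgp 64999",
    " neighbor 192.0.2.1 remote-as 65000",
    " neighbor 192.0.2.1 timers 1 3",       # keepalive 1s, hold time 3s (the lowest BGP allows)
    " neighbor 192.0.2.1 bfd",              # let BFD handle fast failure detection
    " address-family ipv4 unicast",
    "  redistribute connected",             # announce the loopback/dummy host routes (VM and VIP /32s)
    "  maximum-paths 2",                    # allow ECMP across both uplinks
    " exit-address-family",
    "exit",
    # BFD: millisecond-level liveness checks on the peering link.
    "bfd",
    " peer 192.0.2.1",
    "  receive-interval 300",
    "  transmit-interval 300",
    " exit",
    "exit",
]

# vtysh executes the -c commands in order, in a single session.
cmd = ["vtysh"]
for line in FRR_CONFIG:
    cmd += ["-c", line.strip()]
subprocess.run(cmd, check=True)
```

With BFD doing the fast liveness checks and maximum-paths allowing more than one best path, zebra ends up installing multipath routes in the kernel much like the ECMP example earlier, and withdrawing them quickly when a peer dies.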
I am in the process of installing that in the lab myself, and that's why the demo is in the lab. Any questions for me? I knew it, that's the boring networking presentation in the last slot. Well, thanks for joining us.

So actually, as much as I'd love to try this, I'm trying to wrap my head around what it would take to migrate to this technology from an existing production cloud. Is that nearly impossible?

You would have to build a new environment. Migrating to this in a staggered fashion might be a little too difficult, especially because as soon as you want a compute node to work in this architecture, you have to disconnect the NICs from your provider bridges, which means breaking the networking for all your VMs. So downtime, 100%; there's just no easy way that I can think of to make a smooth transition into this environment. Maybe do it by region, right? Like migrate your VMs away, convert that stack, migrate your VMs back. Also, what's worth mentioning is that the data plane and the control plane functionality can be decoupled from each other, right? You don't have to use BGP in both your control plane and your data plane. You can run it only to advertise your control plane VIPs, your public and internal API VIPs, or you can use it only to advertise your VMs' IPs. So those two can be decoupled, which means that if you want to take the benefit of the distributed control plane, you can do that independently. Yeah. And that's one of the key benefits which a lot of customers are going towards, so we can put the controllers on different racks, and all those things can be in different geo locations and things like that. So that's a big advantage for some of the customers we're talking about. You guys run multiple availability zones? Yeah. Where do you run your controllers today? There are two of these, but they're all in the same AZ too. Exactly. I mean, this is what we're trying to get to here: get those controllers out of that one AZ and spread them around multiple racks. And to me, that is the obvious choice, right? That's what people wanted to do with OpenStack for a long time, but because of VRRP, right, and then Pacemaker's very close relationship with it, you weren't able to do that. So you get HA for your control plane, geo HA let's say, right, without adding an additional number of controllers. You have three controllers, and you can still achieve geo HA. Thank you very much.