Good morning, everyone. My name is Mark Gray. I work for Red Hat, and joining me today is my colleague Daniel Alvarez, who also works for Red Hat. The topic of this presentation is Layer 3 networking with the Border Gateway Protocol in hyperscale data centers.

So this is the agenda. We're going to talk a little bit about data centers and how they've evolved over the last number of years. We're going to introduce the Border Gateway Protocol, and then we're going to talk about the community interest in integrating Layer 3 routing protocols like BGP into our products, OpenStack and OpenShift. Then we'll wrap up with a brief discussion about future activities and take any questions from the audience.

I don't want to overgeneralize about data centers; obviously every data center is designed slightly differently. However, we do want to talk a little bit about how data centers have evolved over the last couple of years, and this table is meant to highlight that. In terms of network topology, traditional data centers would have had their networking arranged in a hierarchical manner, like an upside-down tree. That has changed over the last number of years to more of a leaf-and-spine model, and the reason for that is scalability: being able to scale out the physical topology of the data center as required.

The unit of compute has also changed in the data center. Previously we would have deployed our workloads and applications on over-provisioned servers within a rack; now our applications are hosted in containers and virtual machines, and as such the density of IP addresses within a rack has increased quite dramatically. The mobility of those IP addresses has also changed. Previously an IP address would have been associated with a server within the rack; now we have IP addresses associated with virtual machines that can move throughout the data center, and IP addresses associated with containers that can be created or destroyed very quickly.

The way we build our applications is different too. Previously we would have built big monolithic applications hosted on a server, and the predominant network traffic pattern would have been north-south, from a client outside the data center into the data center. Now our applications are built in a distributed manner, and as a result there is a lot more traffic between the services within an application, east-west within the data center.

Again, trying not to overgeneralize, but typically the Layer 3 smarts of a data center would have been in the upper layers of the topology, in the big core routing infrastructure, while the lower layers of the data center network were more Layer 2 focused. In Layer 2 networks, MAC addresses and IP addresses are discovered through broadcast mechanisms, things like ARP and MAC learning. Other Layer 2 networking technologies are also used in this type of data center: spanning tree protocol to ensure there are no loops in the network, VLANs for segmenting the network, and various technologies for ensuring redundancy of network links. These have all served us very well over the years, but they do present some challenges in the type of data center I described on the previous slide, with a large number of IP addresses that move around a lot and a lot of east-west traffic.
Some of those challenges are listed here on the left. As we have many more IP addresses and MAC addresses within a rack, the top-of-rack switch comes under a lot of extra pressure on its forwarding tables, because it needs to learn and store the MAC addresses of all these workloads in its tables and its TCAMs. As a result, the number of TCAM entries required has increased quite dramatically, even beyond what's available in some top-of-rack switches. Also, because we're using broadcast, ARP and MAC learning, to discover the MAC addresses for IP addresses, the convergence time when things change in the network has increased as well: it can take a long time, when an IP address moves from one location to another, for all the various tables to update accordingly. And as our Layer 2 segments get larger, our broadcast domain gets a lot larger too: there are far more nodes on the Layer 2 network, which means far more broadcast traffic, and our Layer 2 failure domain is large as well. All of these problems show up in Layer 2 networks when you scale them up to hyperscale.

There are various ways you can resolve this, but one trend that we are seeing is the use of Layer 3 throughout the data center. In a Layer 3 data center, rather than having the routing smarts located at the top of the topology in the core router, we see those smarts distributed throughout the data center, even down to the servers themselves in the racks. As a result, you need some way to distribute routes around the data center, and there are various routing protocols for that; the one we're going to talk about today is the Border Gateway Protocol. A Layer 3 data center also needs analogous technologies for redundancy and link failure detection, things like Bidirectional Forwarding Detection (BFD) and ECMP. This does resolve some of the issues we talked about on the previous slide. In the literature you do see people quoting convergence times of less than a millisecond, and you can see how the L2 failure domains and the broadcast domains shrink accordingly as you distribute your L3 routing smarts throughout the data center.

As I mentioned, there are a number of routing protocols you could potentially use for this, but the one we're going to discuss here is the Border Gateway Protocol. So what is BGP? Some people may be familiar with it because it is the routing protocol used on the internet for exchanging routes. BGP allows you to exchange reachability and routing information between autonomous systems. An autonomous system is basically a collection of IP routing prefixes under the control of a single administrative entity; on the public internet, an autonomous system may be a service provider or a large enterprise organization. You can also use autonomous systems within your own network to segment your IP address space. What BGP is, then, is a control plane protocol that distributes routes between autonomous systems over TCP port 179. There are two flavors of BGP: interior BGP (iBGP), used for exchanging routes within an autonomous system, and exterior BGP (eBGP), used for distributing routes between autonomous systems.
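To make the iBGP/eBGP distinction concrete, here is a minimal sketch of how such a session might be set up on a node running FRR, driven from Python through FRR's vtysh shell. The AS numbers, neighbor address, and prefix are hypothetical documentation values, and the use of FRR here simply anticipates the design we discuss later in the talk.

```python
import subprocess

# Hypothetical values: 64512/64513 are private AS numbers; 192.0.2.1 and
# 203.0.113.0/24 are reserved documentation addresses.
LOCAL_AS = 64512
PEER_AS = 64513          # equal to LOCAL_AS would make this an iBGP session
PEER_IP = "192.0.2.1"    # e.g. the top-of-rack switch
PREFIX = "203.0.113.0/24"

def vtysh(*commands: str) -> None:
    """Push configuration into FRR through its vtysh shell."""
    args = ["vtysh"]
    for cmd in commands:
        args += ["-c", cmd]
    subprocess.run(args, check=True)

# Because PEER_AS differs from LOCAL_AS, FRR treats this peering as eBGP;
# the session itself runs over TCP port 179 as described above.
vtysh(
    "configure terminal",
    f"router bgp {LOCAL_AS}",
    f"neighbor {PEER_IP} remote-as {PEER_AS}",
    "address-family ipv4 unicast",
    f"network {PREFIX}",
    "exit-address-family",
)
```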
So now I'm going to hand you over to Daniel Alvarez, who's going to talk a little bit about the community interest in integrating Layer 3 protocols such as BGP into our products.

Thank you, Mark. First, a quick introduction from my end. This is Daniel Alvarez. I've been working at Red Hat for the past four and a half years or so, and my main focus has been OpenStack networking and, later on, OVN and OVS. I'm taking over now with this slide, which tries to capture the most recent and relevant community interest around the subject. I have included a couple of mailing list discussions in OpenStack and also in OVS and OVN, some specs that have been proposed for OpenShift, and probably the most recent item, from about a month ago at the OVS and OVN Conference 2020 back in December, on OVN with dynamic routing, which included a presentation by Nutanix where an EVPN-based solution was presented. So this has attracted the interest of the community, and the goal here is for us to have a clear idea of what the community is trying to solve and how they're trying to do it, so that we can help achieve this through our suite of products.

On this slide I'm going to go over the main use cases in OpenStack, which can be summarized in three main pillars. The first is advertising host routes, /32 for IPv4 or /128 for IPv6, to the actual VMs on provider networks. The second is advertising tenant networks; I will go through that a little later. The third is advertising floating IPs for the IPv4 use case.

There is an existing upstream project called Neutron Dynamic Routing that covers pretty much this, but it lacks an important feature, which is the advertisement of host routes to virtual machines on provider networks. It only supports tenant network advertisement, and it only advertises the whole subnet, so it will advertise the /24, /26, or whatever subnet you may have, through the Neutron virtual router; the next hop of that subnet will always be the Neutron gateway port. So it does not support provider networks, as I said, and there are a few more caveats. Its architecture is pluggable and supports adding whatever backend you may want, but the current reference implementation is a simple Python BGP speaker that we don't believe is production grade. We haven't seen many real-world use cases for this project, so we are considering whether it's worth trying to revive it, or implementing a new backend and reusing its API to some extent. That is merely where we are today, and with this in mind I'm going to move to the next slide.

Here is a short walkthrough of what we had in mind for what a design for OpenStack would look like. Keep in mind that when we initially worked on this design it was basically just architectural work, and what we have done is meant to be reused by OpenShift as much as we can. This is why, with OVN as a common networking layer for both, the thing that makes the most sense here is an OVN daemon that monitors the OVN southbound database, which includes information about the workloads, where they reside, and so on and so forth, and dynamically applies configuration on the host to advertise these routes. In a nutshell, what we are trying to do here, without any modifications to OVN or OpenStack, is to run this OVN daemon on each hypervisor, on each compute node, and configure on a dummy interface the IP addresses of all the VMs that are placed on that hypervisor.
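As a rough illustration of that idea, here is a hedged sketch of what a single reconciliation pass of such a daemon might look like, assuming the ovn-sbctl and ip command-line tools are available on the host. The chassis name, the dummy interface name, and the Port_Binding parsing are simplifying assumptions for illustration, not the actual implementation.

```python
import subprocess

CHASSIS_NAME = "compute-1"   # hypothetical name of this hypervisor's chassis
DUMMY_IF = "bgp-nic"         # dummy interface whose /32s FRR will advertise

def sh(*args: str) -> str:
    return subprocess.run(args, check=True, capture_output=True,
                          text=True).stdout

def chassis_uuid(name: str) -> str:
    """Resolve the chassis name to its row UUID in the OVN southbound DB."""
    return sh("ovn-sbctl", "--bare", "--columns=_uuid",
              "find", "Chassis", f"name={name}").strip()

def local_vm_ips() -> set:
    """Collect the IPs of ports bound to this chassis.

    Port_Binding rows store addresses as "MAC IP1 [IP2 ...]" strings; the
    parsing below is a simplification of what a real agent would need.
    """
    out = sh("ovn-sbctl", "--bare", "--columns=mac",
             "find", "Port_Binding", f"chassis={chassis_uuid(CHASSIS_NAME)}")
    ips = set()
    for line in out.splitlines():
        fields = line.split()
        ips.update(fields[1:])   # skip the MAC, keep the IPs
    return ips

def reconcile() -> None:
    """Expose each local VM IP as a /32 on the dummy interface.

    With FRR configured to 'redistribute connected', these /32s are then
    advertised to the BGP peers as directly connected routes.
    """
    for ip in local_vm_ips():
        if ":" in ip:
            continue             # IPv6 would be added as /128 instead
        subprocess.run(["ip", "addr", "replace", f"{ip}/32",
                        "dev", DUMMY_IF], check=True)

if __name__ == "__main__":
    reconcile()
```

A real daemon would of course watch the southbound database for changes and remove addresses when workloads move away, rather than running a single pass.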
This triggers whatever dynamic routing solution is running on the host, and we are aiming to use FRR, so FRR will advertise to the BGP peers the directly connected routes to those VMs. We are assuming absolutely no Layer 2 connectivity outside the rack, so in the first approach we are aiming at having two NICs with ECMP routes to the top-of-rack switches, and those networks would be /31s, basically point-to-point networks to the top of rack, just for the BGP sessions and to steer all the traffic through them.

So the goal is to advertise the VMs that are placed on the hypervisor. As I said, we monitor that through this daemon and advertise the routes in a directly connected fashion, and then we need to steer the traffic to the kernel in order to do the routing. The routing to the actual top of racks happens at the kernel level, so we need to perform some sort of proxy ARP, or proxy NDP in the case of IPv6, to achieve this fallback to the kernel, and then the OVN daemon is responsible for installing the necessary routes, and in some cases even some static ARP entries on the host, to be able to steer this traffic in and out, from and to OVN. The goal as well is that traffic which stays inside the compute node doesn't need to go through the kernel, which would make this more efficient than other solutions we have seen: all the traffic local to OVN stays inside OVN, and only when it needs to leave the hypervisor does it go back to the kernel and from there, via the ECMP routes, out through the NICs.

But of course this comes with some limitations, and this slide tries to capture some of them. The most obvious one is probably that we cannot use overlapping IP addresses, so it limits multi-tenancy, even though OpenStack has mechanisms like address scopes to ensure there is no overlap across the different tenants that belong to the same address scope. We could probably overcome that limitation with other solutions like EVPN, but EVPN is more about stretching the L2 domain, so it comes at a cost as well: the traditional Layer 2 problems that Mark mentioned earlier, like broadcast traffic, the size of ARP tables, and those types of things. It would also force users into having EVPN-capable upstream devices and configuring them, and it's probably harder to deploy: somebody needs to be responsible for VTEP endpoint creation, VNI assignment, and so on and so forth. So while this is something we are not fully discarding, today this presentation is putting it aside and focusing on the scenario I mentioned earlier.

Another limitation is definitely around accelerated data paths. SR-IOV is definitely something that cannot be used in this design, because SR-IOV is about bypassing the hypervisor, so we cannot apply the techniques we described to force the traffic to be routed in the kernel; that is simply not going to work in this design. OVS-DPDK will probably work, but the fact that we need to fall back to kernel routing to get out of the hypervisor is probably going to impact its performance to some extent, and this is something we don't know about today. So overall, whether you use DPDK or the regular kernel datapath, we don't know how this is going to affect performance, and it's something we really need to take into account when we go to productize this type of solution.
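To make the host-side plumbing described above a little more concrete, here is a hedged sketch of the three pieces involved: FRR redistributing the dummy interface's /32s as connected routes with ECMP enabled, proxy ARP pulling traffic into the kernel, and an ECMP default route over the two /31 uplinks. All interface names and addresses are hypothetical, and exactly which interfaces need proxy ARP enabled would depend on the actual deployment.

```python
import subprocess

# Hypothetical /31 point-to-point uplinks to the two top-of-rack switches.
UPLINKS = [("eth0", "192.0.2.0"), ("eth1", "192.0.2.2")]
LOCAL_AS = 64512

def run(*args: str) -> None:
    subprocess.run(args, check=True)

# 1. FRR: advertise the dummy interface's /32s as directly connected
#    routes and allow ECMP across both top-of-rack peers.
run("vtysh",
    "-c", "configure terminal",
    "-c", f"router bgp {LOCAL_AS}",
    "-c", "address-family ipv4 unicast",
    "-c", "redistribute connected",
    "-c", "maximum-paths 2",
    "-c", "exit-address-family")

# 2. Proxy ARP: have the host answer ARP requests for the VM addresses so
#    that inbound traffic is handed to the kernel for routing.
for ifname, _peer in UPLINKS:
    run("sysctl", "-w", f"net.ipv4.conf.{ifname}.proxy_arp=1")

# 3. ECMP default route over the two point-to-point uplinks.
run("ip", "route", "replace", "default",
    "nexthop", "via", UPLINKS[0][1], "dev", UPLINKS[0][0],
    "nexthop", "via", UPLINKS[1][1], "dev", UPLINKS[1][0])
```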
And now I'm handing it over back to Mark. Please, Mark, take over.

Thanks, Daniel. So, moving on to OpenShift. OpenShift is at an earlier stage than OpenStack, for sure, and there has been some discussion within the community about the various use cases that would be applicable to BGP with OpenShift. There's a great document, an enhancement proposal kicked off by Russell Bryant and linked in this presentation, that discusses those use cases, and I've added some of them to this slide.

For example, L3 redundancy for nodes: we can use BGP to help distribute routes to nodes so that each node can determine what its next hop is, and that could be used for load balancing or for redundancy purposes as well, since we could have multiple routes out from the node. We can also use BGP as a way to load balance traffic between services in an OpenShift cluster; typically in OpenShift you need an external load balancer to do that, and maybe BGP could be a means of doing it. Also, exposing pods or services directly: when a pod or a service is made available in the cluster, we can use BGP to publish an IP address and a route to the rest of the network for that pod or service. And perhaps BGP could also be used as a way to interconnect different OpenShift clusters. There are others in the presentation, and we also want to hear from other people; maybe you have some ideas in this respect, and we'd like to see what those use cases are, so I'd encourage you to reach out to myself or Daniel, or, for example, to comment on the enhancement proposal.

In terms of design, again, it's very early days, and we are starting to think about how we could integrate BGP with OpenShift using OVN. We want to follow a model very similar to what OpenStack is doing, and ideally we would like to reuse a lot of the components between the two products, though we'll have to wait and see if that is actually possible. As well, we'd like to use FRR (Free Range Routing) as the routing daemon on the node. The approach is very similar: we are potentially thinking about having some agent that would sit on the node, listening to the OVN databases, probably the southbound database, for any change in the OVN configuration, and then using that to reflect those changes onto the FRR daemon, which can then publish them out. And also vice versa: if a change has been made in FRR, we can use that to make a configuration change to OVN as well. So, just handing back over to Daniel.

And just to wrap up, this slide is about the next steps, but really everything we have been talking about throughout the whole presentation is about the future, because everything is at a very early stage. The main takeaway here is probably that we need to keep gathering more feedback from the users and from the community, in order for us to best identify the main requirements and use cases, what kind of challenges we have ahead, and how we can help solve them. On the more technical front, some of the things that come to my mind could be performance and scale testing, and other aspects such as avoiding extra hops by having distributed north-south routing, or coming up with a proper API design that provides enough flexibility for consumers. These are just examples. Hopefully this presentation helps gather the feedback we have been talking about. I want to thank everybody for attending and listening all the way to here. Thank you so much, and see you next time.