All right. Good afternoon, everyone. My name's Brian Haley, and this is Swaminathan. I work at Red Hat, he works at SUSE, and we still seem to get along. So today we're going to talk about this very long title: Neutron port binding and the impact it has on unbound ports with DVR routers with floating IPs. If you can remember that, that'd be great. So first, a quick agenda. I'm going to do an introduction and a little background on the technology for Neutron with DVR and other things. Then we'll talk about Neutron port binding and what it is, how DVR router scheduling works, how we handle floating IPs with DVR, and some of the limitations we have when we're using unbound Neutron ports, which is the issue we're having with Octavia. Then we're going to go over a proposed approach we have to handle these unbound ports in a different way, some of the design considerations we made when fixing this, and some of the future plans we're working on for unbound ports so that we can cover other cases as well. So first, this is the shorthand name for the discussion, the bug number. Basically, this talk came out of this bug and the discussions around how to fix it. There's a very good explanation in there that we're going to try to cover if you're interested, and please go look at the bug, because Swami's written up a very good intro. So I'll cover a few of the background things on Neutron routing. In normal legacy mode, we use what we call centralized virtual routers; sometimes you'll see it called legacy in the documentation. And the diagram looks more or less like this: there are network nodes, the network nodes have all your routers, and the network nodes connect to the external network and to the internal network via tunnels to the compute nodes. The compute nodes have your virtual machines, which sit on integration bridges.
But when one virtual machine, say VM1 green, wants to talk to VM2 red, instead of talking directly to it, it has to go all the way to the network node across the tunnel, through the router namespace, back out again, and all the way back. So this east-west communication is a pretty long path, and that wire can get pretty hot, basically. So we made it better, but more complicated at the same time. This is the DVR picture, and I know it's maybe easier to read up there, maybe not, but in DVR we solved two problems. The first one was east-west, where, say, VM1 red can talk to VM1 green. Instead of going all the way back to the network node in the green box, the traffic goes onto the integration bridge, br-int, into the local router, and gets routed right there, right back onto the integration bridge and up to the other VM. So that solved the east-west problem where we didn't have to go across the wire. And as well as going to a local VM, if traffic needs to reach a VM on another compute node, it gets routed locally and sent directly to the other compute node across the tunnel. So network nodes are not used for east-west routing at all; they're really only used for default SNAT traffic and a few other things. This diagram also shows north-south traffic, and I'll just skip to the next slide, since floating IP is your north-south traffic. So how do you reach your instance from the public network? You use a floating IP; DNAT takes us in and out of the namespace, onto the integration bridge, and into the VM. With the centralized router in the first picture I showed you, all the floating IPs are configured on the network nodes. But when we distribute the routers with DVR, every compute node has what's called the floating IP namespace. So all the NAT takes place on the compute node, and traffic goes directly to the external network.
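The east-west difference just described can be summarized in a tiny sketch, simply comparing the hops a packet takes in each model (host names here are made up for illustration):

```python
# Hedged sketch: compare the path of east-west traffic under legacy
# (centralized) routing versus DVR. Host names are illustrative.

def legacy_east_west_hops(src_host, dst_host, network_node="network-node"):
    """Legacy routers: every routed packet detours via the network node."""
    return [src_host, network_node, dst_host]

def dvr_east_west_hops(src_host, dst_host):
    """DVR: routing happens in the local qrouter namespace, so the packet
    goes straight to the destination host (or stays local entirely)."""
    return [src_host] if src_host == dst_host else [src_host, dst_host]

print(legacy_east_west_hops("compute-1", "compute-2"))
print(dvr_east_west_hops("compute-1", "compute-2"))
```

The point of the sketch is just the shorter list: DVR removes the network node from every east-west path.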
So the north-south traffic doesn't go through the network node either. But as I note here on the bottom, a problem we saw, and what we're talking about here, is what happens when we change the port association. For example, when you take a floating IP and move it from one virtual machine to another, in the centralized routing case it really just changes an iptables entry, maybe throws away a bunch of conntrack state, and then communication continues. But in the DVR case, when you re-associate the floating IP between instances, it actually has to tear down all that state on one compute node, build it up on another one, and send out gratuitous ARPs. That takes more time, ARPs can get lost, and so you can run into some issues there. It works pretty well, but in a few cases, like the one Swami will talk about, it actually caused us problems. One other thing I'll give you some background on is allowed address pairs. It's a Neutron extension for ports that basically lets you associate a MAC and an IP address with a port, regardless of what sits behind it. So for example, if you wanted to run a virtual IP, you can create a Neutron port with an address pair, associate or add an IP to it, and then an instance with a different IP can use that virtual IP, and the traffic will be allowed to flow through to the instance, where otherwise it would get blocked by the anti-spoofing code. So it allows you to migrate the VIP between two instances and do HA, and this is used by Octavia and other load-balancing-as-a-service projects. Octavia is using VRRP, so they have two or more instances and a virtual IP; they're making sure the other's up, the master always has the VIP, and the slave is waiting for the master to die so it can take over the VIP. That's the general definition I just stole from Wikipedia. So now I'll hand it off to Swami to talk about Neutron port binding.
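Stepping back to the allowed-address-pairs mechanism for a moment, the anti-spoofing behavior can be sketched with a simplified port model (the dictionary layout below is an illustration, not Neutron's actual data model):

```python
# Minimal sketch of the anti-spoofing check described above, using an
# assumed simplified port representation.

def is_traffic_allowed(port, src_ip, src_mac):
    """Return True if traffic sourced from (src_ip, src_mac) may leave the port."""
    # The port's own fixed address is always allowed.
    if src_ip == port["fixed_ip"] and src_mac == port["mac_address"]:
        return True
    # Otherwise the source must match an allowed-address-pair entry,
    # e.g. the VRRP virtual IP shared by a pair of load balancer instances.
    for pair in port.get("allowed_address_pairs", []):
        if src_ip == pair["ip_address"] and src_mac == pair.get("mac_address", src_mac):
            return True
    return False

vm_port = {
    "fixed_ip": "10.0.0.5",
    "mac_address": "fa:16:3e:aa:bb:cc",
    # The VIP 10.0.0.100 has been added as an allowed address pair.
    "allowed_address_pairs": [{"ip_address": "10.0.0.100"}],
}

print(is_traffic_allowed(vm_port, "10.0.0.100", "fa:16:3e:aa:bb:cc"))  # True
print(is_traffic_allowed(vm_port, "10.0.0.200", "fa:16:3e:aa:bb:cc"))  # False
```

Without the pair entry, traffic sourced from the VIP would be dropped, which is exactly the case VRRP failover needs to avoid.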
Yeah, so the way that we implemented DVR is basically completely dependent upon host binding, port binding. In the case of ML2 we have different types of port binding. With the legacy routers, ML2 has default port binding, where you bind a port to a host: whenever a VM requests a network port, the port comes in and Nova tags in which host is trying to plug in the VIF port. So that way the port has a host where it is actually residing, and we know this port resides on this host. In the case of DVR, when we distribute the routers, we had a different kind of port binding beyond the legacy port binding: we extended port binding to cover the DVR router ports, because each of the router ports still has a single IP and the same MAC, but it is actually replicated on every node we have, and each node's host information is bound to that router interface. So if that port is bound to a particular host, we go and create a router on those nodes. As far as a VM is concerned, when a VM comes up, the VM port itself has a port binding that says, okay, I am on host A, or I am on host B. When it is on host A, we know for sure that host A has a VM, and that VM's network is part of a routed network, so we need to go create a router on that node, and then any floating IP or router update that comes in actually goes to that particular node. But if there is no host binding, DVR has no clue what to do, because if a port comes in without any host binding, it says, okay, there is a port: you have created a floating IP, or you created a VM, basically a port, and then you don't bind a host to the port.
Then you associate that port with a network which is being routed, but the router update event does not know which host it needs to notify in order to create a router for this object. So in this case the two important things to note are the device owner and the binding host ID, which are basically attributes of the port binding table. For the device owner, what we did when we implemented DVR was take into consideration that we were deviating from the legacy routers. The legacy routers had router interfaces, but we changed the device owner type to be distributed interfaces, so that whenever we see a distributed interface, we know it's a router interface being used by the distributed routers. And when we see a device owner of compute:nova, which is basically a VM in this case, then it's a VM, so DVR has to service that port; those are the DVR serviceable ports. We have a list of DVR serviceable ports for which DVR has to provide service, which is basically compute, and we also included the DHCP ports as well as the LBaaS ports, because at that time LBaaS V1 was also included. We didn't have an issue with LBaaS V1, because V1 was not using VRRP through allowed address pairs. The issue we are talking about came in later, when load balancing moved from V1 to V2, when they moved to Octavia. And it's not only Octavia, because I have spoken to other customers who are generically using VRRP, using allowed address pairs with some kind of redundancy, using keepalived to do instance failover, those kinds of things. In all those cases, those ports are not bound to any host; they are unbound ports. And in that case, when you assign a floating IP to an unbound port and then you have access through that floating IP, DVR does not have any clue, and it was actually neglecting it, because it doesn't know what to do.
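The two attributes just mentioned, device owner and binding host ID, drive everything that follows, so here is a rough sketch of the check (the prefix list and field names are approximations for illustration, not Neutron's exact constants):

```python
# Hedged sketch of "DVR serviceable" classification and host lookup.
# The owner prefixes below approximate the compute/DHCP/LBaaS list from
# the talk; real Neutron constants may differ.

DVR_SERVICED_OWNER_PREFIXES = ("compute:", "network:dhcp", "neutron:LOADBALANCER")

def is_dvr_serviceable(port):
    """A port gets DVR routing service only if its device owner matches."""
    return port["device_owner"].startswith(DVR_SERVICED_OWNER_PREFIXES)

def host_to_notify(port):
    """DVR can only schedule a router where the port is bound to a host."""
    if not is_dvr_serviceable(port):
        return None
    return port.get("binding:host_id") or None  # unbound VRRP port -> None

vm_port = {"device_owner": "compute:nova", "binding:host_id": "host-a"}
vrrp_port = {"device_owner": "", "binding:host_id": ""}  # allowed-address-pair VIP port

print(host_to_notify(vm_port))    # host-a
print(host_to_notify(vrrp_port))  # None
```

The `None` result for the VRRP port is the crux of the bug: with no host to notify, no router or floating IP plumbing ever gets created for it.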
So since I was talking about the negligence of DVR based on host binding, what I want to cover is how the DVR router scheduler works today and how it's going to work to address this allowed address pairs issue. What the DVR router scheduler does today is this: when a router is created, it does not do anything, but when you try to add an interface to the router, it immediately schedules the router to the DVR SNAT node. That is similar behavior to the legacy routers, where as soon as you add an interface to a router, you go ahead and schedule the router to the DVR SNAT node, or the network node, or the centralized node, whichever you call it. Once this router is scheduled, in the case of DVR, when a VM pops up on host A or host B, the VM comes up and the VM port comes up, so ML2 plugs in the VIF port and there is a port update that happens with the host binding. The port update actually triggers an event from ML2 to the Neutron server that says, okay, I see a port here, a VM has been added to this port, do you want to take any action on this one? So a port update event comes to the Neutron server, and the Neutron server says, okay, there's a port update, do I need to take any action? It looks at the type of the port: is it associated with a router? Yes, it is associated with a router. And if it is a DVR router, and the port has a serviceable device owner, whether compute, LBaaS, or DHCP, then we know these are ports we need to provide service to. So if these ports are there and there is a valid host binding on those ports, then we notify that host, saying, hey, you have a VM right now that requires a routing service, so go ahead and start your routing service there. So that's how the notification goes from the Neutron server to the agents.
It's not exactly a scheduler, because we have split the scheduling aspect from notifying the agents; this was done about two cycles ago. So the router is automatically scheduled to the DVR SNAT agent nodes, which are the centralized nodes, and then the notification goes to the hosts that are actually hosting the VMs where the router is required. Once the notification goes out, the agent processes it and goes back to the server and says, okay, give me all the information about the router, give me the sync data. When we sync the data, when we provide the full details about the router, we fetch all the information about it, like whether it has a floating IP; we collect all that information and give it to the agent. And even in that case, we basically check whether the port has a host binding on it. If it does not have a host binding, we say, okay, there's nothing to process for this floating IP, just forget about it. So this is the flow I talked about for the current scheduler. If you look at it, you can see there is a router create, then an add interface, then a router update event is triggered, and the scheduler schedules the router to the network node. Then, if there is a valid service port and host binding, a notify-host-to-create-routers message is sent to the agent that actually has it; the agent requests the sync data, and the server sends the agent all the information about what it has to do. It's the same for a floating IP as well: when a floating IP is created, it again sends a router update event to the agents that have this router configured, they request the sync data, and if there is a valid service port and host binding, we again send all the information to the host. So this is how the current scheduler works.
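That notify-then-sync flow can be sketched as two server-side steps (function names and dictionary fields here are illustrative, not Neutron's actual RPC API):

```python
# Hedged sketch of the current scheduler flow: on a port update the server
# decides whom to notify; on the agent's sync request it filters floating
# IPs by host binding, which is where unbound ports silently fall out.

SERVICEABLE_PREFIXES = ("compute:", "network:dhcp", "neutron:LOADBALANCER")

def on_port_update(port, routers_for_port):
    """Server side: pick the L3 agent hosts to notify after a port update."""
    host = port.get("binding:host_id")
    if not host or not port["device_owner"].startswith(SERVICEABLE_PREFIXES):
        return []  # unbound or non-serviceable port: nobody gets notified
    return [("create_router_on", host, router) for router in routers_for_port]

def sync_router_data(router_id, floating_ips):
    """Server side: answer the agent's sync request. Floating IPs whose
    port has no host binding are dropped, which is the bug in question."""
    usable = [f for f in floating_ips if f["port"].get("binding:host_id")]
    return {"router": router_id, "floating_ips": usable}
```

A floating IP on an unbound port survives neither step, so no agent ever installs rules for it.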
So in this case, you can see the scheduler is completely dependent upon the host binding and the device owner. That's the reason we have been missing the floating IPs configured on the VRRP port, the allowed address pair port, that is used by Octavia. This is the slide that captures the limitations: as I mentioned, the DVR routers have a tight dependency on a port's host binding to be scheduled on a compute host, and unbound ports are left untouched; we don't worry about them. VRRP ports, or allowed address pair ports, are not bound to any permanent host, so we don't take any action on them. And again, when floating IPs are configured on these ports, we neglect them because they are not host bound. With respect to DVR, a floating IP has a sequence of events that needs to happen in order to create the floating IP and provide north-south access, because when you configure a floating IP, as I said, the scheduler knows, okay, you have a host binding, so I need to go create the floating IP namespace on the compute host where the VM resides. It needs to tie the router namespace to the FIP namespace, and once it ties them together, it configures the DNAT rules in the router namespace, and then it sends a GARP message from the FIP namespace to the outside world, saying, hey, the IP address you are looking for is on this node. Once the GARP message goes out, the external network knows, okay, the floating IP, the private IP, is behind this node, and then any traffic that comes in actually flows through. So that's the sequence of actions that needs to take place when a floating IP is configured. It's a time-consuming process; it's not a quick failover or a quick create-and-delete event. It needs to create the FIP namespace, add rules, send a GARP, and then everything will flow in.
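The sequence just listed can be written out as an ordered checklist, which also makes the failover cost visible: all of this must be torn down on one host and rebuilt on another when a floating IP moves (this is a summary of the talk, not actual agent code):

```python
# Hedged sketch: the ordered actions the L3 agent performs when a floating
# IP is associated on a DVR router, per the description above.

def floating_ip_setup_steps(compute_host):
    return [
        f"ensure the FIP namespace exists on {compute_host}",
        "connect the qrouter namespace to the FIP namespace (rfp/fpr veth pair)",
        "install the DNAT rules in the qrouter namespace",
        "send a gratuitous ARP from the FIP namespace to the external network",
    ]

for step in floating_ip_setup_steps("compute-1"):
    print("-", step)
```

Re-association means running the teardown mirror of this list on the old host plus the whole list again on the new one, which is why it is slow relative to keepalived's failover timers.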
So I just talked about the floating IP, the router namespace, and the FIP namespace. This picture shows you a compute host: the upper dotted rectangle you see is the router namespace, and the bottom one is the FIP namespace. From the router namespace there is a veth pair that connects to the FIP namespace, the rfp and fpr ports, with an IPv4 link-local address between them, and the traffic is chained through it. For any traffic that comes in, we have a rule in the qrouter namespace for the floating IP that's been configured. So if the traffic is coming from that source IP, we know that source IP has a floating IP, and we forward all the traffic to the rfp port, which is directed to the FIP namespace. The FIP namespace internally has a floating IP gateway port, which has an IP that's been consumed from the public network you have configured. Once we have the public IP there, all the traffic goes out; traffic coming in enters the FG port, and then we again know, okay, this traffic is coming in for this floating IP, and we know which router namespace that FIP namespace is tied to, so we forward the traffic there and it goes in. This is just for people who are unaware of how the FIP namespace and router namespace work; there may be some people in the audience who already know this, so sorry if I'm repeating it. Again, it's the same case for the network node. On the network node we have a separate SNAT namespace through which the SNAT is configured, and the DNAT is basically done in the FIP namespace. If you are actually running Nova compute on the network node and you have a VM there, and you configure a floating IP, then you have a floating IP namespace as well as the SNAT namespace.
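As a mnemonic for the plumbing just described, here is a sketch that generates the rough shape of the entries involved. The rule syntax is abbreviated, and the routing table number and `fpr-<router-id>` device name are assumptions for illustration, not exact Neutron output:

```python
# Hedged sketch of the qrouter/FIP namespace plumbing for one floating IP.
# Strings below approximate the kinds of entries involved; they are not
# literal, copy-pasteable ip/iptables rules.

def fip_plumbing_entries(fixed_ip, floating_ip):
    return {
        # qrouter namespace: traffic *from* the fixed IP is policy-routed
        # toward the FIP namespace over the rfp device.
        "qrouter_ip_rule": f"from {fixed_ip} lookup <fip-table>",
        # qrouter namespace: NAT between the floating and fixed IP.
        "qrouter_dnat": f"-d {floating_ip} -j DNAT --to-destination {fixed_ip}",
        "qrouter_snat": f"-s {fixed_ip} -j SNAT --to-source {floating_ip}",
        # FIP namespace: a /32 route sends ingress floating-IP traffic back
        # over the fpr/rfp veth pair into the right qrouter namespace.
        "fip_ns_route": f"{floating_ip}/32 dev fpr-<router-id>",
    }

entries = fip_plumbing_entries("10.0.0.5", "203.0.113.10")
for name, rule in entries.items():
    print(name, "->", rule)
```

The FG port in the FIP namespace answers ARP for the floating IP on the external network; the /32 route is what hands the packet to the correct router namespace.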
So I briefly went through the issues we had, the design aspects, and the things that consume time. We actually considered two different models: one is the distributed model and the other is the centralized model. If we wanted to go with the distributed model for supporting allowed address pairs, the issue was slow failover: if you have a VRRP port and you're pinging it, and you want a short time to migrate it to the new VM, we cannot, within the time required by keepalived, create a FIP namespace and add all these rules for traffic to go through. And in that case, if we wanted to make this work, we would need some sort of ARP message coming from the VM that is switching the IP, switching the MAC, from one VM to another for this VRRP port. Once we get the ARP, we would need to watch for it in the router namespace, and when the ARP message comes in, go ahead and create the FIP namespace and send a GARP message saying, okay, now this IP and MAC are serviced on this node, here is the GARP. And when it switches back, we have to rip all that out and recreate it on the other side, which is a tedious job, and it still does not solve the problem of high availability with respect to VRRP. So what we decided was that the easiest, most reliable solution, with fast failover, is to use the centralized node for this unbound port FIP feature. For any port that comes in as an unbound port, without any host binding, we go ahead and create the floating IP on the centralized node, not on the distributed nodes.
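The resulting scheduling decision can be sketched in a few lines. The `snat_bound` marker name and the dictionary shapes below are my own illustrations of the idea, not necessarily what the patches under review use:

```python
# Hedged sketch of the new behavior: floating IPs on bound ports keep the
# distributed FIP-namespace path; floating IPs on unbound ports (VRRP /
# allowed-address-pair VIPs) are pinned to the dvr_snat agent instead.

DVR_SNAT_BOUND = "snat_bound"  # assumed marker name for illustration

def schedule_floating_ip_notification(fip_port, snat_agent_host):
    host = fip_port.get("binding:host_id")
    if host:
        # Bound port: notify the compute host; agent uses the FIP namespace.
        return {"notify": host, "mode": "distributed"}
    # Unbound port: notify only the network node; the agent installs the
    # DNAT rules in the snat namespace rather than a FIP namespace.
    return {"notify": snat_agent_host, "mode": DVR_SNAT_BOUND}

print(schedule_floating_ip_notification({"binding:host_id": "compute-1"}, "netnode-1"))
print(schedule_floating_ip_notification({"binding:host_id": ""}, "netnode-1"))
```

Failover between the two VRRP instances then needs no namespace rebuild at all; the centralized DNAT stays put and only the VIP's location inside the tenant network changes.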
So it will always be centralized: any traffic coming from outside hits the centralized node, and if it needs to reach the VM, it can use the east-west path to get there. This is the one I just mentioned. We have different ways of doing it, but right now we are doing it through the network node, and the DNAT rules are actually configured in the SNAT namespace, not in the FIP namespace, because the FIP namespace is DVR-specific, and since these are unbound ports, we don't want to mix unbound and bound ports. So we are creating these unbound FIPs in the SNAT namespace, not the FIP namespace. Currently we have two patches up for review: one is the server-side patch and the other is the agent-side patch. And here is the new change for scheduling the unbound ports: when the floating IP is configured, before the router update is sent out, we check whether the floating IP is on an unbound port. If it is an unbound port, we assign a host; we don't send the notification to all the agents, we only send the notification to the DVR SNAT agent, which is basically the network node. So the notification goes to the SNAT agent, the SNAT agent receives it and sends a sync request back to the server, and the server responds, saying, okay, for this one, because it's an unbound port, I am going to tag it as an SNAT-bound port. When the agent receives the information as an SNAT-bound port, it goes and creates the floating IP rules in the SNAT namespace. So this is how it's going to be implemented. If you look at this one, it is still a distributed model, but we are going to implement the floating IP in the SNAT namespace, and any traffic for the
unbound ports, any traffic that comes in, will actually take the east-west path to reach the VMs. In this case I have shown in the picture that the green VMs, VM1 and VM3, have the same allowed address pair configured, and those are unbound, so the VIP can actually be on either host1 or host2. But when we configure the floating IP for that allowed address pair, it will be configured in the SNAT namespace, and traffic can go through it, because our SNAT namespace has the SG interfaces, which have a private IP on the same subnet as the virtual routers, so it can direct the packet directly to the VMs. If you look at the network namespaces on the network node: the rightmost one, the red one here, is basically the legacy router; the legacy router has the QR interface and the QG interface, and both SNAT and DNAT rules are configured within that namespace. The green one is the SNAT namespace that DVR uses, and what we are doing is we have an SG interface, which is basically the private interface, and the QG for the gateway, and both our SNAT and DNAT rules will be configured there, but the DNAT rules will be configured only for the unbound allowed address pair ports; the QR sits on the qrouter namespace. So this will give you an idea of how it's currently designed and how the traffic will flow. Again, this next one is a unique case, basically just a test case; it has been designed and allowed right now just for the sake of DevStack, because if you have a DevStack single-node installation, and you want to do both network node and Nova on a single node, then it's an all-in-one node where you can have the floating IP namespace as well as the SNAT namespace, everything all in one. So this is a pretty easy all-in-one node where you can actually test your
test cases. For the future plans for scheduling the FIP for bound and unbound ports: since we had this Octavia and VRRP issue, there were a lot of concerns in the community about how we are going to do this one, because all of this happens without letting the customers or the administrators know where this FIP is being configured. When an unbound port comes in, the admins may think DVR is always supposed to distribute it, so wherever my VM is, I would expect my FIP to reside on that VM's host, but in this case we are actually moving it to the centralized node. So what we did, in order to give flexibility later down the line, and it's not implemented yet, is file an RFE, because there was also a requirement from someone asking whether the floating IP placement could be made configurable, so that you can have floating IPs either on the centralized node or distributed. This RFE deals with a configuration option where you can configure floating IPs either on the centralized node or distributed, or it can dynamically switch back and forth: if there is no valid agent available on the compute host, it goes ahead and implements it on the centralized node. So that is the RFE; if you have any questions on it, you can go and reply to the RFE bug that we have filed. We will be working on this one either in the Pike cycle, if we have some time, otherwise it will be done in the Queens cycle. So that's all we had to share, and if you have any questions, feel free to ask us.

Hello, thank you for the presentation. I have a couple of questions. First one: this DVR functionality and everything you've been telling us right now, is it already in place in any release of OpenStack?
Yeah, the DVR functionality was introduced in Kilo, so we are almost two years in now.

And you don't have to do anything to explicitly enable it? It's just there and it's just working?

I think you need to enable the agent mode, that's it. When you start the agent, you need to start the L3 agent on all the compute hosts, because in the case of legacy routers you only have L3 agents running on the network node, but in the case of DVR you need the L3 agent running on every node. And the L3 agent has two different modes of operation: one is the dvr_snat mode and the other is the dvr mode. If it is dvr mode, it's a compute-only mode, and there won't be SNAT configured on that one; but if it is dvr_snat, then it allows SNAT to occur.

Okay, and one other thing I'm missing about this data flow: in the case of SNAT, for example, we have the private IP and MAC address the same on each host, and the public IP and MAC address, I don't know if that's the same or different on each compute?

For SNAT we are still centralized, we have not distributed SNAT; we only have the floating IPs distributed. But in the case of floating IPs, what happens is each and every compute host will consume one public IP address for the FG port that I showed.

For DNAT or SNAT?

For the DNAT.

Yes, floating IP, that's pretty clear and straightforward. So for SNAT you are saying all the traffic would still go through the centralized node. Now what about the east-west routable traffic? It's still going to be a default gateway, wouldn't it be?
Yes, there's a default gateway; we use the same IP and same MAC. But what happens is, when the traffic actually goes out of the host, we hide the source MAC of the router; we don't send the source MAC of the router out on the wire. Before it hits the tunnel, we have an OpenFlow rule that swaps the source MAC, because each host has a specific DVR MAC that we assign, so that MAC is substituted when the traffic goes out, and when it comes in, we strip that MAC out and swap it back.

It's routed locally, so you consider it asymmetric: on outbound it's switched on the local compute node, and on the reply it's switched on the remote compute node.

So even though the IP and MAC are the same on all the nodes, we don't expose those IP and MAC outside.

Okay, got it, but how would... So now we have the IP address of the default gateway and the MAC address the same, and you end up with... and the layer two is connected between all the nodes, all the compute hosts. So how come the layer two bridging does not go crazy because of having the same MAC address in all the different places?

No, when you say different places, we don't even expose it beyond the host; it's within the br-int.

Right, yeah, there's a flow rule to rewrite it.

After the packet crosses the br-int, that MAC is not sent out of the br-int. It's only exposed on the integration bridge. As long as it is within the integration bridge, it knows that it is within this host; it doesn't go outside.

Thank you.

Okay, you're welcome.

The floating IP consumption on the compute nodes: is it already configured when it consumes that, or is it when some vrouter namespace is spawned on the compute node that it consumes it?

When you say consumes, what do you mean? Like when it is spawned, it's already... The DVR router?
Yeah, oh, okay. So what we used to have is: the floating IP namespace is created for the first VM that pops up on the host, the first DVR serviceable port, either a compute VM or a load balancer service on that host. Whenever the first VM pops up, if that VM's private IP has been configured with a floating IP, it is at that time that we go ahead and create the floating IP namespace. And the floating IP namespace is specific to an external network; each external network will have one floating IP namespace per host.

Already configured?

Yeah, not already configured. When you configure a floating IP and you say, okay, go ahead and update my router, create a floating IP and associate a port, it is at that time the server notifies the agent, saying there is a VM on your host, and your VM's IP address has now been configured with a floating IP, go ahead and create all the rules for that. And then the agent that's running on the compute host will go and create the FIP namespace and all this plumbing. But once the FIP namespace is created, for the second VM that comes in, if you create a floating IP, it doesn't recreate the floating IP namespace, because the namespace is already there; all it does is create the rules, that's it.

So it consumes only one IP?
It consumes one public IP, and we also have a workaround on that.

Yeah, that was a problem, because if it was a true globally routed IPv4 address, people didn't like us using them. So in the Newton cycle we actually added a feature where you can have two subnets on the external network, and you can tag one as being the one you use for your routers, because that data probably isn't going to leave the data center, and then the other one you can use for floating IPs. It's called subnet service types, and you can tag the subnet as usable by certain device owners. I know it's on the OpenStack docs page; there's an example on how to use it.

Thank you.

You're welcome. Any other questions?

So yesterday there was a discussion on how to distribute SNAT and avoid the dependency on the centralized network node. The approach that you described for unbound ports, is it adding... I'm just wondering if this is adding one more dependency on the centralized SNAT, or what happens when SNAT gets distributed, how does it work?

If SNAT gets distributed, we can actually completely get rid of the FIP namespace that we have today, and we can reuse the SNAT namespace for everything, even for the unbound case. So it makes our life easier. The reason we designed it initially to split the SNAT was basically that at the time, service function chaining was not there, so we were running VPN as a service on the network node. To support VPN as a service you need an entry point, an endpoint, for your cloud, so we thought that having the SNAT functionality reside on a centralized node always makes sense; that's why we left it there, centralized. But if you wanted to distribute it, then we don't have that constraint, and if we are ready to consume, or burn, IP addresses on a per-router basis, then it's easy for us to move the FIP into this model, because the FIP logic is already there: just turn on the flag and use it. But I think in that
case, it depends upon what your use cases are, because some people say, okay, I don't want to burn IP addresses, and we don't have a right solution yet without burning an IP address. But if you are ready to burn an IP address, then this can be moved into that model. The only issue is that with the same model we cannot get away with consuming one single IP, because the SNAT traffic will be shared. In the FIP namespace we are sharing an IP, but we prevent conflicts through connection tracking, so we have some kind of safety there; in the case of SNAT, I don't think we can share those things, so you need one IP per router.

Per router, right, even per... yeah.

So, thank you.

You're welcome. Any other questions? Thank you, guys, for attending. If there are no more questions, thank you, and we will both be on the IRC channel; if you have any questions, you can shoot us a message.