Good morning. Welcome to our talk, Distributed Routing in Ironic Integrated OpenStack Cloud. I'm Rajiv Grover from Hewlett Packard Enterprise, and this is Vivek Narasimhan from Ericsson. Our third speaker, Maruti Kamat, couldn't make it to the presentation because of a personal emergency, so it will be Vivek and myself delivering this talk.

Just to start off, how many of you attended Jonathan Bryce's keynote address on Monday? He shared a quote which I would like you to read through, because it speaks to the material we're going to be covering today. I'll give you a moment to read.

A few summits ago, the Distributed Virtual Router, DVR, was contributed into Neutron. DVR brings great efficiency and scale to a virtualized workload. It disaggregates the routing function in such a manner that most of the routing is done within the compute node where the source VM resides, so it eliminates extra hops. It also scales as the compute nodes scale, and it gives you fault containment: a fault stays within the compute domain. If something happens, it doesn't take the whole network down; typically you are impacted on a per-compute-node basis, or even less.

That works out great for a workload that is virtualized, one that is mostly VMs. But when it comes to a workload that includes VMs as well as bare metals, there is a gap that needs to be bridged. Just to put things in perspective, bare metal servers are here to stay for some time because of a number of use cases. Some of the prominent ones are three-tier architectures where the third tier is a high-end database server. There can be policy, security, or compliance requirements that necessitate having part of your workload on a bare metal server. Then there are situations where specialized hardware is involved, such as offload, rendering, or high-compute scenarios. Sometimes it's as simple as a legacy application that scales up nicely but isn't architected to scale out, so bare metal is involved.

OpenStack has recognized this and has been actively working over the years to integrate bare metal servers into the cloud. Projects like Ironic have put quite a bit of work in there. In Kilo, L2 Gateway was contributed into Neutron to bring bare metal servers in on a VLAN. Continuing on that work, we want to fill this gap: have a workload that includes virtual as well as bare metal instances and still keep it on a distributed routing plane. So we are here to share an approach for how to go about doing that, and we'll walk through it.

To start off, we'll do a quick overview of the components that we build upon and then get into the nitty-gritty of the approach. In this presentation we'll be using these diagrams quite a bit, so just a quick word on the layout and a couple of terminologies. The rectangle on your left represents a compute node. The colored boxes inside the rectangle represent tenant VMs, and the color of a box shows which network it is on; for example, VM1 is on the red network and VM3 is on the green network. We use the term east-west for traffic that stays within the cloud, for example one VM talking to another VM. We use the term north-south for traffic that is exiting or entering the cloud; a VM accessing an external server would be the north, and an external client accessing a service hosted in the cloud would be the south. So let's start off with the DVR overview.
For east-west, I'm going to take an example here. In this case we have VM3, which wants to communicate with VM2. VM3 is on the green network and VM2 is on the red network, so obviously it requires routing. The way the traffic flows is that VM3 ARPs for its gateway, and there is a local router presence, that's the DVR architecture, which responds to these ARPs. So the packet from VM3 lands on the local router. The local router switches the network from green to red, looks up the MAC address of the destination VM, that's VM2, in its ARP tables, which are pre-populated, and forwards the packet out through the bridge. One small thing the bridge does is substitute a unique per-node MAC as the source MAC, just so that the underlay does not get confused about where the DVRs are. The frame is then sent out onto the underlay to the second compute node. When it reaches the integration bridge on the second compute node, the integration bridge does the reverse: it substitutes the gateway MAC back into the source MAC field and bridges the packet to VM2, and the communication is done. So that's the east-west path.

Now let's look at north-south. For north-south there are two use cases: one is the floating IP, and the second is SNAT. We'll start with the floating IP. In the floating IP case, here is VM3 wanting to talk to an external server through the external network. Again, it reaches its local router, that being the gateway. The router realizes there is a floating IP associated with this VM, so it NATs to the floating IP and forwards the traffic right out through the floating IP namespace to the external network.

Now let's take a quick look at the SNAT path. In this case there is no floating IP associated with the VM. This is typically use cases like upgrades, where all connections are initiated from inside the cloud to the outside. VM3 sends out its packet, again to the router. The router realizes there is no floating IP associated with it, so the traffic needs to go through the SNAT path. It redirects the traffic to the SNAT namespace, which resides on the network node. The packet reaches the SNAT namespace, the SNAT namespace does the source NAT and also adds the flow to connection tracking, and the traffic goes out to the external network. So that's the DVR overview.
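To tie the east-west and north-south flows together, here is a minimal, self-contained sketch of the decisions just described. It is illustrative only: the real Neutron DVR is built from router namespaces, iptables rules, and Open vSwitch flows rather than Python, and every MAC, IP, and table entry below is invented for the example.

```python
# Illustrative sketch of the DVR flows described above; not the Neutron implementation.
DVR_HOST_MAC = "fa:16:3f:aa:bb:01"      # unique per-host MAC used on the underlay
GATEWAY_MAC = {"red": "fa:16:3e:00:00:01", "green": "fa:16:3e:00:00:02"}
ARP_TABLE = {("red", "10.0.1.5"): "fa:16:3e:11:22:33"}   # pre-populated by Neutron
FLOATING_IP = {"vm3-port": None}        # no floating IP, so traffic takes the SNAT path


def route_east_west(frame, dst_network):
    """Local router: switch networks, resolve the destination MAC from the
    pre-populated ARP table, and hide the gateway MAC behind the per-host DVR MAC."""
    return dict(frame, network=dst_network,
                dst_mac=ARP_TABLE[(dst_network, frame["dst_ip"])],
                src_mac=DVR_HOST_MAC)


def receive_on_destination_node(frame):
    """Destination integration bridge: restore the gateway MAC before bridging to the VM."""
    return dict(frame, src_mac=GATEWAY_MAC[frame["network"]])


def route_north_south(packet, port_id):
    """Floating IP traffic is NATed and sent out locally; everything else is
    redirected to the centralized SNAT namespace on the network node."""
    fip = FLOATING_IP.get(port_id)
    if fip:
        return dict(packet, src_ip=fip, path="local-fip-namespace")
    return dict(packet, path="snat-namespace-on-network-node")


if __name__ == "__main__":
    frame = {"src_ip": "10.0.2.7", "dst_ip": "10.0.1.5", "network": "green",
             "src_mac": "fa:16:3e:44:55:66", "dst_mac": GATEWAY_MAC["green"]}
    print(receive_on_destination_node(route_east_west(frame, "red")))
    print(route_north_south({"src_ip": "10.0.2.7", "dst_ip": "198.51.100.9"}, "vm3-port"))
```

The per-host MAC substitution is what keeps the underlay's forwarding tables stable even though the same gateway MAC exists on every compute node.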
I'm now going to hand off to Vivek to cover some of the other components and start off on our proposal.

Thank you, Rajiv. So I'm going to move to some of the components that we'll be using in the solution. Before we get to them, I just want to give a brief overview of what OpenStack Ironic is, because the distributed routing solution we are talking about in this presentation is primarily about enabling distributed routing to work with Ironic-enabled clouds. Ironic is an OpenStack project. It provides the ability to provision bare metal servers as opposed to virtual machines, so it gives you a lifecycle via OpenStack to do end-to-end bare metal provisioning and management. This project was integrated into OpenStack from the Kilo release; it became official in Kilo. When it initially became an official project, it supported only flat networks: basically, you could spin up bare metal servers on flat networks, have the virtual machines also running on flat networks, and then make them communicate.

More recently, as part of the engagement between the Ironic, Neutron, and Nova teams, we managed to enhance Ironic to work on VLAN-based networks. As you know, flat networks do not give you multi-tenant isolation, whereas VLAN-based, VXLAN-based, or GRE-based networks bring in that multi-tenancy isolation. So tenant isolation for Ironic use cases was embarked upon and completed during the Mitaka release, and the support was given only for VLAN-based networks. So that is a brief overview of Ironic.

Now I'm going to give a quick picture of how an Ironic-managed bare metal sitting on a VLAN network is enabled to communicate with a virtual instance running on a hypervisor. In this use case, let's assume that VM1 is already spun up and running on the red network, and that network is a VLAN network. Now say the customer tries to bring up BM1, which is a bare metal instance, using Ironic. When he brings it up with a nova boot command, giving the bare metal flavor, Nova's Ironic virtualization driver comes into play. That driver talks to Neutron and tells Neutron which switch, and which port on that switch, the bare metal is attached to. Neutron takes this information, which comes in as part of the create-port API, and goes and provisions the top-of-rack switch to which the bare metal is connected with the VLAN that is the segmentation ID of the network the bare metal is being spun up on; it's the same network VM1 is on. So this now enables VM1 to talk to BM1. This is the VLAN network isolation use case, and it is this primary use case that was pursued and completed in the Mitaka release.

Now I'm going to talk about the L2 Gateway. I need to talk about this because, while we looked at VLAN network isolation on the last slide, by virtue of the lifecycle that we made happen across Ironic, Nova, and Neutron as part of the Mitaka release, we were also able to enable VMs that are on a VXLAN-based network to talk to bare metals that are on VLAN-based networks. Before we see how the virtual instance can communicate with the bare metal instance even though they are using different underlays, I want to give a brief overview of what the Neutron L2 Gateway is. The Neutron L2 Gateway is an entity that allows you to bridge two segments, and by enabling such bridging it provides the semantics of a single L2 broadcast domain; primarily, it tries to retain the semantics of a Neutron network. The segments being bridged by an L2 Gateway can be Neutron orchestrated, which means they are Neutron networks, or they can be VLAN segments that existed before a customer moved to the cloud, basically enterprise VLANs the customer was using before transitioning to running his workloads on a cloud. So the segments that are bridged can be Neutron orchestrated, or they can be segments that are not known to Neutron.
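Relating back to the VLAN isolation use case for a moment, the information that Nova's Ironic driver hands to Neutron arrives on the Neutron port, roughly in the shape sketched below. This is only an approximation for orientation: the switch identifiers and UUID are placeholders, and the exact contract is defined by the Ironic and Neutron multi-tenant networking work, not by this snippet.

```python
# Hypothetical sketch of a bare metal port-create body (all values are made up).
# A payload like this would be POSTed to Neutron's /v2.0/ports API.
import json

port_body = {
    "port": {
        "network_id": "UUID-OF-THE-RED-VLAN-NETWORK",
        "binding:vnic_type": "baremetal",          # marks this as a bare metal port
        "binding:profile": {
            "local_link_information": [{
                "switch_id": "aa:bb:cc:dd:ee:ff",  # identifies the top-of-rack switch
                "port_id": "Ethernet1/10",         # physical port the NIC is cabled to
                "switch_info": "tor-rack-3",
            }]
        },
    }
}

print(json.dumps(port_body, indent=2))
```

A VLAN-capable ML2 mechanism driver can then use the local_link_information to tag the right top-of-rack port with the network's segmentation ID, which is the provisioning step described above.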
Then there is the concept of a multi-segment network in Neutron. A multi-segment network is a mechanism through which you can compose multiple segments into a single Neutron network. Typically, when we create networks in Neutron, you can create VLAN networks and you can create VXLAN networks. Similarly, there is the multi-segment network, where you can provide multiple segments as part of a single network. We will be using that, on the next slide, to enable a VM to communicate with a bare metal when the two are on different underlays. The typical deployment use case you see for L2 Gateway today in customer environments is to bridge traffic generated by VMs on VXLAN or GRE segments to bare metal servers running on VLAN segments. L2 Gateway, as a service, was made available from the Kilo release of OpenStack.

So, to go through it: as I said on the last slide, we will be using the multi-segment network mechanism. The network will have one VLAN segment, which the Ironic ML2 drivers and Neutron will use to plumb the networking for the bare metal that's coming up, and that same multi-segment network will have one VXLAN segment, which is the segment the VMs and their ports will be bound to. So virtual instances continue to transmit and receive packets on the VXLAN segment, bare metal instances transmit and receive packets on the VLAN segment, and both of those segments are part of a single multi-segment Neutron network. We then use the L2 Gateway to provide the bridging between the VXLAN segment and the VLAN segment of that single multi-segment network transparently, which means the customer does not have to go and create an L2 Gateway, put these two segments in it, and enable the traffic to go through the L2 Gateway device. This is taken care of transparently by Neutron, by virtue of it figuring out that we have a multi-segment network where some bare metal ports have been created by Ironic, and we also have certain virtual machines spun up on the VXLAN segment of that same Neutron network on the virtual cloud side.

So if you look here, in this use case, when VM1, which is on the VXLAN segment of the multi-segment network, tries to transmit a packet, it sends the packet on a tunnel towards the L2 Gateway. The L2 Gateway receives the traffic, decapsulates the tunnel headers, creates a VLAN frame, and sends that VLAN frame to the bare metal. The primary idea is that the ML2 drivers of Neutron configure the red circle shown on the L2 Gateway: they configure that particular interface with the segmentation ID of the VLAN segment of the multi-segment network on which VM1 resides. Similarly, as part of the VM1 spin-up, the usual Open vSwitch mechanism driver available in Neutron takes care of making sure that VM1 is put onto the VXLAN segment of the same multi-segment network. So the key here is that a single multi-segment network is used to spin up a bare metal and also to spin up a virtual machine, and we use the L2 Gateway device as the bridging entity that provides the data path semantics of a single L2 broadcast domain, the semantics of a Neutron network.
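To make those pieces more concrete, here is a sketch of the API payloads involved, assuming the networking-l2gw extension is deployed. The UUIDs, physical network name, and segmentation IDs are placeholders, and, as noted above, the intent of the proposal is for Neutron to set up the gateway connection transparently rather than having an operator compose these by hand.

```python
# Hypothetical payload sketches (placeholder values throughout).

# A multi-segment Neutron network carrying both underlays.
multi_segment_network = {
    "network": {
        "name": "red",
        "segments": [
            # overlay segment the virtual instances bind to
            {"provider:network_type": "vxlan",
             "provider:segmentation_id": 1001},
            # VLAN segment the Ironic bare metal ports bind to
            {"provider:network_type": "vlan",
             "provider:physical_network": "physnet1",
             "provider:segmentation_id": 200},
        ],
    }
}

# The gateway device and interface that networking-l2gw manages.
l2_gateway = {
    "l2_gateway": {
        "name": "tor-gateway",
        "devices": [{"device_name": "tor-rack-3",
                     "interfaces": [{"name": "Ethernet1/10"}]}],
    }
}

# Bridges the network's overlay side to VLAN 200 on that gateway device.
l2_gateway_connection = {
    "l2_gateway_connection": {
        "l2_gateway_id": "L2GW-UUID",
        "network_id": "RED-NETWORK-UUID",
        "segmentation_id": 200,
    }
}
```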
So now I'm moving into distributed routing. Everything we saw so far was to prepare the context: these are the elements we will use in the solution to extend distributed routing to Ironic-integrated clouds. We talked about how DVR works, how east-west and north-south work with plain DVR in a VXLAN-enabled cloud. Then we discussed what OpenStack Ironic is and what its role is. And then we saw what an L2 Gateway is and how we take advantage of it to enable traffic from the virtual cloud side to reach the bare metal network side.

So now we'll go into distributed routing and the solution itself. Our initial goal for the upcoming Neutron release is to extend the distributed routing concept to embrace Ironic-managed bare metal servers. That in turn has some sub-goals, as you see here. Basically, enable DVR on VLAN-based tenant networks for Ironic-managed bare metal servers; we saw that switching is already supported, both the VLAN network isolation switching and the VXLAN-to-VLAN switching, so here we will support distributed routing on VLAN-based tenant networks for Ironic-managed bare metal servers. Similarly, we'll also attempt to enable distributed routing for VXLAN-based networks for Ironic-managed bare metal servers, and while we do so we'll be using the L2 Gateway component to accomplish the traffic translation between VXLAN and VLAN. We will also attempt to retain the highly available distributed virtual routing that's in place in Neutron today as of the Mitaka release; there is a session scheduled after this that talks about it in detail. So we'll try to retain high availability for DVR even for Ironic-integrated cloud deployments. These are our initial goals as we embark on this implementation.

So let's see how east-west would work in our proposal. Take a simple use case where VM3, which is on network N1, the green network, would like to communicate with a bare metal, BM1, that's on a different network N2, which is a VLAN network. What typically happens is that VM3 transmits its frame to its default gateway on the green network, and that default gateway is the distributed router that sits right next to it on the same hypervisor. That router receives the traffic, routes the packet, and puts the frame onto the red network. The traffic is then directed towards the L2 Gateway device, and the L2 Gateway device, as usual, decapsulates the tunnel headers and passes the VLAN frame to the bare metal instance. This arrow shows the routing that happens when VM3 tries to talk to BM1, and that is how the data path flows for traffic initiated from the virtual cloud side to the bare metal network side.

Now we will look at the return traffic, where the initiator is on the bare metal network side and wants to talk to a virtual instance on the virtual cloud side. Here we have a bare metal, BM2, which wants to communicate with VM1. They are on different networks.
So what happens here is, and here I would like to introduce something new: you can see there is a red arrow with DVRL circled out. DVRL is a new namespace that we will be creating to provide data path connectivity from the bare metal network side to the virtual cloud side. The role of this DVRL is to provide routing for all traffic initiated from the bare metal side and then pass that routed traffic, as a switched frame, all the way to the destination on the virtual cloud side. Typically what happens is that when BM2 wants to talk to VM1, BM2 knows that VM1 is on a different subnet, so it ARPs for its default gateway. When it ARPs for the default gateway, the DVRL, shown with the red arrow, responds with its own MAC address. BM2 then sends its frame to DVRL. DVRL receives the frame and routes the traffic from the green network to the red network. The packet then goes out, encapsulated again in VXLAN on the red network, and reaches the destination, VM1, which is on the compute hypervisor. So this is the reverse path flow when traffic is initiated from the bare metal network side towards the virtual cloud side. Now I'm going to leave it to my friend Rajiv to continue and discuss how north-south is accomplished in an Ironic-integrated cloud.

Thank you, Vivek. Continuing to the north-south direction, we again assume the same use models of floating IPs and SNAT. As you've probably noticed as Vivek was going through the flows, they are very, very similar to how DVR works in a virtualized environment, and you will see that similarity even more in the case of north-south. Let's start with the SNAT path. Here we have a use case where the BM2 bare metal wants to access the external network. We think this will be the most prevalent use case, because this is for bare metals wanting to upgrade their software: the connection is initiated from the bare metals. Bare metals typically would not want their presence to be accessible from outside, but they would be the ones initiating the connection. So in this case, the bare metal traffic just goes out as a VLAN frame, as normal, to the L2 Gateway, and gets to its gateway in an east-west fashion; that gateway is the DVRL, because DVRL responds to all the ARPs for the gateways. Once the packet comes to DVRL, one way of visualizing this is that the DVRL sees all these BM ports as if those ports were resident within the same node. So it redirects the frame to the SNAT namespace. Once the frame gets to the SNAT namespace, the SNAT happens, and the traffic is sent out onto the external network. I see some people following this closely, so I'm going to give you a minute to absorb it. In the SNAT namespace there is the SNAT as well as the connection tracking, very much identical to how DVR behaves in a virtualized environment.

Now let me move on to the floating IP case. Here there is a floating IP assigned to the BM. Again, the BM reaches out to its gateway, which is DVRL in this case. DVRL, through iptables and IP rules, the same as exist today, determines that there is a floating IP associated with this traffic, so it NATs to the floating IP and sends it out through the floating IP namespace onto the external network. The reverse path is very similar for north-south. So that explains most of the proposal.
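Here is a small, self-contained sketch of the DVRL behavior just described: answering ARPs for the tenant gateways on the bare metal side, routing east-west traffic toward the overlay, and choosing between the floating IP and centralized SNAT paths for north-south. This is a conceptual illustration only; in the actual proposal DVRL would be a namespace with routes, iptables rules, and ARP responder entries, and all the addresses below are made up.

```python
# Conceptual sketch of the proposed DVRL decisions; not an implementation.
GATEWAY_IP = {"green": "10.0.2.1", "red": "10.0.1.1"}
DVRL_MAC = "fa:16:3f:dd:00:01"
FLOATING_IP = {"bm2-port": None}   # None means the traffic takes the central SNAT path


def answer_arp(requested_ip):
    """DVRL answers ARPs for the tenant gateway addresses on the bare metal VLAN side."""
    return DVRL_MAC if requested_ip in GATEWAY_IP.values() else None


def route_from_bare_metal(packet, dst_is_external, port_id):
    """Route BM-originated traffic toward the overlay (east-west) or out through
    the floating IP / centralized SNAT machinery (north-south)."""
    if not dst_is_external:
        # east-west: route onto the destination tenant network; the frame then
        # travels as VXLAN to the compute node hosting the destination VM
        return dict(packet, network="red", path="vxlan-to-compute-node")
    fip = FLOATING_IP.get(port_id)
    if fip:
        return dict(packet, src_ip=fip, path="fip-namespace")
    return dict(packet, path="snat-namespace")   # source NAT and conntrack happen there


if __name__ == "__main__":
    print(answer_arp("10.0.2.1"))
    print(route_from_bare_metal({"src_ip": "10.0.2.9", "dst_ip": "10.0.1.5"},
                                dst_is_external=False, port_id="bm2-port"))
    print(route_from_bare_metal({"src_ip": "10.0.2.9", "dst_ip": "203.0.113.7"},
                                dst_is_external=True, port_id="bm2-port"))
```

Because a single DVRL instance serves all the bare metal ports in a routing domain, the high availability and scheduling considerations discussed next apply to that instance rather than to each port.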
Now we wanted to share some of the design considerations that went into this; we had some design principles and some things we discussed and walked through. First, we wanted to preserve the use model and be consistent with what's available in the virtualized part of the workload, so preserve FIPs, SNAT, and services; we think they should work with this approach. Second, we keep the constraint of a no-touch model, that is, not requiring changes on the BM, not requiring installation of agents, helper modules, or any special configuration. Third was architectural compatibility: we want to stay within the OpenStack and Neutron frameworks, with the same software models being used here; it's a pretty small extension to the DVR logic that's going in. At the same time, we want to make sure that other features which already exist continue to work with it, as well as upcoming ones like address scopes; we don't think this proposal affects them in any way. We considered high availability, because if you notice, there is one instance of DVRL serving all the BMs within the same routing domain. We think the design pattern used for SNAT HA, as well as some other HA capabilities in Neutron, can very easily extend to this; we don't see any special needs as such. Then we get to scalability. In these diagrams we have mostly shown DVRL residing in a network service node, but from a design point of view we don't see that as a constraint: it can be on any node, as long as there's only one instance per router. That would eliminate bottlenecks for any kind of scalability concern. We are also thinking about an approach to scheduling, intelligently scheduling DVRLs so they are not all on a single node but spread across different nodes; that should help from an HA as well as a scalability perspective.

There were a couple of alternatives we were thinking about exploring. One alternative, which we see as a future option that may not be viable immediately, is to fold the DVR functionality into a hardware device, the way the L2 Gateway is hosted on a device. OpenFlow has some constructs that would fulfill that, but it still leaves a gap in terms of north-south. It is a future option that can be considered and worked through. The other option is to move away from the no-touch model and actually install software on the BMs. The rationale would be that bare metals are typically compute-intensive rather than network-intensive; that's the reason they are bare metals. So having a lightweight agent on a BM might be worth exploring, in terms of what it means and what it does. But that would be a future extension or an evolutionary path going forward.

So that's mostly our presentation. I think we are done. Thanks for attending, and we'll take questions at this time. We have two mics here; please, if you have questions.

Sure, I have three questions. Have you guys already completed the prototype development? Not yet. Have you already completed the prototyping; in other words, are you working on it? Yes, we have work going on to POC this. Wow. Yes, and that should be coming out sometime. And my next question is: I think it's better to place the L2 Gateway service onto the network node. What do you think about that?
Oh, you mean move the L2 Gateway functionality onto the network node? Yes. Otherwise the network frames go through at least two nodes. Right, we could reduce that. So let me address that in two ways. One is that this is an L2 Gateway-specific question; but coming back to our proposal first, let me answer it from there. You would see that the traffic traverses two nodes only when it is outbound from the BMs. The traffic that is inbound to the BMs goes directly to the L2 Gateway; it does not pass through DVRL. Yes, if you go through the slide. Yeah, but, yes, this picture: you drew this picture as if the L2 Gateway were a physical switch. Yes, it is a hardware switch. OK, then my third question was... And, to complete your first answer, the reason the L2 Gateway is in hardware and not on a network node is that it could be on a network node, but there's a translation required between VXLANs and VLANs. Yeah, in short, it is a logical entity. So that's why I thought you drew it like this, just as a convention, so the presentation would be understandable for the audience here. OK, yeah, definitely. The L2 Gateway represented here is actually hardware; there is no software L2 Gateway available in OpenStack yet. We were about to pursue that, but since most of the vendors were using their ToR switch to provide the L2 Gateway functionality, we decided to keep the software L2 Gateway initiative on hold. So I think that kind of network switch must provide a VTEP if customers want to use a VXLAN-backed network, in order to bridge the VXLAN-backed network and the VLAN-based one. Right. OK, and that is your assumption. Yeah, yeah. So basically, the L2 Gateway shown here will host the OVSDB hardware VTEP schema, which is a standard. By programming that schema you can create translation rules that let a VXLAN packet be transformed into a VLAN frame before it exits the L2 Gateway switch. So that's what we are doing. Yeah, thank you very much. Thank you.

I had a quick question about the next slide, the return path from the bare metal to the VM, I believe it was. Oh, this one? Yes, from the bare metal to the VM, right? OK. Yeah, this one. Is that the actual data path, or is that just the ARP kind of control plane? Yeah, this is the data path. This is the data path, OK. Thank you. Any other questions? Thank you for attending the presentation.
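As a closing illustration of the OVSDB hardware VTEP point raised in the Q&A, the rows below approximate the kind of state an L2 Gateway agent programs into a ToR switch's hardware_vtep database. The table and column names follow the published hardware_vtep schema, but the concrete values are invented for this example, and a real deployment would write them over the OVSDB protocol rather than as Python dictionaries.

```python
# Illustrative approximation of hardware_vtep rows (all values are made up).

logical_switch = {              # one per Neutron network being bridged
    "name": "red",
    "tunnel_key": 1001,         # the VXLAN VNI of the network's overlay segment
}

physical_port_binding = {       # maps the bare metal's access VLAN to the logical switch
    "port": "Ethernet1/10",
    "vlan_bindings": {200: "red"},
}

ucast_macs_remote = [           # where to tunnel frames destined for virtual instances
    {"MAC": "fa:16:3e:11:22:33", "logical_switch": "red",
     "locator": "192.0.2.11"},  # VTEP IP of the compute node hosting VM1
]

print(logical_switch, physical_port_binding, ucast_macs_remote)
```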