Hi everyone. This being the networking track, we are going to continue the discussions around networking, going quite deep into the data path this time. We're going to look at the OVS data paths and the Contrail vRouter, how flexible they are, how we can make them faster, and specifically how we can extend them using the P4 or C languages, in other words how we can make them more flexible. To do that we'll delve into the traditional OpenStack networking options that are out there. We'll look at the older OVS without connection tracking and the newer OVS 2.5 that does have connection tracking support. One that you may not be familiar with is the Contrail vRouter. It's an alternative to Open vSwitch, except it uses more routing-like constructs: BGP, IP forwarding, and so on. We'll compare SR-IOV to virtio and similar solutions, focusing on the flexibility and performance of the data paths. Then we'll have a look at how we can make things faster using SmartNICs, specifically evaluating the flexibility and the performance achievable with them. To make things yet more flexible, we'll introduce the programming languages P4 and C, but also a new Linux kernel construct, eBPF, which adds flexibility to the Linux kernel and can also be offloaded to the SmartNIC. We'll look at various permutations and combinations of either building the entire data path with these or just injecting little pieces of code into existing data paths.

Let's look at the traditional OVS model to get started. Typically you would have a NIC that does not do any intelligent processing. This NIC feeds traffic into the Open vSwitch subsystem, which is composed of the kernel data path and the user-mode agent. A typical packet goes first to the kernel. The kernel notices that it doesn't know anything about this packet, about the flow to which the packet belongs, and sends it to user mode. User mode then pushes a flow entry down into the kernel data path, and all subsequent packets of that flow get forwarded more or less directly to the virtual machine.

OVS 2.5 is able to integrate with the Linux connection tracking feature, something you would normally encounter attached to nftables or iptables. OVS can now talk to it directly, giving an implementation of OpenStack security groups that is stateful and still offers high performance. Once that first packet has traversed the data paths, the idea with the kernel module is to forward traffic directly to the VM without involving user space anymore. In essence, the kernel is a fast path for the user space. The actual packets get sent to the VMs using virtio or one of the other host-to-guest communication channels.

The SR-IOV style of forwarding traffic to VMs involves really basic configuration driven from OpenStack, setting up things like MAC addresses and VLAN tags that are then used to pick which SR-IOV instance certain packets get sent to. Once that configuration has been completed, the x86 is not in the picture anymore, and SR-IOV-style NICs send the traffic directly to the VMs. In return for this very direct path, though, you end up with very limited intelligence: you can basically only use simple criteria like the MAC address or the VLAN tag to direct traffic to the virtual machines.
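To make that concrete, the per-VF configuration that the stack ends up applying is roughly equivalent to the following iproute2 commands; the interface name, MAC address, and VLAN ID here are made up for illustration, not taken from the talk:

```sh
# Hypothetical example: assign a MAC and a VLAN tag to virtual function 0 of a
# physical function. This is essentially all the steering intelligence that
# plain SR-IOV gives you for directing traffic to a VM.
ip link set dev enp3s0f0 vf 0 mac 52:54:00:aa:bb:cc
ip link set dev enp3s0f0 vf 0 vlan 100
```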
So to start peeling the onion further here: with OVS, you have a few options for your control plane. The flow entries that drive the forwarding of traffic, and the other policies associated with traffic, are sent by OpenStack to the compute nodes using the ML2 plugin. An SDN controller like OpenDaylight may or may not be involved; it's typically involved if you have some hardware switches as well, or some more complex setups. A newer facility that emerged recently is OVN, a layer on top of OVS where the OpenFlow protocol no longer goes over the whole data center network but stays local to the compute node, with the OVN protocol, based on OVSDB, replacing OpenFlow across the data center. Either way, the forwarding is typically done on layer 2 and layer 3 criteria; even though it's built on top of OpenFlow match-action tables, that's what's typically used. OpenFlow being there, there's fortunately some room for growth: if one wanted to, one could direct the traffic based on layer 4 criteria like TCP port, UDP port, et cetera, as well. In terms of overlay support, this solution supports a variety of tunnel protocols, and in some cases they can even be combined; for example, VXLAN can live inside VLAN, NSH can be inside VXLAN, et cetera. And for the security groups, as mentioned, from OVS 2.5 we have the conntrack interface for micro-segmentation.

vRouter is very similar in its feature set; it's only the details that differ. With vRouter you have a control driver: it's not using the ML2 mechanism, it's a full driver. L2/L3 forwarding is done using dedicated forwarding tables, and there's also a separate flow table that's used for the security part of it. It also supports tunneling, with a different set of tunneling protocols.

For SR-IOV, again, the ML2 plugin is used, and there is some vendor-specific code in libvirt to support different variants of SR-IOV-enabled cards. The forwarding intelligence is very limited, and there is no support whatsoever for tunneling or security if you use the SR-IOV method. If those are necessary, people typically do them in an external physical switch, which obviously has a lot of additional cost and complexity associated with it.

So let's now look at the accelerated approach. A SmartNIC, like the Netronome Agilio SmartNIC, is able to offload the entire OVS kernel data path. Just as with the unaccelerated version, the first packet of a flow might get sent up to user space. When the user space OVS agent determines what to do with the flow, it pushes a flow entry down into the OVS kernel module, which acts as a cache for user space and as a fast path. That kernel module pushes the flow entry down further into the SmartNIC. In addition, the SmartNIC has another level of caching to avoid even the OVS-kernel-style match-action lookups. Conntrack may also be involved on the host side; it is similarly offloaded, with the conntrack state being synchronized to the SmartNIC. Once the first packet of a flow has traversed all of this, the flow entries are set up in the SmartNIC, and the remaining packets of the flow can be processed autonomously on the SmartNIC. Using this mechanism, the vast majority of traffic can be offloaded to the SmartNIC and then actually sent directly to the virtual machines. So the traffic can go via SR-IOV to the virtual machines, despite the fact that there's virtual switching happening, because the actual virtual switch is pushed down into the NIC. This therefore gives you the combination of the features of SR-IOV and virtual switching; you don't need an external physical switch for this.
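To give a feel for the conntrack-based micro-segmentation mentioned above, here is a minimal sketch of what such stateful rules can look like in ovs-ofctl syntax; the bridge name, table numbers, and the "allow only inbound SSH" policy are purely illustrative, not the exact rules a Neutron firewall driver installs:

```sh
# Untracked IP traffic: run it through conntrack and recirculate to table 1.
ovs-ofctl add-flow br-int "table=0,priority=100,ip,ct_state=-trk,actions=ct(table=1)"
# New inbound SSH connections: commit the connection to conntrack and forward.
ovs-ofctl add-flow br-int "table=1,priority=100,ct_state=+trk+new,tcp,tp_dst=22,actions=ct(commit),NORMAL"
# Packets belonging to already established connections are allowed through.
ovs-ofctl add-flow br-int "table=1,priority=90,ct_state=+trk+est,ip,actions=NORMAL"
# Everything else in table 1 is dropped.
ovs-ofctl add-flow br-int "table=1,priority=10,actions=drop"
```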
Furthermore, there's the option of using virtio to feed traffic to VMs, and this is very useful when it comes to, for example, VM migration. vRouter is very similar; the details differ. It actually has a separate forwarding table, which is preemptively pushed down to the NIC. The flow table works similarly to OVS, where a miss causes packets to go to user space, which causes flow entries to be installed. And once it's all set up, again, the traffic can go more or less directly to the virtual machines, with virtio and SR-IOV both being supported.

Looking at the performance, if we compare the solution to the baseline OVS running in the kernel, we see a substantial improvement in throughput. OVS running on DPDK does have a higher throughput than kernel OVS, but the SmartNIC has a higher throughput still. However, look at the CPU usage: to achieve this result, the kernel OVS and the user space DPDK OVS were each given 12 cores, whereas OVS, when it's offloaded, only needs a fraction of a core; I just rounded it up to a core here. That's because only the first packet of a flow needs to consult the host; all the remaining traffic is handled autonomously by the SmartNIC. To visualize that a bit: in a networking-heavy case, you may actually lose one of the two sockets in a dual-socket machine just to network-related processing, whereas with a SmartNIC a fraction of a core is used, freeing up the majority of the system for application processing: VMs, containers, NFV applications, whatever they may be. If you would like to delve into this further and see what the savings could be for your use case, please access this link and do your own ROI calculation, or just Google for the ROI calculator at Netronome.

To summarize the traditional, unaccelerated approaches: SR-IOV gives us high throughput but very limited expressiveness; typically, as I said, MAC or VLAN matching, no stateful matching, no very deep insight into the packets, let alone tunnel termination or any other actions that modify the packets. With SR-IOV one also cannot migrate the VMs, so it becomes a manageability issue. OVS and the Contrail vRouter, unaccelerated, give you high expressiveness but much lower throughput, or they use a lot of the CPU cores. With the SmartNIC approach, we get the best of both worlds: we get the benefits of SR-IOV-based data delivery to the VMs while still doing the full virtual switching. Netronome also offers the Express Virtio feature, which gives you close to the performance of SR-IOV while still being compatible with the virtio drivers in the VMs. The accelerated OVS and vRouter give you the same expressiveness as the baseline vRouter and OVS at a much higher throughput, yet with lower CPU utilization. So if we talk about efficiency, which is the work done per server core, we actually have 50 times better efficiency, because we do five times the throughput at a tenth of the resources.

Let's say, however, that this is not enough for you: you want additional flexibility that's not offered by the match-action tables in these data paths. Fortunately, the SmartNICs have dynamically downloadable firmware, and this firmware can even change while the SmartNIC is running. By the way, there's a variety of port speeds available in these SmartNICs, from 10 to 100 gig. One way to exploit that is to use things like Contrail and OVS and simply offload them; they're then a drop-in replacement for the upstream versions.
You could get more flexibility, however, by using something called eBPF, which is effectively a little program, a little expression, in the Linux kernel. It is compiled, usually with a just-in-time compiler, to native code for the host processor, and it can then run on packets: it can do lookups, it can do packet modification, et cetera. So eBPF is a very exciting new development; another term you'll see associated with it is XDP. The SmartNICs can offload that: they can actually run eBPF programs and speed up the processing at that level.

Further programmability is supplied by using traditional programming languages. I'm sure you're familiar with the C language. The P4 language is a new language specific to the networking domain, allowing you to very easily and conveniently express packet parsing, matching, actions, and so forth. These can all be compiled into the firmware for the SmartNIC, and we can then combine the P4- or C-derived firmware with the OVS-, eBPF-, or Contrail-based firmware.

Since you may not be that familiar with P4, here's an example of a typical P4 program. You can see how easily one can declare header fields; this just declares Ethernet, and IP would not be much longer. It also describes the parsing: one starts by parsing Ethernet, extracts the Ethernet header, and in this case jumps to the ingress part, which is on the right-hand side of the slide, and which proceeds to do table lookups. It applies the ingress table; this table matches on the ingress port and can invoke certain actions. So by putting appropriate entries into the table, you can for example say port 1 must go to port 2 and port 2 must go back to port 1, and build a wire that way. What you see here is a NIC implemented in one slide of code; now try and do that in regular programming languages. You can very easily declare other protocol headers and include them in matching, and you can do exact match, range match, prefix match, et cetera.

The net of it is that we get a lot of additional flexibility by using P4, whereas Open vSwitch implements the OpenFlow standard, which has a predefined set of protocols that it can work with. OpenFlow has roughly a 40-tuple, so about 40 different protocol fields are defined. That's already a lot, but if you want to add more, you need to either do a proprietary extension or work with a standards body to extend the standard. The standard also defines the match-action behavior and the overall control flow. For example, in OpenFlow you get a bunch of flow tables in the pipeline, and then at the end you might have a group table to do things like load balancing; if you want to do the load balancing in the middle, well, you need to wait for a new version of the standard.

In the case of P4, we actually start with a blank slate. We effectively have a switch that has nothing in it, and in the language we can declare where we want to do the parsing, that's the yellow block, in other words which data plane protocols we want to support. We can declare the green match-action tables, the queuing and scheduling, et cetera, and permute them in any order. That does require a new concept, though: a data path program that needs to be written, compiled with a front-end and a back-end compiler, and downloaded into your switch. So for something like OpenStack to be based on this, this would be a new construct that would need to be supported in OpenStack. These programs would need to be managed, sent to the compute nodes and network nodes, compiled, upgraded, et cetera.
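As a rough reconstruction of that "NIC in one slide" example described a moment ago, here is a minimal P4_14-style sketch; the header, table, and action names are illustrative rather than the exact code from the slide:

```p4
// Sketch of the one-slide P4 program: parse Ethernet, then forward based on
// the ingress port. Names like in_table and fwd are assumptions, not the slide's.
header_type ethernet_t {
    fields {
        dstAddr   : 48;
        srcAddr   : 48;
        etherType : 16;
    }
}
header ethernet_t ethernet;

parser start {
    extract(ethernet);          // parse the Ethernet header
    return ingress;             // hand the packet to the ingress control block
}

action fwd(port) {
    // Send the packet out of the given port.
    modify_field(standard_metadata.egress_spec, port);
}

table in_table {
    reads   { standard_metadata.ingress_port : exact; }
    actions { fwd; }
}

control ingress {
    apply(in_table);            // entries such as: port 1 -> port 2, port 2 -> port 1
}
```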
But in return we have a lot of flexibility, by being able to express in the language exactly what we want. Some things you may want to do are telemetry to debug the network or do advanced monitoring, supporting new tunnel protocols, maybe adding some security features, et cetera. Instead of using regular OpenFlow from an OpenFlow controller plugged in beneath OpenStack, you would probably have some form of central controller managing the networking in the data center, but it would need to be an extended version of OpenFlow, because you would need support for dynamically definable protocol fields. It's to be determined whether one would also use this controller to download the programs, or whether you would send the programs to the nodes in some other way.

So let's have a look at how we can use P4, or other languages like C, to extend these data paths. One simple way is using a plug-in concept; we also call that a sandbox concept. You use an existing data path to do your first level of matching and then decide, using one of these rules, called flows in OpenFlow, to send the traffic to the sandbox. You may, for example, want to implement a custom tunnel protocol: you can match on UDP port 1234, and if it's that UDP port, send the packet to the red plug-in for detunneling, or maybe send all traffic to some block for statistics gathering. These data path extensions could be written in the P4 language that we just saw; they could also be implemented in C or eBPF. We could also have them both on the host and on the SmartNIC, with the SmartNIC again offloading the host.

What do we gain by this? We have quite a bit more flexibility in the action area; it's easy to filter or modify packets. It's a bit harder, though, if we want to use these plug-ins to do classification. Think a little bit about how you would do that: you would need a table lookup, then feed the packet to, say, that P4 block for additional constraints to be evaluated. And what if the additional constraint is not satisfied? Now it needs to go back to the original table lookup, get the next-highest-priority match, and feed the packet again to some C code or P4 code for further refinement. So that's kind of hard. But the scenario I described, where you handle the outer protocol with something like OVS and the inner protocol with one of these new mechanisms, is much more feasible.

In terms of how to integrate all of this into OpenStack: if we model the additional functionality as an additional port, then we can use an existing SDN controller like OpenDaylight and the existing OpenStack infrastructure; we just need to feed the traffic not to a physical port but to a logical port representing that functionality. If we elect to make it a custom action instead, we would need to change the OpenFlow action list to be able to carry these new action types.
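To give a feel for what such a custom-tunnel plug-in could look like, here is a small P4_14-style parser fragment for a hypothetical tunnel carried over UDP port 1234; the header layout and the names (my_tunnel, parse_inner_ethernet, the surrounding udp header) are invented for illustration and assumed to be defined elsewhere in the program:

```p4
// Hypothetical custom tunnel header, carried over UDP destination port 1234.
header_type my_tunnel_t {
    fields {
        vni      : 24;    // virtual network identifier
        reserved : 8;
    }
}
header my_tunnel_t my_tunnel;

// Parser fragment: after extracting UDP, branch on the destination port and
// strip the custom tunnel header before parsing the encapsulated frame.
parser parse_udp {
    extract(udp);
    return select(latest.dstPort) {
        1234    : parse_my_tunnel;
        default : ingress;
    }
}

parser parse_my_tunnel {
    extract(my_tunnel);
    return parse_inner_ethernet;   // continue with the inner Ethernet frame
}
```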
What, though, if we want to replace the entire SmartNIC data path with a data path written in these languages? In this case we'd write the matching in P4, because that's very convenient; for the actions maybe we want to throw in some C code, though P4 can also express some actions. And now we offload the OVS host data path similarly to what we did before. It becomes quite easy to program the SmartNIC using these languages to do the offloading, and you could easily implement custom behaviors there, but since we did not touch the OVS C code on the host, we are limited to what OVS supports.

Integration will be very minimal effort if it's just doing what OVS does today, but if we are enhancing OVS and adding support for new protocols or new behaviors, then we will want to create that OpenFlow-plus-plus that we saw in the other diagram.

What, though, if we start to replace some of the blue OVS code with the burgundy P4 code? In this case, more substantial changes are needed to OVS, but in return we get all the features of P4, where we can more easily add new data plane protocols, new behaviors, et cetera. There is unfortunately some fine print here, because in theory it's very easy to implement all these new things in P4; in practice, though, if you look at the details, OpenFlow matching is in fact more flexible than P4. With OpenFlow you can at any time replace any table match field or add a field; say you have a table matching Ethernet, you can suddenly start matching on IP or TCP as well. You can at any time introduce a new action as a consequence of a table match, whereas in P4 all these things need to be pre-declared. To solve that issue, one can regenerate the program on demand: once you see a new pattern of behavior being introduced, you recreate a P4 program to correspond to it. Alternatively, and a little more practical, you can pre-define the maximum flexibility you need and then just return errors if incompatible requests are sent from the cloud management system to the compute nodes. This is a considerable effort, but it's already been done for the OVS host part: there's the PISCES project by Muhammad Shahbaz, and you can get more details at that link. There's also an independent P4 implementation for the SmartNICs that one can get and play with. They just haven't been connected together yet. Once that infrastructure is in place, it will hopefully be much easier to add new data plane protocols; they're talking about an order of magnitude less code to add a new data plane protocol.

What, though, if we get rid of all the last remnants of OVS and also replace the agent in user space with a new agent? Now we have full flexibility. We don't have to use any of the existing things like the OpenFlow protocol, the OVN protocol, et cetera. We can basically make it a remote procedure call from the cloud management system over the network into the compute node, or define a new protocol to push the policies down to the compute nodes. However, we now run into the issues that I mentioned: who will download the programs? Is it OpenStack? Is it another system? How do we pick which node to run a VM on? Maybe a certain program needs certain acceleration hardware to function properly, or needs some other resource that varies between the compute nodes. So now we need to modify the Nova scheduler to pick the right place to run this kind of networking infrastructure program. A discussion around acceleration started at the Austin OpenStack Summit; I think it's still in its infancy, but at least it's being worked on, to try and grapple with these issues and define the right constructs to support them. So in this case we get a lot of flexibility, it's easy to implement new behavior, and a completely different control plane becomes possible. But conversely, a completely new control plane is required; we're starting from scratch with developing all of this, and therefore it's a lot of work initially, but eventually it has a lot of benefits.
Just one note to consider here: people are not only using OpenDaylight and OpenFlow for the virtual switches; they also use them to configure physical switches, maybe a gateway or a top-of-rack switch. So if we do go and replace everything with a new system, we will need a network protocol to cope with those older devices, the physical boxes, as well.

So to summarize: there are many ways to just plug SmartNICs into existing data planes, whether they be OVS, vRouter, or eBPF, and you will see a significant speed-up of the data path, higher throughput, lower latency, et cetera, plus a significant offload that frees up the server cores for application processing. One can program the SmartNICs in P4 or C or a combination. There is work for us to do to evolve the standards to make all of this flexibility possible. Plain OVS integration into OpenStack, and integrating accelerated OVS into OpenStack, that's underway; there are still some steps needed. But when it comes to having P4 embedded in OVS, or replacing OVS, a lot of work is needed there. Please, for example, join the discussion on a runtime API; this is currently happening in the Open Networking Foundation, but we're thinking of moving it to another body and actually having a cross-body effort to host this discussion. In general, though, all these efforts are very much worthwhile, to have increased flexibility in the network while having higher performance and freeing up the server resources. So I think we have a few minutes for questions, if there are any.

There's been little said about latency. Did you perform any tests regarding all the configurations? We expect that latency will be lower, but...

Yes, we did do some tests of latency, and we actually found a tenfold reduction in latency in many cases. Host latency can vary considerably, because the moment the Linux scheduler is involved, it can cause long spikes in latency; if you pin things to a core, you can reduce that a little bit. But the fast path through the SmartNIC is just tens of microseconds of latency, so it's a huge benefit. In fact, sometimes people deploy SR-IOV to get the higher throughput, but sometimes they deploy it just to get the lower latency. This way, you can do the virtual switching and still have your low latency and high throughput.

Maybe you touched on that and I missed it, but in terms of hardware, what is different in a SmartNIC compared to a traditional NIC?

I didn't actually spend enough time on that. The SmartNICs that we were referring to here are heavily multithreaded, many-core processors. They have 72 to 120 cores on them, with eight threads per core, so that gives you a huge amount of parallelism and a lot of cycles available to do all the processing tasks that are necessary. They also do run-to-completion processing of packets, so nothing is fixed in terms of what to do per packet: you get a packet in, you go through steps like parsing and match-actions, and you can repeat them, more parsing of an inner header, detunneling, QoS, et cetera; it's all up to software. Furthermore, these SmartNICs actually have DRAM on them, gigabytes worth, and that DRAM is used to store lookup tables or state tables to do things like the connection tracking. So yes, there's a lot of capability in them; that's why they're called SmartNICs and not just somewhat intelligent NICs, and that's really what makes all of this possible.
One of the things about OpenStack networking is that when you want to go from the VM out to the outside world, you typically have to use the network node for doing the routing, and of course the way to avoid that is to use the DVR system. So have you thought about implementing DVR with the SmartNICs, so you would do forwarding directly from the VM to the outside world?

Yes, I think there's enough capability to basically do whatever is necessary just by pushing the right policies down to the compute nodes. There is also a question of capacity: how many table entries can you support, and how quickly can you update them? If things move around, you need to also migrate the table entries that go along with the virtual machines. It's hard for me to comment on specifics without going into more specifics of your question, but in general it's possible; the capability is there, and it's just up to the cloud management system to deploy the policies appropriately to get it done.

So can we say this SmartNIC is built from CPUs? Is a CPU actually processing the packets in that hardware?

I'm not fully understanding your question.

I'm just wondering how you implement the SmartNIC. You mentioned a CPU before. Is it kind of the main core inside the SmartNIC?

Yes. So normally Open vSwitch or the vRouter would run on the x86; it would have a user space component talking with the cloud management system and a kernel component focusing more on packet forwarding. What we then do is write software that's equivalent to that kernel data path, doing the same kinds of matching and the same kinds of actions, with the tables being synchronized, so it also matches against the same tables. In terms of the physical implementation, if that's what you're asking, it's basically a many-core system with many threads; it looks like a RISC system with many, many different cores, and that just runs the software.

I think we're out of time, so I'll need to let you go, but please grab me during a break if you have any further questions. Thank you.