Hi, my name is Raja Sivaramakrishnan, and I'm from Juniper Networks. Today I'll talk about how we've integrated Contrail, which is our network virtualization product, with a SmartNIC from Netronome. This is a joint talk with Chris Telfer, who's from Netronome.

Contrail allows you to create virtual networks on top of the physical network infrastructure, so you can provision new tenants and new applications without having to configure VLANs or security policies on the physical switches. The way we do that is by using overlay tunnels. One of our goals was to not invent any new protocols or encapsulations; we wanted to use standard protocols and existing encapsulations, so we support MPLS over GRE, MPLS over UDP, and VXLAN. Another goal was that there should be no performance degradation as a result of adding the overlay. So I'll give a quick overview of where we started from, and then hand it over to Chris for the newest work.

Initially we had the vRouter, the virtual router, which is one of the modules in the Contrail solution. The vRouter used to live as a kernel module inside the host kernel, and we had some user-space components: the Nova agent from OpenStack and a Contrail user-space agent. We had to make some changes to improve the performance, because when we first started off it was quite poor, actually. We made changes to libvirt, for example, to allow it to work with the vhost-net process in the Linux kernel, because that was initially only supported with Linux bridge and OVS. We also enabled RPS, receive packet steering, inside the vRouter module, which allows vRouter to use multiple CPU cores. And we enabled GRO, generic receive offload, so we could coalesce multiple packets arriving on the wire into a single large packet before sending it to the VM. On the transmit side we enabled segmentation offloads; because most NICs are not able to do that with MPLS, we had to do it in software.

Once we did that, we were able to get line rate on a 10 gig link. But for packet processing applications in NFV scenarios, the more important consideration is how many packets per second you can process, and what we found is that with the kernel module we could only do about half a million packets per second between two VMs running on different servers. That was our motivation for integrating vRouter with DPDK.

This is the same solution, but instead of vRouter running inside the kernel, it now runs in user space, integrated with the DPDK library. The VM's memory is now mapped into the address space of the vRouter process, and we use the vhost-user functionality inside QEMU, which is pretty much standard now. With this, we can handle many more packets per second if the VM is also running a DPDK application. If the VM is running a non-DPDK application, we still have the same issues with VM exits, and because of that the packets-per-second number is pretty small.

So, the numbers we see today: we're able to get line rate on a 10 gig link, and with a 2x10 gig LAG we get about 17 gigabits per second. With the kernel vRouter we only get about half a million packets per second, but with the DPDK vRouter we get upwards of 12 million packets per second. That comes at the expense of CPU, though: it takes us six to eight CPU cores to achieve that kind of performance.
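To make the overlay encapsulation concrete, here is a minimal C sketch of the MPLS-over-UDP header push that a vRouter-like data path would perform on transmit. It is illustrative only: the struct and function names (udp_hdr, mpls_shim, push_mpls_over_udp) are assumptions, and the real vRouter also builds the outer IP and Ethernet headers, picks the UDP source port from a hash of the inner flow, and handles checksums and label lookup.

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>   /* htons/htonl */

    /* MPLS-over-UDP shim: 20-bit label, 3-bit traffic class, bottom-of-stack
     * bit, and TTL packed into 4 bytes, carried as the UDP payload. */
    struct udp_hdr {
        uint16_t sport;   /* often a hash of the inner flow, for ECMP in the fabric */
        uint16_t dport;   /* 6635 is the IANA-assigned port for MPLS-in-UDP (RFC 7510) */
        uint16_t len;
        uint16_t csum;
    };

    uint32_t mpls_shim(uint32_t label, uint8_t tc, uint8_t bos, uint8_t ttl)
    {
        return htonl((label & 0xFFFFF) << 12 | (tc & 0x7) << 9 | (bos & 0x1) << 8 | ttl);
    }

    /* Prepend UDP + MPLS in front of the inner packet already stored at buf[0].
     * buf must have room for inner_len plus 12 extra bytes; returns the new length. */
    size_t push_mpls_over_udp(uint8_t *buf, size_t inner_len,
                              uint16_t sport, uint32_t label, uint8_t ttl)
    {
        size_t hdr_len = sizeof(struct udp_hdr) + 4;
        struct udp_hdr udp = {
            .sport = htons(sport),
            .dport = htons(6635),
            .len   = htons((uint16_t)(hdr_len + inner_len)),
            .csum  = 0,    /* a zero UDP checksum is commonly used for IPv4 tunnels */
        };
        uint32_t shim = mpls_shim(label, 0, 1, ttl);   /* single label, bottom of stack */

        memmove(buf + hdr_len, buf, inner_len);        /* shift inner packet to make headroom */
        memcpy(buf, &udp, sizeof(udp));
        memcpy(buf + sizeof(udp), &shim, sizeof(shim));
        return hdr_len + inner_len;
    }

One reason MPLS-in-UDP is attractive for overlays like this is that the outer UDP source port can carry a flow hash, which lets the physical fabric spread overlay traffic across ECMP paths.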
What would be really nice is if we could get the performance of, say, an SR-IOV interface together with the flexibility that an overlay provides, which is where the integration with the SmartNIC comes in. In our upcoming release we've moved the vRouter into the SmartNIC, so all the overlay processing happens on the SmartNIC, and with that we get much better numbers. I'll hand it over to Chris Telfer from Netronome to talk about that. Thank you.

So I am Chris Telfer from Netronome Systems. We make a network processor, which I've shown in a diagram here on the right-hand side. That network processor has a lot of offload facilities for handling the embarrassingly parallel task that is processing packets, and we put it on a series of cards that we call the Agilio line, so that you can have a programmable data path in data centers.

When we first engaged with Juniper, we set out to accomplish the following. First of all, we wanted to offload the vRouter data path onto the NFP, the network processor embedded in the NIC, and improve the overall packet processing and switching of the system. But we also wanted to maintain feature parity with the existing Contrail system, and make sure that the existing interface to Contrail looked exactly the same. Ideally, we wanted people to be able to use our hardware within their data center and not know that anything out of the ordinary was happening, other than the network being faster. And finally, because OpenContrail is a software-defined networking product, we wanted to make sure that this code base is something we could iterate on: as Contrail evolves to support new features, new protocols, whatever is needed, our code base is also able to evolve, and we can iterate with them.

Our design approach centers around a few key points. First of all, we worked with Juniper to upstream a hardware offload API, so that as the vRouter data plane gets configurations from the host and from the controller, we are able to mirror those configurations down to the SmartNIC transparently. We didn't have to change anything in the format of the configuration messages that the vRouter was being given; we just receive a function call and say, oh, OK, they're adding a route, so we'll add a route; they're adding a flow, so we'll add a flow, and so on. And if the Contrail vRouter data plane doesn't have any hardware acceleration, that call is basically a no-op, and it doesn't even notice whether anything is being offloaded or not.

The next thing, of course, is that we wanted to mimic the most crucial parts of the data plane down in the NFP processor, as I already said. And to improve the overall performance, in our architecture the NFP delivers traffic directly to, and receives it directly from, the virtual machines in the system. We do this in two ways. If the virtual machine supports our upstreamed SR-IOV drivers, then you can do direct PCIe assignment, so we can DMA traffic directly in and out of the VM's address space. For other VMs that are using traditional virtio drivers, we have a thin proxy that we call XVIO, shown in this diagram as the virtio relay daemon, which copies descriptors between our SR-IOV driver and the native virtio ABI.
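The hardware offload API described here is, in essence, a set of hooks the vRouter data plane invokes whenever it programs state. The following C sketch shows the shape of that pattern under assumed names (offload_ops, vr_flow_entry, offload_flow_add, and so on); it is not the actual upstreamed API.

    #include <stddef.h>

    /* Hypothetical stand-ins for vRouter's flow and route structures. */
    struct vr_flow_entry;
    struct vr_route_req;

    /* Hooks the data plane calls whenever it programs state. A SmartNIC driver
     * registers its own implementations; with no hardware present the pointer
     * stays NULL and every call is a no-op. */
    struct offload_ops {
        int (*flow_add)(const struct vr_flow_entry *fe);
        int (*flow_del)(const struct vr_flow_entry *fe);
        int (*route_add)(const struct vr_route_req *rt);
        int (*route_del)(const struct vr_route_req *rt);
    };

    static const struct offload_ops *offload;

    /* Called by the SmartNIC driver at initialization time. */
    void offload_register(const struct offload_ops *ops)
    {
        offload = ops;
    }

    /* Called from the normal flow-programming path: transparently mirrors the
     * new flow down to the NIC, or does nothing if no driver has registered. */
    int offload_flow_add(const struct vr_flow_entry *fe)
    {
        if (offload && offload->flow_add)
            return offload->flow_add(fe);
        return 0;   /* no hardware: silently succeed */
    }

The property the speaker highlights is visible in offload_flow_add: when no SmartNIC driver has registered hooks, the call degenerates to a no-op, so the software-only data plane behaves exactly as before.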
And finally, the last big piece of this puzzle that we had to put into place is that we wanted Contrail, as I said, to not really have to know whether or not the special hardware was there. So we created what we call representative interfaces, which are netdevs in the Linux kernel, that make the vRouter data plane think it is still talking directly to the interfaces of the virtual machines and the physical hardware. In fact, those representative interfaces are just proxies for sending the traffic via the NFP.

To show you briefly how this all comes together, the first packet of a flow follows a path something like this. The first packet arrives at the NFP, and the NFP does a lookup on the packet and sees that there's no flow associated with it in the flow table. So it misses, and it lets the packet fall back to the regular vRouter data plane. The vRouter data plane thinks that packet just came in off the physical wire, does its own flow lookup, and also misses, so it says, OK, I need a new flow disposition for this packet, and lets it fall up to the Contrail agent on the host. The agent performs the appropriate access control and mirroring checks and so forth, and determines what should be done with it: should it be NATted, and so on. That decision gets programmed back down into the flow table in the vRouter code base. And, as I said, we have a hardware offload API, so when that happens we also transparently program that flow down into the NFP. Eventually the vRouter data plane says, OK, I'm going to release this packet, and sends it, in this case, out virtual interface 0. As I said, that goes through the representative interface down to the NFP card, which then delivers it to the appropriate virtual machine's interface.

OK, so far so good, but we haven't actually made anything go faster. The magic comes from then on: every other packet coming into the system gets switched and processed directly in the NFP data plane and goes directly into and out of the virtual machine from there, and now you're able to move traffic at quite high rates.

To give you a little sense of the results: with a 200-byte IMIX we were able to saturate a 20 gig link with no-drop performance, which is about 20 million packets per second, basically line rate when you consider the tunnel overheads involved. And at full blast we can deliver 27 million packets per second in and out of the system.

We hope to keep evolving this system as Contrail evolves, and we're going to be improving it to enhance container support. We're also going to look at the notion of, instead of delivering packets through VNFs on the host, maybe offloading the VNFs themselves to the SmartNIC and making the system even faster for VNF use cases. So thank you very much.
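As a recap of the first-packet path described in the talk, here is a minimal C sketch of the miss-and-offload flow. The lookup and programming primitives are left as declarations standing in for the NFP driver, the vRouter data plane, and the Contrail agent; all names and types are assumptions for illustration, not the real code.

    #include <stdbool.h>

    /* Illustrative, simplified types; the real structures live in vRouter. */
    struct packet;                       /* opaque packet buffer */
    struct flow_key    { unsigned int src_ip, dst_ip; unsigned short sport, dport; unsigned char proto; };
    struct flow_action { int nexthop; unsigned int flags; };

    /* Lookup and programming primitives for each layer, left as declarations. */
    bool nfp_flow_lookup(const struct flow_key *k, struct flow_action *act);
    void nfp_flow_program(const struct flow_key *k, const struct flow_action *act);
    bool vrouter_flow_lookup(const struct flow_key *k, struct flow_action *act);
    void vrouter_flow_program(const struct flow_key *k, const struct flow_action *act);
    void agent_decide(const struct flow_key *k, struct flow_action *act);
    void extract_key(const struct packet *pkt, struct flow_key *k);
    void apply_action(struct packet *pkt, const struct flow_action *act);

    /* First-packet path: miss in the NFP, miss in the vRouter data plane,
     * decision made by the Contrail agent, then the flow is programmed back
     * down both tables so later packets never leave the NIC. */
    void handle_packet(struct packet *pkt)
    {
        struct flow_key key;
        struct flow_action act;

        extract_key(pkt, &key);

        if (nfp_flow_lookup(&key, &act)) {       /* fast path: switched on the NIC */
            apply_action(pkt, &act);
            return;
        }

        /* Miss on the NFP: the packet falls back to the vRouter data plane. */
        if (!vrouter_flow_lookup(&key, &act)) {
            /* Miss again: punt to the agent for ACL, mirroring, and NAT checks. */
            agent_decide(&key, &act);
            vrouter_flow_program(&key, &act);    /* program the software flow table    */
            nfp_flow_program(&key, &act);        /* ...and, via the offload API, the NIC */
        }

        apply_action(pkt, &act);                 /* forward this first packet as decided */
    }

Only the first packet of a flow pays this slow-path cost; once both tables are programmed, every subsequent packet takes the nfp_flow_lookup hit path and stays on the SmartNIC.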