Hi, so this talk is a very short introduction to netmap and how you can use it to implement virtual network functions. So what is netmap? We have been talking about DPDK, PF_RING, XDP and so on; netmap is just yet another independent API for direct access to NIC transmit and receive functionality from user space. In this respect it is the same as DPDK, PF_RING and the others. The idea is that you have a NIC, you open it in netmap mode, and once you do that you can temporarily steal it from the network stack and drive it with a very batch-oriented, efficient API for fast networking applications. It is very important to know that netmap is implemented within the operating system kernel, differently from DPDK for instance, and we will see why this is important later. It is included in FreeBSD, and also available for Linux as an out-of-tree kernel module.

These are the design principles behind netmap. I think they are important because, to some extent, the very same principles are also behind things like DPDK, PF_RING and XDP.

The first and most important one is batched operation. Whenever you talk to the NIC, for instance to transmit or receive packets, you should tell it to handle many packets at once, because talking to the NIC is expensive. In general, when you do packet processing, whenever you have fixed costs like locking or lookups, try to take the lock once and process many packets under it: the fixed cost is then amortized over many operations.

The second principle is preallocation of packet buffers. In essence, try to avoid dynamic allocation of packet metadata: in Linux, for instance, an sk_buff metadata structure typically has to be allocated and deallocated for each packet you process.

Third, zero-copy access to packet buffers. The idea is that your application should be able to read and write packets directly in its own address space, and the NIC can DMA packets directly into the application address space, so you don't need the traditional copy across the user/kernel boundary.

Fourth, the kernel provides protection. It means that your application cannot crash the NIC and cannot crash the system: for instance, the application has no direct access to NIC registers and hardware rings. This is very important, because all the protection and isolation you need is provided directly by the kernel.

And last, you must have the possibility to use standard synchronization. Frameworks like DPDK rely on busy waiting, even if there are other options, but with both netmap and AF_XDP you are able to use standard synchronization means like the poll and select system calls: you can sleep waiting for packets to come, or waiting for egress space; more on this later.

The data structures used by netmap are very simple. There is a netmap interface, which is just a bunch of pointers to netmap rings. A ring is an abstract representation of a hardware queue, so you can have one or more receive rings and one or more transmit rings for each netmap interface. A ring is just a circular array of descriptors with producer and consumer indexes. And all of these data structures are contained in a so-called netmap allocator.
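To make this concrete, here is a minimal sketch of how an application typically opens a port in netmap mode and reaches these data structures, using the helpers from net/netmap_user.h; the interface name eth0 is just an example.

```c
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <stdio.h>

int main(void)
{
    /* open eth0 in netmap mode; this detaches the NIC from the host
     * stack and mmap()s the shared memory region (the allocator) */
    struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
    if (d == NULL) {
        perror("nm_open");
        return 1;
    }

    /* d->nifp is the netmap interface: a descriptor holding pointers
     * to the transmit and receive rings we were granted */
    struct netmap_if *nifp = d->nifp;
    printf("%u tx rings, %u rx rings\n", nifp->ni_tx_rings, nifp->ni_rx_rings);

    /* each ring is a circular array of slots with head/cur/tail indexes */
    struct netmap_ring *rxring = NETMAP_RXRING(nifp, d->first_rx_ring);
    printf("rx ring: %u slots\n", rxring->num_slots);

    nm_close(d);
    return 0;
}
```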
So the idea is that you may have multiple NIC ports on your machine, and a single netmap allocator may serve more than one port. The allocator is a domain of trust, meaning that applications working on the same allocator must trust each other; if you don't want that trust, just use separate allocators. And the basic idea is that, in order to access those data structures, so basically rings and buffers, you open a special device and then use an mmap operation to make them available in your application's address space.

A netmap ring, as I was saying, is an abstraction of the real hardware ring. What happens is that applications operate on the abstract ring, and then use a special sync operation to sync the state of the abstract ring with the state of the hardware ring. There are two pointers in the abstract ring, head and tail, and the meaning of these pointers is that everything between head and tail is owned by the application. For receive rings, those are new packets that are ready to be read; for transmit rings, it is free space that you can use for new egress operations. The rest of the descriptors in the ring, so everything between tail and head, is owned by netmap, that is, by the kernel.

This is an example of how you would process a receive ring. Say that your application has many descriptors available, many new packets that it can read. It can, for instance, process seven new packets and then increment the head index accordingly, while tail is read-only for the application. After incrementing the index, it can sync: there is a special ioctl to sync a receive ring. This has two effects. First, everything between the previous position of head and the new position of head is returned to the kernel, to the system, for reuse, so that it can be used for receiving more packets. Second, if any new packets arrived since the last sync, tail is incremented accordingly; in this example we received three new packets. So this is a very simple synchronization between your application and the NIC.

Okay, a very important thing that was mentioned before is blocking versus busy waiting. The sync operations for both receive and transmit rings are synchronous and non-blocking, and they operate on all the rings that are bound to a specific netmap file descriptor. The basic idea is that you open a NIC in netmap mode binding certain rings, so you can bind just the receive rings, or just the transmit rings, or everything, whatever you like, and the sync operations then operate on all the rings you bound. You can use the sync operations to implement busy waiting, if you don't want to block. But this is actually not the usual way to use netmap, because you may want to block, for instance waiting for more packets to come, or waiting for more space to transmit. For that you can use the poll or select system calls, or even epoll on Linux and kevent on FreeBSD; this is supported. If, for instance, you want to wait for more ingress packets, you poll for the POLLIN event. So this is just standard synchronization, very similar to what you would do if you were using sockets, right?

So far, I've talked about NICs, so hardware. But netmap actually supports many kinds of virtual ports, and virtual ports are important because they can be used to implement very fast local IPC.
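Before moving on to virtual ports, here is a minimal sketch of the receive-ring discipline just described: block with poll() until packets arrive, consume everything the kernel handed us between head and tail, and return the slots by advancing head. process_packet() is a placeholder for application logic.

```c
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>

static void process_packet(const char *buf, unsigned len)
{
    (void)buf; (void)len;   /* placeholder for real application logic */
}

static void rx_loop(struct nm_desc *d)
{
    struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
    struct netmap_ring *ring = NETMAP_RXRING(d->nifp, d->first_rx_ring);

    for (;;) {
        /* sleep until new packets are available; the poll also
         * syncs the abstract rings with the hardware rings */
        poll(&pfd, 1, -1);

        while (!nm_ring_empty(ring)) {
            struct netmap_slot *slot = &ring->slot[ring->head];
            const char *buf = NETMAP_BUF(ring, slot->buf_idx);

            process_packet(buf, slot->len);

            /* advancing head returns this slot to the kernel;
             * tail is read-only for the application */
            ring->head = ring->cur = nm_ring_next(ring, ring->head);
        }
    }
}
```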
So for instance, we have zero-copy pipes. The idea is very similar to UNIX pipes: you have two ends, and you can let two processes communicate through the pipe. But the point is that you use netmap pipes through the netmap API, so you are able to transmit and receive packets in batches, which means you can be very, very efficient. And it is zero-copy, because you can just swap the descriptors, as I will show later. This means that, independently of packet size, you can have very fast communication, over 100 million packets per second. Of course, this is a benchmark assuming you are not touching the packets, but it is still an interesting upper bound.

We also have a software switch, VALE, which is designed for virtual machines. By definition you want isolation between two virtual machines, which means that the switch must copy packets when forwarding from one port to the other. And again, because of the netmap API and the ability to work in batches, it is able to forward about 20 million packets per second per core per port.

We also support monitor ports for sniffing. Sometimes you have a network application using some ports and you want to see what's happening from a separate process, you want to sniff the traffic; you can do that with a special monitor port.

Today, I'm going to talk mostly about the last kind, which is the passthrough port. The idea is that you have a netmap port in the host machine: it can be a NIC, it can be a port of the VALE software switch, it can be a pipe end, whatever you like, and you want to export this port into a virtual machine. This is the basic idea of netmap virtualization, where you want to run your netmap application within a virtual machine, and it is possible with netmap passthrough.

Okay, there are two main use cases, with a KVM guest, a KVM virtual machine. If you pass through a port of the VALE software switch, this is very interesting for implementing very fast local inter-VM networking: think of two VMs on your machine that are able to exchange up to 20 million packets per second at minimum packet size. That's pretty impressive if you want to implement some sort of fast packet processing application on your machine. Or you can pass through a hardware port. In this case it is a sort of direct assignment, which you could of course also implement using standard PCI passthrough techniques, but it may still be interesting because you get direct assignment without IOMMU support and without actual support for PCI passthrough in the hypervisor; it's just a different way to do the same thing.

From the point of view of the guest, the guest operating system sees a virtual NIC, and the virtual NIC has the very same configuration as the underlying netmap port: if you pass through a hardware NIC with eight receive rings, you will see a virtual NIC with eight receive rings within the virtual machine. And again, there is no overhead in terms of copying, because the guest has direct access to the buffers and the rings of the real port, so you get basically zero-copy from within the virtual machine; the sync commands are simply forwarded to the host.
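All of these port types are opened through the same API, just with different names. A small illustrative sketch follows; the specific names (eth0, vale0:p1, the pipe ids) are made up for the example, and the exact naming syntax may vary across netmap versions.

```c
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>

void open_port_examples(void)
{
    /* a physical NIC, temporarily stolen from the host stack */
    struct nm_desc *nic = nm_open("netmap:eth0", NULL, 0, NULL);

    /* port p1 of the VALE software switch vale0; switch and port
     * are created on demand when first opened */
    struct nm_desc *sw = nm_open("vale0:p1", NULL, 0, NULL);

    /* the two ends of zero-copy pipe number 1 attached to port "p":
     * '{' denotes the master end, '}' the slave end */
    struct nm_desc *pipe_m = nm_open("netmap:p{1", NULL, 0, NULL);
    struct nm_desc *pipe_s = nm_open("netmap:p}1", NULL, 0, NULL);

    (void)nic; (void)sw; (void)pipe_m; (void)pipe_s;
}
```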
This is an example of an application that you may implement with this system. It's a very simple two-port application: we have an external port, think of it as a public-facing port on some network, and an internal port. You want to forward packets from the receive rings of the external port to the transmit rings of the internal port, and the other way around. And when going from external to internal, you also want to apply some rules: depending, say, on the destination IP or destination port, you may want to drop some packets and let others through, while in the other direction you don't filter.

How do you do that in a few lines of code? This is the main synchronization logic. First, we open the two ports, the internal port and the external port, using a very simple library. Then we have a simple poll-based loop. Since we have two ports, we have two file descriptors, one per port. What we need to do in this simple forwarding application is decide which events we want to wait for. The logic is very simple; take the external port, for instance. If there are no packets ready to be received on the external port, what I want to do is poll for POLLIN on its file descriptor, to wait for them. Otherwise, I do have packets, and since I want to forward them to the other port, I just wait for egress space there: that's why I poll for POLLOUT on the second port. And the logic is mirrored for the other direction: if there are no packets ready to be received on the internal port, I wait for them; otherwise, I wait for egress space on the opposite port. Then I call the poll function, and when poll returns, it means that some events are ready, so I can forward, in both directions, right? From external to internal, and the other way around.

This is the function that implements zero-copy forwarding, and it's interesting. What I'm doing here is just a parallel scan of two rings: I have the receive ring of the source port and the transmit ring of the destination port, and I want to forward a bunch of packets from the receive ring to the transmit ring. So I do a parallel scan. And the nice thing is that with netmap you can implement this with zero copy. Each descriptor within the ring contains a buffer index, and the buffer index is the identifier of a buffer within the allocator. So all you need to do to forward from one ring to the other is swap the buffer indexes of the two descriptors, and this is what I do here, adjusting the length of course. I also need to tell netmap that the buffer has changed, because it may need to update the DMA mapping inside the kernel. But otherwise, what I wanted to show here is that you can implement a simple forwarding rule in a very elegant and simple way.

This is the example you can run: I have a QEMU virtual machine and two pipes. I could have implemented a different example with other port types, but pipes are easy because they are all in software. So I have two netmap pipes, and I pass through one end of each pipe to the virtual machine, so the virtual machine, which is a QEMU virtual machine, sees two passthrough netmap ports. The other ends of the pipes are used for packet generation: on one end I generate a stream of packets, and on the other end I receive it. And with short packets, so 64 bytes, I measured about 17 to 20 million packets per second, which is pretty impressive when you consider that this application is implemented with just one thread.
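The forwarding core just described can be sketched as follows, under some simplifying assumptions: single-ring ports, no filtering (a filter rule would be applied right before the buffer swap), and the nm_desc handles returned by nm_open as shown earlier.

```c
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>

/* zero-copy forwarding: parallel scan of the source receive ring and
 * the destination transmit ring, swapping buffer indexes instead of
 * copying payloads */
static void fwd_ring(struct netmap_ring *rx, struct netmap_ring *tx)
{
    while (!nm_ring_empty(rx) && !nm_ring_empty(tx)) {
        struct netmap_slot *rs = &rx->slot[rx->head];
        struct netmap_slot *ts = &tx->slot[tx->head];
        uint32_t idx = ts->buf_idx;

        /* a filtering rule would be applied here, before the swap */
        ts->buf_idx = rs->buf_idx;
        rs->buf_idx = idx;
        ts->len = rs->len;

        /* tell netmap the buffers changed, so it can fix the DMA
         * mappings in the kernel if needed */
        ts->flags |= NS_BUF_CHANGED;
        rs->flags |= NS_BUF_CHANGED;

        rx->head = rx->cur = nm_ring_next(rx, rx->head);
        tx->head = tx->cur = nm_ring_next(tx, tx->head);
    }
}

/* event selection and forwarding loop for the two ports */
static void fwd_loop(struct nm_desc *ext, struct nm_desc *in)
{
    struct netmap_ring *ext_rx = NETMAP_RXRING(ext->nifp, ext->first_rx_ring);
    struct netmap_ring *ext_tx = NETMAP_TXRING(ext->nifp, ext->first_tx_ring);
    struct netmap_ring *in_rx  = NETMAP_RXRING(in->nifp, in->first_rx_ring);
    struct netmap_ring *in_tx  = NETMAP_TXRING(in->nifp, in->first_tx_ring);
    struct pollfd pfd[2] = { { .fd = ext->fd }, { .fd = in->fd } };

    for (;;) {
        /* no input pending? wait for packets; otherwise wait for
         * egress space on the opposite port */
        pfd[0].events = pfd[1].events = 0;
        if (nm_ring_empty(ext_rx))
            pfd[0].events |= POLLIN;
        else
            pfd[1].events |= POLLOUT;
        if (nm_ring_empty(in_rx))
            pfd[1].events |= POLLIN;
        else
            pfd[0].events |= POLLOUT;

        poll(pfd, 2, -1);        /* blocks; also syncs the bound rings */

        fwd_ring(ext_rx, in_tx); /* external -> internal */
        fwd_ring(in_rx, ext_tx); /* internal -> external */
    }
}
```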
With full MTU-sized packets, so 1500 bytes, I get about eight million packets per second. I actually tried both zero-copy and copy, and it's interesting to see that in the copy case, for very short packets, the overhead of changing the DMA mapping, which the zero-copy swap requires, is actually higher than the cost of the copy. So with short packets, copying actually makes sense. And this is the same thing Luke was saying before: with short packets, once you have something in cache, the CPU can copy and process it very cheaply.

Okay, I also prepared a very short comparison between netmap and DPDK. There is no time to go through all of it, but I wanted to show a few comparison items. One advantage of netmap, and actually also of XDP, is that it is very easy to set up. With DPDK, you need to care about huge pages, you need to care about the IOMMU, you need to unbind interfaces from the kernel driver and bind them to the DPDK driver. With XDP and netmap, you have to do almost nothing; actually, with XDP you need a very small eBPF program to redirect your packets to an AF_XDP socket. Another advantage you get by reusing the kernel drivers is that you can keep using the standard iproute2 and ethtool tools, while with DPDK, of course, you need to rewrite such tools. Also, the threading model is a little bit more flexible. With DPDK you have lcores, so you write your code and it runs in the context of a DPDK callback on a given core, while with netmap and AF_XDP you can open sockets or netmap ports wherever you wish and run your packet processing code in any thread. Another advantage of netmap over DPDK, shared with XDP, is that you get standard synchronization tools, so poll, epoll and select. Actually, with DPDK you can use receive interrupts, but that's a bit harder than just using the standard system calls. Of course, DPDK is about extreme performance. What is very clear from this comparison is that if you want the best performance, you must use DPDK, because with netmap and AF_XDP we are still using system calls: that has an overhead, and it comes with advantages in terms of improved isolation and standard synchronization. If all you want is performance, you should use DPDK.

Okay, conclusions. I showed you a very simple example of how to write a simple but efficient netmap application. I think the design principles behind netmap are important: they inspired XDP, and in the comparison it is very evident that many choices taken by netmap and XDP are similar. Why would you want to use netmap? The biggest advantages, I think, are that it is easy to set up, it has a very simple API with standard synchronization, and, although it is a smaller project than the others, it is easy to integrate with existing applications. With DPDK, for instance, you usually need to write your application from scratch and make it fit within the DPDK framework; of course, in exchange you can get higher performance. If you want to reproduce this simple setup, you can just follow the tutorial link: there are detailed instructions on how to reproduce it, all the code, and how to get those numbers.
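As an aside on the XDP setup step mentioned in the comparison: the small eBPF program is typically just a redirect into an AF_XDP socket via an XSKMAP. A minimal sketch, assuming user space fills a map named xsks_map with one socket per receive queue (the map name and size are examples):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* map from receive-queue index to AF_XDP socket, filled by user space */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_redirect_xsk(struct xdp_md *ctx)
{
    __u32 q = ctx->rx_queue_index;

    /* redirect to the socket bound to this queue; on recent kernels
     * XDP_PASS in the flags acts as the fallback action when no
     * socket is attached */
    return bpf_redirect_map(&xsks_map, q, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```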
Thank you. I'm ready to take questions. Questions? Well, thank you so much. Thank you. The next presentation starts soon, so let me just use the opportunity to remind people: could you please leave Vincenzo feedback through the conference app website? There's a link at the end of the talk. And also, a reminder of the meet-up this evening, at the bar on the first floor, at 7 o'clock. If you have to leave, can you please exit from the back of the room? Are you working with the author or...?