Well, I feel a bit strange because I woke up at 3:30 this morning to get here from the French countryside, so I really feel jet lagged. I hope you don't feel the same, and at least I'll be able to be on time. So please help me, because I tend to be a little bit chatty.

For those who do not know, Linaro is the Arm ecosystem collaboration. We have companies like Google, Facebook, Alibaba, Red Hat, Cisco, Ericsson, Huawei, Nokia, Cavium, NXP, TI, ST, and I'll stop there. Many of those companies are either building silicon or building systems on top of it, and we collaborate inside Linaro to produce things and to push technology into many different communities.

Out of the members there was a request for a technology to sustain time-sensitive networking. We tried to do it with the kernel, on 500 MHz to 1 GHz CPUs, and the result is that the latencies and jitter are not really consistent with what the applications want. When the latency is between 20 and 40 times what it should be, and the jitter is 200 to 600 times what it should be, then even with optimization it does not look reasonable to expect to divide those figures by a factor of 50 on average. So that's why we had to find a solution in user land.

Then why not ODP or DPDK to handle that? Well, in the automotive industry people have solved this problem for quite some time, and they may already have a TCP/IP stack in user land that accesses the hardware through proprietary methods. So why not have a standardized method that is valid for everyone and that is also consumable by DPDK, ODP or even VPP directly? For example, one use case for VPP would be direct veth access for container networking, with very high performance. So we wanted a very generic solution.

Now, why zero copy? If you look at a 100 gigabit adapter, that's 148 million packets per second, which is roughly 15 gigabytes per second, and that's roughly one memory channel. So if you have the ring descriptors along with the packets, plus what was just described as virtual descriptors for an abstraction on top of that, and then you copy the packets, you basically have all four memory channels of the CPU package fully used for that. Then VPP has zero memory bandwidth left to look up the routes, and if you have 1 million routes you really have a problem. So for the design we want to make sure that even at 100 gigabit we still preserve memory bandwidth for the real application.

Secure: in the past we've been using UIO or things like that, or VFIO no-IOMMU, and in the context of Spectre and Meltdown I would say we probably want to make sure that memory is a very well protected subsystem. So let's not do it any other way than through the IOMMU.

Userland network IO: network IO is not about building a device driver, it's about just getting the packet queues and the packets themselves. Let's not initialize the hardware. If you look at the code size of a Realtek adapter driver, it's 10,000 lines of code, and out of those you may have 8,000 lines which are about the different initialization procedures for this particular flavor or this particular hardware revision, and you don't want to replicate that whole thing in DPDK or VPP. So what you want is to let the kernel do what it's good at, driving the devices, but let user land capture just the data path.
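As a back-of-the-envelope check of that zero-copy argument, here is a rough calculation in C. It is a sketch only: the 148.8 Mpps figure is the theoretical 64-byte line rate of 100 GbE, while the 16-byte descriptor size and the per-channel DDR bandwidth mentioned in the comments are assumptions for illustration, not numbers from the talk.

```c
/* Rough check of the memory-bandwidth argument above; all sizes and
 * rates are ballpark assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    const double pps   = 148.8e6; /* 100 GbE line rate with 64-byte frames       */
    const double frame = 64.0;    /* bytes of packet data moved per frame        */
    const double desc  = 16.0;    /* assumed bytes per hardware ring descriptor  */

    double pkt_dma  = pps * frame;     /* device writes packet data to memory    */
    double ring_dma = pps * desc * 2;  /* descriptor write-back plus refill      */
    double virt     = pps * desc;      /* "virtual" descriptors of the abstraction */
    double copy     = pps * frame * 2; /* one packet copy costs a read and a write */

    double zero_copy_gbs = (pkt_dma + ring_dma + virt) / 1e9;
    double with_copy_gbs = zero_copy_gbs + copy / 1e9;

    /* One DDR4 channel sustains very roughly 15 GB/s, so the copying
     * variant alone consumes a large share of a four-channel budget,
     * leaving little bandwidth for route lookups. */
    printf("zero-copy RX traffic : %5.1f GB/s\n", zero_copy_gbs);
    printf("with one packet copy : %5.1f GB/s\n", with_copy_gbs);
    return 0;
}
```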
Then there is the dual-stack goal, which means that one port can be used by the kernel and by a user land application at the same time. The idea is that you have netdevs in the kernel that have rings and packets, and through some communication what we want is for the rings to be accessible, visible, in user land, and for the packets to be controlled from user land as well. But we also want that if you use the normal ifconfig or iproute2 commands, they can influence the user land application. And if the user land application wants to change the MTU, it will just use the equivalent of the netlink interface, or whatever ioctl you need, to change that. So those are the design goals.

One way to implement this was just described: use AF_XDP, and that's a very good solution. But there is another topic: if you go to really high speed on the network, what about storage? What about the other accelerators, for crypto, compression, pattern matching? If you have a solution for networking but no solution for the other aspects of the life cycle of communications, then you may just be solving part of the problem. So we think we need a solution that addresses IO in general from user land, in a very generic way, that can be applied to block storage, to crypto acceleration, to compression and so on. And it has to be able to handle all the IO models that you can find on earth.

Let's talk about the IO models a little bit. Here is a typical NIC: you have descriptors and you have packets. Usually everyone thinks of this model only. You have a bunch of fixed-size buffers, let's say two kilobytes each, and that represents the packet memory. Then you have the descriptors, and before being able to receive packets you have to state in each of those descriptors where the different buffers are. That's the typical model. In that model, on a two megabyte huge page you can squeeze 1,000 packets, whether they are 64-byte packets or 1,500-byte packets: only 1,000 packets.

Now there is another method, which is common in our ecosystem. You still have the descriptors, but you may have multiple packet arrays. This is not about queues, where you have RSS and you want to distribute the load between different queues; that's a different topic. This is about having, let's say, packet cells of 128 bytes, packet cells of 256 bytes and packet cells of 2 kilobytes. What this means is that you can have multiple packet arrays for a single flow, so if you're doing IPsec offload, in software or in hardware, for a single tunnel you can have multiple packet arrays that are handled by different CPUs or different hardware blocks. This, for example, is not yet supported in AF_XDP, but maybe, and that was the discussion on the mailing list, the model can be extended to support it; it's not fully known yet.

There is another model for packet reception, and that's the tape model. You have the descriptors, which are the only common thing in all this packet IO stuff, and then you have an unstructured memory area. There is no notion of a packet buffer at all: packets are just placed one after the other, or with some placement rules, but the hardware decides exactly where it wants to put them. In that case, for example, in a 2 megabyte area you can squeeze roughly 32,000 64-byte packets instead of just 1,000.
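To make the first, classic model concrete before moving on, here is a minimal sketch of a descriptor ring over fixed 2 KB buffers carved out of one 2 MiB huge page. The structure layout and field names are illustrative only, not the actual NetMdev or hardware definitions.

```c
/* Minimal sketch of the "typical" RX model: a descriptor ring whose
 * entries each point at one fixed-size buffer carved out of a 2 MiB
 * huge page.  Names and sizes are illustrative only. */
#include <stdint.h>
#include <stddef.h>

#define HUGE_PAGE  (2u * 1024 * 1024)       /* 2 MiB packet memory area      */
#define BUF_SIZE   2048u                    /* one fixed-size packet buffer  */
#define NUM_BUFS   (HUGE_PAGE / BUF_SIZE)   /* 1024 buffers: the "1,000"     */

struct rx_desc {
    uint64_t buf_iova;  /* IOVA of the buffer as seen by the device */
    uint16_t len;       /* filled in by the hardware on receive     */
    uint16_t flags;     /* e.g. a "descriptor done" bit             */
};

/* Before any packet can be received, every descriptor must be told
 * where its buffer lives: the "state where the buffers are" step
 * described above. */
static void fill_ring(struct rx_desc *ring, uint64_t pkt_area_iova)
{
    for (size_t i = 0; i < NUM_BUFS; i++) {
        ring[i].buf_iova = pkt_area_iova + (uint64_t)i * BUF_SIZE;
        ring[i].len      = 0;
        ring[i].flags    = 0;
    }
}
```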
So why do those models exist? The multiple-packet-array model is essentially there to get parallelism on a single tunnel. The tape model is there to beat the DMA transaction bottleneck that we have on PCIe. On PCIe Gen3 x8 you have roughly 42 million DMA transactions per second, regardless of whether it's Arm, Intel or whatever; that's a PCIe limitation. Which means that if one packet equals one DMA transaction, your limit is 42 million packets per second. So if you have a 50 gig line card, which is in theory 74 million packets per second, and you use the classic model, you can't reach line rate. It's not because your software is not good; it's the fact that the DMA transaction limit caps the amount of IO you can do. That's why those models exist. For transmit we have the same stuff, and there are ways, again, to beat the DMA transaction limitation on the output side.

So that covers packets, but we need more than just networking, and we need to be able to support all the IO models. DMA-BUF has been around for quite some time, but it is really designed for large buffers, even larger than 4 kilobytes, not for 64-byte packets at 148 million packets per second. And if you use VFIO natively, you lose the netdev support in the kernel. So we thought that VFIO mdev, which was introduced by Intel to support virtual GPUs in QEMU, was the right underlying framework to support user land IO, not only for networking but also for other accelerators. And by the way, the VFIO mdev route is being investigated by Intel (Cunming Liang) and Red Hat on one side, and Huawei is working on leveraging the same VFIO mdev for crypto device support.

One thing we ended up doing is to separate the packets and the rings. We say that the rings are the entities that are managed by the kernel. Creating a ring, creating a queue, can be very complicated depending on the hardware, so we don't want the user land application to deal with that complexity. At the same time, when you transition from kernel to user land you don't want to end up with an undefined ring. You may have an empty ring, but you don't want an undefined ring. So the idea was to keep the life cycle of the rings inside the kernel and only bring the packet memory handling into user land.

And that's what happens over the life cycle of an application that uses NetMdev. First of all, we want to make sure that there is a limited, or zero, difference from a code perspective whether NetMdev is activated in the kernel or not. So if you have the Fortville driver, for example, and you load it, it will behave exactly as if the patches were not there. But if you set the global enable parameter, then there is a little bit of change in the code path. What the change does is essentially make sure that we don't create security issues: we make sure that the rings are page-size aligned, and a number of things like that.

Once the driver is loaded, we have to capture the queues in user land. So we do a set of configurations. Those configurations come directly from the Intel VFIO mdev framework that was built for virtual GPUs, but we use it to create virtual NICs. We ensure that the transition from kernel to user land happens in a very smooth way, and we leverage the very generic VFIO mdev framework to pass all the relevant structures needed to control the hardware: the doorbells and, if it's a PCI device, the different config spaces, and so on. When the application starts, it is again just using the standard VFIO mdev system calls. We have not defined that; it's already there in kernel 4.10.
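To give an idea of what "just using the stock VFIO code" looks like from user land, here is a rough sketch using the standard VFIO ioctls from <linux/vfio.h>. The group number and the mdev UUID are placeholders, and error handling and the group-viability check are omitted for brevity.

```c
/* Sketch of opening a VFIO mdev device from user land with the stock
 * VFIO UAPI.  Group number and UUID are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);   /* IOMMU container   */
    int group     = open("/dev/vfio/26", O_RDWR);     /* placeholder group */

    /* Attach the group to the container, then choose the IOMMU model. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* For an mdev, the device name passed here is its UUID. */
    int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
                       "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001");

    /* Regions are how the doorbells, the config space and, in this design,
     * the queue and packet areas are exposed to user land. */
    struct vfio_region_info region = { .argsz = sizeof(region), .index = 0 };
    ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &region);

    printf("region 0: size %llu bytes at offset %llu\n",
           (unsigned long long)region.size,
           (unsigned long long)region.offset);
    return 0;
}
```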
We just made sure we added some semantics into the framework about packet queues and packet arrays. And then user land actually does the allocation of memory, in a way that is consistent with the different IO models we saw. There is one more thing that has to be done in this context, since we're dealing directly with the hardware, and it is a little bit complex: DMA syncing. On some architectures DMA operations are always coherent, which means that you don't have to deal with cache invalidation or flushing. But most architectures are not like that, and you need to deal with it. So at some point you may have to do what the kernel does, but in a smarter way, so that we don't lose performance by doing DMA syncs on every packet. You may want to do it like VPP and do DMA syncs on a vector of packets, as in the sketch below.
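Here is that vector-oriented syncing idea as a rough sketch. The helper netmdev_dma_sync_for_cpu() is hypothetical, not an existing kernel or NetMdev API: on coherent systems it would be a no-op, elsewhere it would invalidate the CPU caches for the given range. It is stubbed out here so the sketch compiles.

```c
/* Hypothetical sketch of per-burst DMA syncing.  netmdev_dma_sync_for_cpu()
 * is an assumed helper, not an existing API. */
#include <stddef.h>
#include <stdint.h>

struct pkt {
    void     *data;
    uint32_t  len;
};

/* Assumed helper: make [addr, addr + len) visible to the CPU after device
 * DMA.  A no-op on coherent systems, a cache invalidate elsewhere. */
static void netmdev_dma_sync_for_cpu(void *addr, size_t len)
{
    (void)addr;
    (void)len;
}

static void process_rx_burst(struct pkt *vec, size_t n)
{
    /* Do the cache maintenance once, up front, for the whole vector,
     * and only for the bytes the forwarding code will actually read,
     * instead of a full-packet sync inside the per-packet fast path. */
    for (size_t i = 0; i < n; i++)
        netmdev_dma_sync_for_cpu(vec[i].data,
                                 vec[i].len < 128 ? vec[i].len : 128);

    /* ... parse headers and forward the n packets here ... */
}
```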
So what does this mean in terms of code? We tried it on a number of systems. That's the kernel driver code base, and that's the number of kernel and user land lines that were added to actually do packet IO. If you look at the DPDK PMD for the same hardware, we're closer to 30,000 lines, because it replicates the same stuff; here we're looking at something very reduced.

If we look at performance: on this 40 gig card, on the transmit side, we reached the DMA transactions per second rate. On this one, we don't know if it's a hardware limit or if it's the way we handle the hardware. For this Chelsio card, we beat the DMA transactions per second limit, because they have an IO model that allows it, and we know that we don't yet know how to properly drive the hardware. Talking with the Chelsio guys, we know that on the T6 50 gig card we will be able to receive and transmit at line rate; that's the goal for the next few weeks.

So what's next? I think I'm good on time. We haven't pushed an RFC yet, because we needed to make sure that it at least works; that was not clear. And we would really like to talk with the AF_XDP guys, so hopefully we'll be able to do that over a beer later today. We don't really want to say that this is the way to do things. We would like the community to find a way that is acceptable, so that we reach a certain performance level, a certain genericity of architecture support, and maybe also a goal a little bit beyond networking. So this is really about discussion.

Now, this one is a personal topic. Usually the kernel considers that the hardware is fine, that it drives it, and that's it. But with the management engine and all that stuff, and the fact that GPUs have programs running on them and network cards have programs running on them, you can't really trust the card not to do DMA to any area of memory. So I would like us to always consider that devices should be behind an IOMMU, so that whatever happens inside the device, in its firmware or its software, cannot attack the kernel from the outside.

There is something else, which is the coherent interconnects. You have CCIX, you have OpenCAPI, and Intel has one too, I don't know the name for it, and I don't know if I should. Anyway, those models will radically change how we build device drivers. For example, we always think in terms of DMA: we have a packet and we want to move that packet from the card to host memory. But with coherent interconnects you will just be able to pass a pointer and say: that's the packet. And if you want to access just the header, you will bear only the cost of the header; there will be no transfer of the whole packet from the adapter to host memory. Typically today we try to avoid scatter-gather lists and keep tailroom and headroom to be able to add tunnel headers, but with CCIX and coherent interconnects that may become totally useless. So I think this will change the way we do drivers. And Gen-Z multiplies the memory bandwidth by a factor of almost 6 to 8, so this will also change how we see memory. Because of that, I would like those ideas to be integrated into the user land frameworks. And I'm done. Any questions?

The question is: why do we create a bidirectional mapping? You mean the ring descriptors? I don't see bidirectional... sorry, in the middle, ah yes, bidirectional. Because if you do packet forwarding from port A to port B, when you first get the buffer you don't know yet how it will be used, and you want to be flexible in the way you will be syncing. So you can allocate the buffers bidirectional and then do the mapping at the latest moment. But what is really important is that the IOMMU domain, the IOVA space, has to span all the devices, so that one IOVA address is valid for any descriptor you will put it into. I don't know if that makes good English. The descriptors are different: as they are maintained by the kernel, they are not in a separate IOVA, or rather we can consider that they are in a separate IOVA on which we have a user land mapping. But we don't control the bus address of that, because the kernel decides it. And when we map those areas from user land, we never specify the DMA or bus address; we just specify a VFIO mdev region ID, which is then translated, so that we don't mess around with addresses from the kernel.

Okay, so we have no more time for questions, I'm afraid. Thank you.