I'm not going to mangle his surname; he's going to talk about challenges in increasing the network stack performance.

Yes, thank you. This talk is also a little bit joint with Christoph. When we reach around slide 28 or so, we might stop there and take the rest together with Christoph's session, but we'll see how we are doing on time. How much time do we have? 25 minutes. Okay, I should have this timer here. Then let's get started.

So today I'm going to talk about, and make you understand, the challenge of doing 100 gigabit and the insane time budget involved in that. I have some measurements, and I'm going to give you something to relate this crazy time budget to. I'm going to talk a bit about some recent changes which have been accepted, where we do transmit bulking with the xmit_more API and qdisc dequeue bulking; we only did the dequeue part so far. Then there is future work on the receive path, more work on the qdisc path, and the memory management layer, which I'm going to talk about together with Christoph, who is giving the next talk after lunch, I guess. We'll see whether I get to the information about the memory allocator limitations in these slides. I implemented what I call a qmempool, which is a lock-free bulk alloc and free scheme, just to show that we can do something faster than the slab allocator.

So 100 gigabit NICs are actually starting to arrive, and they challenge the network stack quite a lot. As we increase the rate, the time between packets gets smaller, and we have to use less time in the network stack to process each packet. If you take a full MTU-size packet on Ethernet, at 10 gigabit we still have some room to work with, but when we go up to 100 gigabit, the time between packets arriving is only 123 nanoseconds. That does not allow a lot of work. The inter-packet gap? The inter-packet gap is actually also calculated into this; that's why I use the MTU size plus the Ethernet overhead of 38 bytes per frame.

I don't actually have access to 100 gigabit NICs, so I did a sort of problem reduction: I just take my 10 gigabit NICs, which I've had for a long time, and send smaller packets. The smallest packet I can send, due to the inter-frame gap and the Ethernet overhead, is 84 bytes on the wire. With that I can simulate something which is faster than 100 gigabit at full MTU: the gap between packets is 67 nanoseconds, which translates into 14.88 million packets per second. I'm going to use this number throughout the slides; I use it as a target because that is the wire speed at the smallest packet size at 10 gigabit. CPU-wise it's only 201 CPU cycles at 3 gigahertz. That's not a lot; it's kind of a crazy time budget. But is it at all possible with the current hardware we have? That's also an interesting question. This is very important; I work in the industry, and the time aspect doesn't seem to be considered that important. Is that exciting? Yeah, thank you.

So over recent years, a number of network stack implementations have appeared which bypass the kernel altogether. There's one called netmap, which FreeBSD has, and there are other solutions too: DPDK from Intel, PacketShader, a lot of other stuff, and the RDMA and InfiniBand things which Christoph told me about, which have actually been there for some years.
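To make the time budget above concrete, here is a small self-contained sketch (not from the talk) that reproduces the arithmetic behind the 123 ns, 67.2 ns, 14.88 Mpps and roughly 201-cycle figures:

```c
/* Back-of-the-envelope per-packet time budget at a given link speed and
 * on-wire frame size (Ethernet preamble, FCS and inter-frame gap folded
 * into wire_bytes).  Illustration only, not code from the talk. */
#include <stdio.h>

static void budget(const char *name, double gbit_per_sec, double wire_bytes,
		   double cpu_ghz)
{
	double ns_per_pkt = wire_bytes * 8.0 / gbit_per_sec; /* bits / (Gbit/s) = ns */
	double mpps = 1000.0 / ns_per_pkt;                   /* million packets/sec */
	double cycles = ns_per_pkt * cpu_ghz;

	printf("%-26s %7.1f ns/pkt  %6.2f Mpps  ~%3.0f cycles @ %.1f GHz\n",
	       name, ns_per_pkt, mpps, cycles, cpu_ghz);
}

int main(void)
{
	/* 1538 = 1500 MTU + 14 eth hdr + 4 FCS + 8 preamble + 12 IFG */
	budget("100 Gbit/s, full MTU frame", 100.0, 1538.0, 3.0);
	/* 84 = 64 byte minimum frame + 8 preamble + 12 IFG */
	budget("10 Gbit/s, minimum frame", 10.0, 84.0, 3.0);
	return 0;
}
```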
And those bypass implementations have basically shown that in the kernel we are not using the hardware optimally, because they can go significantly faster. Even though it's artificial network benchmarking, they can show that they can forward 10 gigabit at wire speed, at the smallest packet size, on a single CPU. That's worth noticing: they do it on a single CPU, not spread across multiple CPUs. So the hardware should be fast enough.

The way I see it, the kernel has been scaling out with the number of cores, and we have sort of gotten away with adding latency and more code in the kernel; we are hiding latency regressions. I think Christoph would agree with me that latency-sensitive workloads have already been affected by us adding more stuff to the kernel and the network stack. We then say it's fast enough because you can just add more cores, but actually we have added more latency to the packet processing path. So my goal is to improve the efficiency per core. For example, today if I do a test where I enable only a single CPU and do IP forwarding, I can only forward between 1 and 2 million packets per second, while the bypass alternatives do wire speed, 14.8 million packets per second. That's not a fair comparison, because the network stack is obviously doing more than that sort of fake benchmark where they just forward packets blindly, but it does show that the performance gap is so big that we should do something about it.

So I want all of you to understand the crazy timescale we are talking about. I've shown these numbers before: we only have around 200 CPU cycles. So here are some numbers you can relate this budget to, measured on a specific CPU. On this CPU a single cache miss takes 32 nanoseconds, so basically you can only take two cache misses and my entire budget is gone. That's not good; cache misses, we don't want those. And what we have today is the Linux SKB, the structure for our packets: it's four cache lines, and we insist on writing zeros to those cache lines when we allocate the SKB. The SKB is usually cache hot, so it's not a full miss.

So let's look at the cost of layer-2 and layer-3 cache hits. As I said, the SKB will usually be in cache, but what about the packet data? The machine I have here is an E5-something; Intel added some hardware (DDIO) which can deliver the data part of the packet directly into the layer-3 cache, so it doesn't have to go through main memory. It happens transparently. I've had a bit of trouble finding a measurement tool where I can actually see whether it happens or not, but it's quite obvious from the cache misses when it's not happening, so that's how I checked. On this specific CPU I measured the costs to be around 4 nanoseconds and around 8 nanoseconds for a layer-2 and a layer-3 hit, respectively. So that's the timescale we can work with.

So what's the cost of the lock-prefixed assembly instructions which we use for various synchronization purposes in the kernel? On this CPU I measured the cost of a locked operation to be around 8 nanoseconds. I also measured spin_lock plus spin_unlock in the optimal case, where there is no contention and we just lock and unlock on the same CPU, and the cost is around 16 nanoseconds. I don't know if it even makes sense to quote 8-point-something nanoseconds. What? Yeah, it's hot in cache.
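For reference, the shape of that kind of micro-benchmark, shown here as a rough user-space analogue with a pthread spinlock rather than the in-kernel measurement code actually used for these numbers, is simply a tight loop divided by the iteration count; the absolute numbers will not match the kernel measurements.

```c
/* Rough user-space analogue of an uncontended lock+unlock micro-benchmark.
 * Build with: gcc -O2 lockbench.c -o lockbench -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define LOOPS 100000000UL

int main(void)
{
	pthread_spinlock_t lock;
	struct timespec start, end;
	unsigned long i;
	double ns;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < LOOPS; i++) {
		pthread_spin_lock(&lock);	/* uncontended, always succeeds */
		pthread_spin_unlock(&lock);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
	printf("%.2f ns per lock+unlock pair\n", ns / LOOPS);
	return 0;
}
```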
So this is the most optimal situation; there's no contention. But within my time budget it's still fairly costly if you have too many of those. Paul? [Question partly inaudible: the point being that the same locked operation can take different paths through the hardware and cost different amounts depending on the circumstances.] Yeah, so it can quickly cost more.

So what do we actually want to do with these packets? Talk to user space about them. And the overhead of a system call is quite large on this timescale. When I first measured it, it was around 75 nanoseconds; then I figured out that the syscall auditing stuff from SELinux was quite costly, so I disabled that and measured it down to 41 or 42 nanoseconds per syscall. It's still a large chunk out of our budget. What we can do about it is amortize the syscall cost, and there are actually several existing syscalls that do that; you should notice the extra 'm' here: sendmmsg and recvmmsg, so we can send several packets back and forth per call. I looked at these, and they're not as good as I expected; we should actually improve the performance of those. At least in my tests; I didn't look at the code specifically, I only spent a short time on it and then moved on to other stuff, but there's a lot of room for improvement in these interfaces. You also have sendfile. Yeah, there are a lot of tricks you can do to get around this syscall overhead.

It's also interesting to look at some of the synchronization mechanisms we have inside the kernel. As before, we have the spinlock: lock plus unlock costs 16 nanoseconds, whereas bottom-half disable plus enable costs 7 nanoseconds. Plain IRQ disable plus enable is there too and doesn't cost a lot. The surprise to me came when I measured IRQ save and restore, because that costs 14 nanoseconds, which was a lot higher than I expected, and we use it quite frequently in the kernel. It's almost as costly as taking the lock, which was quite a surprise to me. And we have it in the slab allocator, right? Or I think I saw some patches where you tried to remove it. The SLUB allocator doesn't have it. No? Then it's the SLAB allocator which has that in the hot path.

So how did the other solutions manage to actually use the hardware? One of the general tools of the trade is batching; that's the main one. We also have preallocation and prefetching, obviously staying on the same CPU or NUMA node, and reducing kernel entries. The what? It's not [inaudible]; it has been around for years. Yeah, but doesn't it only work on InfiniBand? No. Okay. It was created by the InfiniBand people, and then others discovered they could use it and took up the idea. Okay, interesting; I should try to play with that a little. Then there's, obviously, shrinking the metadata and reducing the system calls.

But batching is sort of the fundamental tool. Instead of per-packet processing, we can do batching or bulking, but only where it obviously makes sense, and batching can be done at many different levels, like syscall batching and packet-processing batching.
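On the syscall-batching point, a minimal sketch of using sendmmsg() to push a whole batch of UDP packets through a single kernel entry could look like this; the batch size, payload layout and lack of error handling are placeholders for illustration:

```c
/* Amortizing syscall overhead: one sendmmsg() call sends BATCH packets. */
#define _GNU_SOURCE
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32

int send_batch(int sock, struct sockaddr_in *dst, char payload[BATCH][64])
{
	struct mmsghdr msgs[BATCH];
	struct iovec iovs[BATCH];
	int i;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iovs[i].iov_base = payload[i];
		iovs[i].iov_len  = sizeof(payload[i]);
		msgs[i].msg_hdr.msg_iov     = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen  = 1;
		msgs[i].msg_hdr.msg_name    = dst;
		msgs[i].msg_hdr.msg_namelen = sizeof(*dst);
	}
	/* One syscall for BATCH packets instead of BATCH syscalls. */
	return sendmmsg(sock, msgs, BATCH, 0);
}
```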
It's quite a simple example: the cost of locking was 16 nanoseconds, so if you process a batch of 16 packets each time you take the lock, the locking cost is amortized down to one nanosecond per packet, if you look at it that way.

Any questions about that? Otherwise I'm going to talk about some recent changes. What we did recently was sort of unlock the transmit side of the drivers. Now, with the in-kernel packet generator, pktgen, we can actually generate 14.8 million packets per second; this is the 10 gigabit wire speed I've been talking about. We're just spinning the same SKB, it's completely fake, there are no memory allocations. The primary trick is that we are bulking packets towards the hardware. What is going on is that we defer the hardware tail-pointer write, which is what notifies the hardware that packets are available. So we just keep filling the hardware transmit ring, and when we have a certain number of packets we do the flush operation towards the hardware. That is what makes us able to hit wire speed on a single CPU with this fake workload where we just spin the same packet.

I had a really hard time figuring out what was going on, namely that the tail-pointer write to the hardware was so expensive, because it didn't show up in any profiling with the perf tool. I guess because it's an external device being written to, over non-cacheable memory and so on, perf couldn't help me there. What happens is that the cost of the MMIO write would, in perf, sort of jump around and get attributed to the next lock operation. The next lock operation in this case would usually be the qdisc lock, so that would pop up on top, and perf would say the qdisc lock is very expensive. But in reality it was the hardware slowing me down.

A PCI analyzer can tell you exactly how long all the hardware operations are taking, and what the latency is across things like the NUMA interconnect between nodes and so on. With that we could actually determine that the CPU usage was really coming from the hardware. And this was done as part of the device bring-up type process; this sort of thing has traditionally been done and optimized at the hardware level, before it ever got to the point where we started looking at the network stack. Yeah, I would love to get my hands on a PCI analyzer; I don't have one, so I had to deduce it by other means. It would be very interesting to talk to you afterwards about where I can get my hands on that kind of hardware; I would love to play with it, or just generally understand what's available. It would be really valuable for me to get hardware where I can actually see this stuff.

So what we ended up with is what we call the xmit_more API. The SKB has been extended with a single bit, an xmit_more indicator, and the stack uses it to indicate a promise to the driver that another packet will arrive immediately after this one; an indication from the stack to the driver that we have more packets coming. So what the driver does is check whether its transmit ring queue is full.
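The resulting driver-side pattern looks roughly like the sketch below. It is modeled on how drivers picked up the skb->xmit_more bit around that time; the my_* helpers and the ring structure are hypothetical placeholders, descriptor setup is elided, and newer kernels use netdev_xmit_more() instead of the skb bit.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/io.h>

struct my_tx_ring {			/* hypothetical, heavily simplified */
	u16 next_to_use;
	void __iomem *tail;		/* MMIO tail/doorbell register */
	/* descriptors, netdev_queue pointer, etc. omitted */
};

static netdev_tx_t my_ndo_start_xmit(struct sk_buff *skb,
				     struct net_device *dev)
{
	struct my_tx_ring *ring = my_pick_tx_ring(dev, skb);	/* hypothetical */

	my_fill_tx_descriptor(ring, skb);			/* hypothetical */
	ring->next_to_use++;

	/*
	 * Only do the expensive, uncached MMIO tail-pointer write when the
	 * stack has no more packets queued behind this one, or when the
	 * ring is getting full and the hardware must start draining.
	 */
	if (!skb->xmit_more || my_ring_nearly_full(ring))	/* hypothetical */
		writel(ring->next_to_use, ring->tail);

	return NETDEV_TX_OK;
}
```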
If the ring is full, the driver needs to flush to the hardware anyway; otherwise it can simply add the packet to the hardware ring queue and defer the expensive notification to the hardware. That's the cooperation we have today between the stack and the driver, and the driver also needs to implement it.

The hard part is that we want to use this API without adding latency, so we should only do bulking when it's really needed, based on some solid indication from the network stack. We should not speculatively delay writing the tail pointer and bet on more packets maybe arriving shortly. That's hard to resist, because benchmarks would look really good if you did that kind of fake optimization, but we should really resist doing it.

What we do today is use SKB lists. We changed the xmit layer of the stack to work with SKB lists; actually David did that. We simply use the SKB next pointer, and we can also use that next pointer directly as the xmit_more indicator. Once you build up an SKB list, the TXQ lock, the lock towards the hardware, is no longer a per-packet cost, because we take the TXQ lock once and hard_start_xmit sends the entire SKB list while holding that lock. That's how it's done today: instead of sending individual packets down, you can work with a whole SKB list.

That SKB list has to be constructed before reaching that point, and the stack already has packet aggregation facilities: generic receive offload, with generic segmentation offload sort of connected to it, and TCP segmentation offload, which is a hardware feature. These basically already have the packets aggregated for us. So by allowing these to be bulked I'm not introducing any latency; building the aggregates up does introduce a little bit of latency, but at least I'm not adding more by allowing them to be bulked. Is that hooked into the system call side? No, it's not hooked in yet; that's something I want to work on, so we can have this work all the way from user space. It's on the drawing board, but there's definitely a long way to go. This is only on the transmit side; we don't have it on the receive side, apart from a little bit with generic receive offload, where it can build up the list, but not from user space.

Then we have the qdisc layer, where packets queue up before being transmitted; we have to have it because when the hardware pushes back and cannot send any more packets, we have to queue packets somewhere. If we have a queue in the qdisc, that is a very solid opportunity for bulking, a really good indication from the stack: the packets are already delayed, so it's easy to construct this list of SKBs. It's one of those cases where we're actually reducing latency and increasing bandwidth at the same time, because before we were paying a per-packet cost for the work in the qdisc, and now the cost is amortized over several packets when we pull them out. The qdisc locking is a bit nasty, though, and has some extra locking costs; I'm getting to that. It also has other properties. So, a bit about the qdisc path overhead: it's sort of nasty, because it has six lock-prefixed assembler operations.
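Before getting into that locking detail, here is a heavily simplified sketch of the dequeue-bulking idea just described: take the qdisc root lock once, pull several packets out into an skb list, then hand the whole list to the driver under a single TXQ lock. The real code in sch_direct_xmit() and friends deals with requeueing, BQL, qdisc state and more, and the dev_hard_start_xmit() usage here is simplified; this only shows where the amortization comes from.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

/* Illustrative only: dequeue up to 'budget' packets under the qdisc root
 * lock, then transmit the whole list under one TXQ lock. */
static int bulk_dequeue_and_xmit(struct Qdisc *q, struct netdev_queue *txq,
				 struct net_device *dev, int budget)
{
	struct sk_buff *head = NULL, **tail = &head;
	struct sk_buff *skb;
	int n = 0, ret;

	/* One root-lock section, amortized over up to 'budget' packets. */
	spin_lock(qdisc_lock(q));
	while (n < budget && (skb = q->dequeue(q)) != NULL) {
		*tail = skb;
		tail = &skb->next;
		n++;
	}
	*tail = NULL;
	spin_unlock(qdisc_lock(q));

	if (!head)
		return 0;

	/* One TXQ-lock section for the whole list; the trailing NULL in
	 * skb->next is what ends up clearing xmit_more for the last packet. */
	__netif_tx_lock(txq, smp_processor_id());
	dev_hard_start_xmit(head, dev, txq, &ret);
	__netif_tx_unlock(txq);
	return n;
}
```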
Each of those six lock operations, as we saw before, costs around eight nanoseconds, so that's 48 nanoseconds of pure locking overhead. I also measured the actual overhead of the qdisc layer, and it's between 58 and 68 nanoseconds, so we are spending a very large percentage of the real cost of the qdisc layer on lock operations.

What we did on the qdisc dequeue side is that this cost is now amortized over several packets, because when we dequeue we can pull out several packets and hand them to the hardware in an efficient manner. But that only applies when we have a queue. We have other situations: one is the empty-queue case, where we do something called direct xmit, and that also has to take all six lock operations. In the future I'm going to work on reducing those lock operations for the case where you can do a direct transmit. And on the enqueue side of the story we still have a per-packet locking cost. That might go away once we have bulking all the way from user space: then we can send a bulk of packets down and enqueue the entire bulk, and solve it by amortizing the enqueue cost as well.

This is a detail slide, but most of you are kernel developers, so it shouldn't be a problem; I've tried to sketch how the qdisc locking works. This also happens for the direct-transmit case, unfortunately. The first lock takes the root lock for the enqueue, and possibly the same CPU then also gets to do the dequeue, because we only allow a single CPU to dequeue packets while multiple CPUs can enqueue. We control that with the running state: other CPUs will simply exit if someone else holds the running state, just enqueue their packets and go away. Once we have dequeued a packet, or now several packets in recent kernels, we take the TXQ lock for the hardware and transmit to the hardware. Afterwards, when we come back, we take the root lock again, because we have to update the running-state bit to indicate whether we are done, and we also check whether some packets were enqueued while we were in that section, and then either transmit them or, if we are over quota, schedule a softirq. So the locking is a bit nasty to fix.

What we chose to do for dequeue bulking is to only do it for the BQL drivers, the drivers that implement Byte Queue Limits. The reason is that we want to avoid overshooting the NIC capacity, because once we overshoot and the NIC pushes back on us, we have to requeue packets, and the qdisc layer's requeue facility is not that good: we can have a very complex qdisc, and if we ask a complex qdisc to requeue, it's difficult to requeue back into the right queue, so it would require some logic to handle that. The BQL facility turns out to be really good at limiting the number of requeues that happen, so maybe we don't need to implement a better requeue facility in the qdisc, because BQL limits this very well. I've done a lot of measurements, and they show that the head-of-line blocking that occurs is acceptable.

So, future work: what do we need to work on? As noted, a big part of the TX side is what I've been talking about. The receive side also has some limitations, and the user space interfaces also have to be worked on, as was pointed out. And we have the FIB route lookup, which is very expensive with IP forwarding.
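As an aside, the Byte Queue Limits mechanism mentioned above is a small per-queue accounting API that drivers implement; a rough sketch of the two hooks is shown below, with the ring structure and ring-handling helpers as hypothetical placeholders.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* On the transmit path, after a packet has been placed on the TX ring,
 * tell BQL how many bytes are now in flight. */
static void my_tx_queue_skb(struct my_tx_ring *ring, struct sk_buff *skb)
{
	my_fill_tx_descriptor(ring, skb);			/* hypothetical */
	netdev_tx_sent_queue(ring->netdev_txq, skb->len);	/* BQL: bytes queued */
}

/* In the TX completion (clean-up) path, report what the hardware finished,
 * so BQL can adapt its limit and the stack knows how much more to push. */
static void my_tx_clean(struct my_tx_ring *ring)
{
	unsigned int pkts = 0, bytes = 0;

	my_reclaim_done_descriptors(ring, &pkts, &bytes);	/* hypothetical */
	netdev_tx_completed_queue(ring->netdev_txq, pkts, bytes);
}
```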
On the FIB lookup, very recently some of Alex's patches went into net-next which optimize the FIB route lookup quite a lot. And on the receive side we're also somewhat limited by the memory allocators, which we're going to talk about together with Christoph.

One of the future items is that I want a lockless qdisc. The motivation is that in the direct-transmit case, where the qdisc is empty, we still take all six of those locks, and I think we should be able to do better. Then we also still have the enqueue cost, which we could reduce; that's the case where there is something in the qdisc. We could reduce that from 16 ns to 10 ns, not a huge improvement but worth taking. I actually did a lockless qdisc implementation and did some measurements on it. It's difficult to implement if I also allow the bypass: there's a flag that lets the qdisc bypass the enqueue entirely, and it's difficult to do that quickly and still avoid reordering packets. But if I don't allow the bypass, I can save 48 ns with my implementation, so it seems quite worthwhile.

So the transmit side looks good, and we also have to fix up the receive side. The forwarding test I was talking about before is IP forwarding; I also did some bridge forwarding, which is also around 2 million packets per second. And we did some experiments tuning the receive side. What I call early drop here is dropping the packet right after receiving it, and we only reach 6.5 million packets per second on a single CPU. We were kind of confused by that: if you just receive the packet and drop it right away, why don't you reach full wire speed? So Alexei, the guy who also does the BPF stuff, worked a bit on optimizing the receive path and got up to 9.4 million packets per second by using the build_skb API and doing some prefetch tuning. But one thing the early-drop test doesn't show is the real interaction with the memory allocator.

How much time do I have left? What? 20 minutes. Okay. We could postpone this part, because it's mostly about the memory allocator, to the next slot.

So this is a fairly important slide about how the network stack uses the allocators. If I just drop packets early, I don't really see how the network stack uses the allocator, because what actually goes on is this: first, on the receive side, we poll the receive hardware ring queue and pick packets up, and because we have a budget of 64 packets, we basically allocate up to 64 SKBs in a row. Then, on the transmit side, we put the packets into our transmit ring, and we don't free the SKB structure right away; there are some complications with doing that. What happens instead, in step three, is that once the transmit completion runs, we free up to a budget of 256 SKBs in one go. That's how the network stack actually uses the allocator, and what I found is that IP forwarding is hitting, not the slow path, but the slower path of the slab allocator.

So I did some micro-benchmarking of the kmem_cache slab allocator. First I just fast-reuse the same element in a loop, so I'm really sure I'm hitting the fast path, which is lockless. The cost of doing that is 19 nanoseconds, on a preemption-enabled kernel, and that looks okay. Then I tried a pattern of doing 256 allocations followed by 256 frees, based on the usage pattern I just explained.
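To show the shape of the two allocation patterns being compared, here is a small sketch; the real measurements were done with a dedicated kernel test module, and skbuff_head_cache is simply the obvious cache to exercise, so treat the details as illustrative.

```c
#include <linux/slab.h>

#define OUTSTANDING 256

/* Pattern 1: fast-path reuse, alloc immediately followed by free of the
 * same element, which stays on the lockless per-CPU fast path. */
static void pattern_fastpath_reuse(struct kmem_cache *cache, int loops)
{
	void *obj;
	int i;

	for (i = 0; i < loops; i++) {
		obj = kmem_cache_alloc(cache, GFP_ATOMIC);
		if (obj)
			kmem_cache_free(cache, obj);
	}
}

/* Pattern 2: network-stack-like usage, 256 outstanding allocations (RX
 * refill) freed in one go later (TX completion). */
static void pattern_outstanding(struct kmem_cache *cache, int loops)
{
	static void *objs[OUTSTANDING];	/* static: keep kernel stack small */
	int i, j;

	for (i = 0; i < loops; i++) {
		for (j = 0; j < OUTSTANDING; j++)
			objs[j] = kmem_cache_alloc(cache, GFP_ATOMIC);
		for (j = 0; j < OUTSTANDING; j++)
			if (objs[j])
				kmem_cache_free(cache, objs[j]);
	}
}
```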
With that pattern, the cost increased to around 40 nanoseconds, because I'm hitting a slower path, though not the slow path. Is that the cost per alloc-and-free pair, or the cost of all 256? No, that's the cost per alloc-and-free pair.

I also tried to derive the memory manager cost by using pktgen. I hack-implemented recycling inside the pktgen code itself, while still touching all the usual data areas, zeroing and so on. Because I'm just spinning the same packet, I can't use a real device, which does the delayed free of the SKB; it would free my data underneath me and we'd have use-after-free and so on. So I use what we call the dummy device, and what the dummy device does when it receives a packet is simply free it, so it's basically a free operation. With no recycling and with recycling I measured how many packets per second I get, translated that into nanoseconds, and derived that the memory manager cost was 77 nanoseconds, which is slower than I expected. I'm quite sure I'm hitting the slab fast path, but what's also going on here is the SKB data page: allocating that costs more than the slab allocator. Is there a statistics subsystem you could use for this? Yeah, I'm going to play with that; I haven't used it yet, actually.

So my measurements show that the per-object cost is somewhere between 19 and 40 nanoseconds, with the pktgen-derived number even higher, and that is too large for my time budget; I need something faster. So either the memory management layer needs improvement, or I need something in the form of a faster memory pool.

So I actually implemented a faster memory pool, which I call a qmempool, because there was already something called mempool in the kernel. It's basically a lock-free bulk alloc and free scheme, backed by what I call the alf_queue, an array-based lock-free queue. I implemented this qmempool and plugged it into the SKB allocation in the network stack. In the use case where I drop packets in the iptables raw table, which also makes sure I'm hitting the slab allocator's fast path, it saves 12 nanoseconds on the fast path. When I do full IP forwarding, I'm saving 40 nanoseconds, because forwarding hits the slower slab use case that we will be working on improving.

I also did some micro-benchmarking against the slab allocator. For the fast-path use case, where we just reuse the same element in a loop, the slab allocator costs around 18 nanoseconds. If I use the qmempool in softirq context, which is possible because I restrict the qmempool to run in softirq in NAPI mode in the network drivers, my fast path is only 13 nanoseconds. If I need to use the qmempool outside softirq, I have to do a bottom-half disable, and then the cost goes up to a little more than the slab allocator. For the slower path, not the slow path but the slower path, using the pattern of 256 elements allocated before freeing, the qmempool, even though I'm using bottom-half disable here, scales up better: it's around 24 nanoseconds, compared to the 40 nanoseconds for the slab allocator in the micro-benchmark. And the real use case in the network stack, IPv4 forwarding, actually gave me better numbers than the micro-benchmark.

So what's the secret behind the qmempool? Why can I beat the slab allocator? It's primarily because I have bulk support in my lock-free queue.
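To illustrate that one idea, here is a heavily simplified user-space sketch of a bulk dequeue that reserves a whole batch of slots with a single compare-and-exchange. The real alf_queue handles producers, memory ordering and slot-reuse races much more carefully; this shows only the core trick of amortizing the locked instruction over the batch.

```c
#include <stdint.h>

#define Q_SIZE 256			/* must be a power of two */
#define Q_MASK (Q_SIZE - 1)

struct bulk_queue {
	volatile uint32_t prod_tail;	/* last slot published by producers */
	volatile uint32_t cons_head;	/* next slot to consume */
	void *ring[Q_SIZE];
};

/* Try to dequeue up to 'num' elements; returns how many were taken. */
static int bulk_dequeue(struct bulk_queue *q, void **elems, uint32_t num)
{
	uint32_t head, avail, i;

	do {
		head = q->cons_head;
		avail = q->prod_tail - head;	/* elements currently queued */
		if (avail == 0)
			return 0;
		if (num > avail)
			num = avail;
		/* One locked cmpxchg reserves all 'num' slots at once. */
	} while (!__sync_bool_compare_and_swap(&q->cons_head, head, head + num));

	for (i = 0; i < num; i++)
		elems[i] = q->ring[(head + i) & Q_MASK];
	return (int)num;
}
```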
On top of that bulk support, the design has what I call a shared queue, a multi-producer, multi-consumer queue which supports bulking elements in and out with a single compare-and-exchange, so I'm amortizing the cost of the locked compare-and-exchange operation. Then, per CPU, I have a single-producer, single-consumer queue which doesn't require any atomic operations at all. I actually speculate I could implement something faster with a simple per-CPU stack, but with this I also exercise that part of my library.

What I implemented underneath is what I call the alf_queue, the array-based lock-free queue, which is the basic building block. The killer feature is bulking with a single locked compare-and-exchange, and I support the multi-consumer, multi-producer cases. Another property of the queue is that it's array-based, so it's also cache-line efficient, because it's basically just a queue of pointers, and on 64-bit I can fit eight pointers per cache line; some of the improvement comes from that as well. It also allows pipeline-optimized bulk enqueue and dequeue: when copying the elements out, instead of a simple loop, I can do loop unrolling and measure that I'm using the CPU pipeline more efficiently. In practice I removed that from the proposal, because what I found was that with a too-clever enqueue/dequeue pipeline optimization, more instructions are used to implement the loop, and because the code size was larger I got more instruction cache misses, so it actually slowed performance down. I removed it for code-size reasons, because it turns out avoiding instruction cache misses matters more. So you can basically view it as an array of pointers used as a queue, but with bulk-optimized lockless access.

The qmempool was mostly a practical exercise to find out whether it's actually possible to do something faster than the kmem_cache slab allocator, which has been optimized over quite a few years, and also to provoke the MM people into doing something generic along these lines: it is possible, so can we make kmem_cache just as fast, integrate some of these ideas, perhaps extend it with bulking, which the next talk will be about. Christoph will talk about that, and we will get to it in the next session, I think. I think this is the end of my talk. So we can postpone the memory manager discussion, but is there any input on challenges I've missed in the network area?

The main thing is that you may be able to address the throughput issue, but we mostly have a latency issue. The problem is that the network operations for individual packets take too long, and that's why we use, for example, the IB verbs implementation: we can map the device control structures into user space and basically flip bits in user space to trigger a device operation. With this approach you may be able to saturate a 40 gigabit or 100 gigabit link, but you cannot really address the individual packet latency issues that we've seen. And I wish this would have been more integrated into the main stack.
The IB subsystem right now is kind of off to the side in the kernel, and I think one of the reasons was that the network guys felt this was an offload API, something like a TCP offload API, that they didn't want to support, and there has been a history of antagonism. I wish there were more integration there. And yeah, I was a bit surprised when I first read the slides that you didn't even mention the IB verbs API; please pay more attention to this. I didn't actually know about it; I guess something has to be done there. But I'm trying to reduce the latency of the stack. Yeah, I thought so. A lot of other people have been focusing mostly on throughput and on scaling across more CPUs, which hurts latency. Yeah, and we also have the queueing problem: queues are added everywhere, and the queue-buildup situations mentioned before cause these high latencies. It's a significant problem.

Okay, a bit of a different kind of question, but seeing how you're working so hard to push the hardware and the CPU itself to their limits: have you at times been closing your eyes and dreaming about what kind of instructions and other hardware facilities in the CPU would have greatly helped here? What were you wishing the CPU had, that it doesn't have now, to make this work significantly better? Do you have any ideas on that? Not really; I haven't thought about changing the hardware. Yeah, but I guess I should.

I was going to ask quickly: clearly your talk is preliminary and you're doing a lot of work on Intel; have you looked at any other architectures, POWER or anything like that? No, unfortunately; I've primarily been testing on Intel hardware, that's true. I have been testing the lockless queue, the lock-free queue, on ARM processors, and I think we also tested on PowerPC, because we had to verify correctness and we have to add memory barriers for the different architectures. But that was only a correctness test for the lock-free queue, not a performance test.

So there are various approaches in the works right now at the hardware level to actually increase the speed. One, in the ARM space, is to map a CPU register directly to I/O, so you wouldn't have to go through any I/O bus to reach the device. Intel has something called Omni-Path, where you have an interconnect directly on the processor; it bypasses all the buses and goes directly out to I/O, and this gives the processor the ability to do much faster I/O. On the other hand, the network stack won't be able to handle it: now you have to use CPU instructions to do I/O, and we are utterly unprepared for this, and I'm not sure how we could actually support it in the future.

I was going to ask a related question. Intel also have their DPDK stuff, which is aimed more at devices built as networking platforms rather than just endpoints. It sort of seems like that, and Open vSwitch, and a whole bunch of these projects somewhat intersect. How does your work interact with others in that space? Is there coordination happening behind the scenes or something? Well, I haven't actually used DPDK myself, but I have looked at it, and some of the inspiration for improving the network stack does come from the numbers I see DPDK can produce. I know other people at Red Hat are working on helping improve DPDK, making it more easily packageable, because right now it's really difficult to just take and use. You would have to sit in the middle of the room.
Just on the xmit_more flag and so on: if I remember correctly, that technique for being more efficient on transmit actually came from the block layer, the storage stack. Have you looked any more at what the current multi-queue block layer code is doing on the storage side to enable high IOPS, and whether any of that is relevant to the network stack? Well, I haven't actually looked at that, and I guess I should look at what the other subsystems are doing. Yeah, because basically that's the equivalent for the storage stack: we're now dealing with flash devices and so on that do millions of IOPS, and that's the infrastructure being used to scale both per CPU and across multiple CPUs, for the sort of equivalent of the packet throughput you're talking about here for the network stack. Yeah, but I haven't looked at it, and I guess I should ask Jens about it. Yeah, definitely.

Anyone else? Looks like we're done a little bit early. Thank you very much, Jesper; that was very interesting. I think you were crazy to try and get that done in that time.