[Host] Okay, thank you for coming over; we'll start the next session. Just a couple of small things. Please put your mobiles in silent mode so that we don't disturb the speaker. If you move between this or any of the other rooms, please be quiet with the doors, they're quite noisy. A small announcement: today at 4:30 p.m. we will have a competition with prizes in room D105, so everyone is welcome to come over and compete. That's all for the organizational announcements. Please now welcome Jesper.

[Jesper] Thank you. So, I'm going to be talking about some of the challenges we have in the network stack as the equipment gets quicker and quicker at ever-increasing speeds. I've sort of taken on the challenge of making the current stack scale to 100 gigabit. What will you get out of this? There will be a lot of numbers, and I'm sure I'll lose half of you in the middle of the session, but I want to give you an understanding of what's going on: what is so crazy about the time budget we have when driving these extremely fast networks. I'll also talk about some of the changes we've recently gotten accepted into the upstream kernel. And I've moved the memory allocator limitations up into this talk; I was going to cover that in the future-work session, but that work is now mostly done. In the future-work session I'll talk about what we still have to do, because we are definitely not done yet, even though I've been working on this project for something like two years.

So, 100 gigabit is coming very soon. What happens when the network speed increases? The time between packets gets smaller. That means the network stack and the software have less and less time to process each packet before the next one arrives, if they want to keep up with the wire. Look at the timescale for big Ethernet frames. At 10 gigabit there are about 1,230 nanoseconds between packets, corresponding to roughly 812,000 packets per second. At 40 gigabit we're down to about 307 nanoseconds between packets, which with these large frames means around 3.25 million packets per second. And at 100 gigabit we start to get into trouble: only about 123 nanoseconds between packets, and even with large frames we have to be able to handle around 8.1 million packets per second. So that's the challenge to the network stack, and the question is what we can do about it.

I've actually recently gotten some 100 gigabit NICs, but most of my testing is done on 10 gigabit NICs. What I do to simulate the worst case, and pressure the stack to the edge so I can improve the performance, is turn the frame size down to the smallest Ethernet frame you can send. If you calculate the gap between packets in that case, it comes out to 67.2 nanoseconds. That's the worst-case situation I can create, and I use it to tune the kernel so that we'll be ready when the 100 gigabit NICs really arrive. I'll talk more about this number.
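(As an aside, the arithmetic behind these numbers is easy to reproduce. Here's a small stand-alone sketch, not from the talk, that computes the inter-packet gap, the packet rate, and the cycle budget at 3 GHz that comes up next; the 20 extra bytes per frame are the Ethernet preamble, start-of-frame delimiter, and inter-frame gap on the wire.)

```c
/* Wire-speed arithmetic for Ethernet: gap between packets, packets/sec,
 * and the CPU-cycle budget at 3 GHz. */
#include <stdio.h>

int main(void)
{
    const double gbit[] = { 10e9, 40e9, 100e9 };
    const int frame[]   = { 64, 1518 };  /* min frame, full-MTU frame (incl. CRC) */
    const int overhead  = 7 + 1 + 12;    /* preamble + SFD + inter-frame gap */

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 2; j++) {
            double wire_bits = (frame[j] + overhead) * 8.0;
            double pps       = gbit[i] / wire_bits;  /* packets per second */
            double gap_ns    = 1e9 / pps;            /* ns between packets */
            double cycles    = gap_ns * 3.0;         /* cycles at 3 GHz */
            printf("%5.0f Gbit/s, %4dB frame: %7.1f ns/pkt, %10.0f pps, %5.0f cycles\n",
                   gbit[i] / 1e9, frame[j], gap_ns, pps, cycles);
        }
    return 0;
}
```

For the 64-byte case at 10 gigabit this prints the 67.2 ns gap and 14.88 million packets per second from the talk, and the roughly 200-cycle budget discussed next.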
But what you should really look at is how many CPU cycles that corresponds to, because that's the budget your code has to fit in. At 3 gigahertz, 67.2 nanoseconds is only about 201 CPU cycles per packet. You can't just go out and buy a faster CPU to fix this, and 200 cycles is not a lot.

Another thing: 100 gigabit NICs do actually exist. I took a picture so you'll believe it; it actually says 100G on it. I've been testing with them in my lab for a little while now. Not too long. It's fun to play with new hardware, right?

So, is this at all possible with this kind of hardware? Well, there are kernel bypass stack solutions out there, and they have been growing over the recent years. They have sort of pressured us into looking at the kernel, looking at ourselves, and asking: do we really use the hardware optimally, when out-of-tree bypass solutions can do something faster? It is a very artificial benchmark that, for example, DPDK does, but they show that with the same hardware I have, they can forward the smallest frame size at wire speed, and they do it on a single CPU. It's a little bit fake: they just take the receive descriptor, don't even look at the packet, and move it over into the transmit ring of the hardware. So it's fairly fake, but it does show that the hardware can handle these speeds. So why can't the kernel?

This next part is a little bit controversial, because I'm upset with us kernel developers. Everybody has been working on scaling the kernel to multiple CPUs, and it's been really great; it's quite impressive work to make the system scale almost perfectly across many cores. That's really nice, but what I'm saying is that we've been hiding a lot of regressions in the efficiency per core. And I also claim that along the way we have hurt latency-sensitive workloads. So I think we should look at improving the efficiency per core.

For example, for IP forwarding we're doing something like one to two million packets per second per core, forwarding IPv4 packets. The kernel does scale up: you can get to 12 million packets per second if you have enough CPUs. But I have a hard time with the bypass solutions doing it on one CPU; I want to do better. Then again, it's a little bit like comparing rockets with airplanes. A rocket just has to fly up really fast. We have to carry passengers: we have to pack the passengers into the plane, make it comfortable in there, take the passengers out again, and they have to get their luggage. So we are more the airplane type of solution.

So now I'm going to try to explain this crazy timescale to you. It will be interesting to hear afterwards whether anybody actually understood it; see if it blows your mind. I already mentioned that we only have about 200 cycles. To help you understand this timescale, I'm going to relate it to some other time measurements, so you get a picture of what we're dealing with. I did these benchmarks on one specific CPU; most of you are on this type of CPU, just probably not as fast in gigahertz. We could get a little faster just by cranking up the gigahertz, but let's see what happens on this specific CPU. I measured a single cache miss at 32 nanoseconds.
Oh my God, that's not good. What was my budget? It was 67 nanoseconds. Uh-oh: just two cache misses and I'm out of my budget. That's not good. And then look at the SKB, the sk_buff, which is the metadata structure for the packet: it's four cache lines. Oh no. And we insist on writing zeros over it every time we allocate a new one. Have we blown our budget already? Well, not exactly, because these are not all full cache misses; they usually hit in the level 3 or level 2 cache. So I went ahead and measured that too. An L2 cache access is around 4 nanoseconds, or maybe, if you're unlucky, 8 nanoseconds. That's a time period I can work with; I still have a chance of actually doing this.

Another thing is that we basically always have a cache miss on the packet data itself. But Intel is coming in and helping me a bit here, because they have this smart new feature called Data Direct I/O, where the NIC delivers the packet data directly into the CPU's cache. So now we have a cache hit of about 8 nanoseconds when accessing the packet data. Okay, it's looking good.

But then we have the atomic operations we need for the synchronization decisions we have to make in the kernel. Every time we take a lock, that's an expensive operation. So I measured that: an atomic operation, the LOCK-prefixed instruction in the assembler, costs about 8 nanoseconds on this specific CPU. That's not so good, and that's just for locking; I also have to unlock. I measured that too, and the lock/unlock pair together takes about 16 nanoseconds. That's starting to eat up my budget; I shouldn't have too many of these, right? And this is the optimal case, where only a single CPU is touching the lock. If there are actually two CPUs contending on the same lock, it gets far worse. We've mostly dealt with that in the kernel: we do really good scalability work, so locks are rarely contended.

I've also written the numbers down as cycles, measured on three different CPUs, because the nanosecond cost is a little deceiving. Nanoseconds are a time measurement, and the time depends on the gigahertz you're running at; you can crank up the gigahertz and go faster. Cycles are much more stable between CPUs, so you can compare that number more meaningfully.

So, what happens if you actually need to talk to userspace, the system call? Uh-oh. The syscall I measured was around 75 nanoseconds. I thought this was never going to fly. Then I found out that this was measured with syscall auditing enabled, which costs quite a lot; without it the syscall is around 42 nanoseconds. That's still a large chunk of my budget. Should I worry a lot? Well, userspace applications should be rewritten to use some of the batching APIs we actually already have, and I see almost no applications doing it. For UDP there's the extra 'm' in sendmmsg() and recvmmsg(), which lets you handle multiple messages in a single system call. I tried to search the internet, and it seems almost nobody uses them, so I wrote an example of how to use them and put it on my GitHub, hoping someone will pick it up. We also have sendfile(), which has been really popular; it gives a huge performance difference for serving files from a web server. So we have tricks for avoiding one system call per packet. We should be fine, but a lot of applications will have to actually start using this.
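(For illustration, a minimal sketch of the recvmmsg() pattern he's describing; this is not his actual GitHub example, and the port number is made up. One syscall fetches up to a whole batch of UDP datagrams, amortizing the 42-75 ns syscall cost across them.)

```c
/* Batched UDP receive with recvmmsg(): one system call per batch
 * instead of one per packet. */
#define _GNU_SOURCE
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

#define BATCH  32
#define PKT_SZ 2048

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {
        .sin_family      = AF_INET,
        .sin_port        = htons(9000),   /* hypothetical test port */
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    static char bufs[BATCH][PKT_SZ];
    struct iovec   iov[BATCH];
    struct mmsghdr msgs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base            = bufs[i];
        iov[i].iov_len             = PKT_SZ;
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    for (;;) {
        /* One syscall can return up to BATCH datagrams. */
        int n = recvmmsg(fd, msgs, BATCH, 0, NULL);
        if (n < 0)
            break;
        for (int i = 0; i < n; i++)
            printf("pkt %d: %u bytes\n", i, msgs[i].msg_len);
    }
    return 0;
}
```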
Another thing you have to know when you're optimizing at this level is the timescale of the different synchronization mechanisms, so I measured those too. This is on another CPU, so the lock cost doesn't correspond exactly to what I said before; compare the cycle counts instead. And I discovered something which was quite weird to me. The cost of locking plus unlocking a spinlock is 30 to 40 cycles. But then there's the operation where we just disable the local IRQs, the local interrupts, on the CPU, and that was more expensive than taking and releasing a lock. That's the save-and-restore variant, where you also have to save the CPU flags, and saving the flags is apparently quite expensive. If you already know what interrupt state you're in, if you know the interrupts are enabled in this context, you can safely just disable them and enable them again, which is much cheaper. The problem is that if you don't know the context, you have to save the flags and restore them, to make sure you're not enabling interrupts in a context where they should stay disabled. So you can actually save quite a lot, something like 30 cycles, which is quite amazing, every time you can replace a save-and-restore with a plain disable-and-enable. We did that in some places in the kernel where it made sense; there's a sketch of the substitution at the end of this section.

So, we've seen all these numbers, and when they add up it sounds like an impossible task. How did the other guys manage it? The main trick is batching, batching on all kinds of different levels. Another thing we found out about was the transmit queue tail pointer; I have some more slides about that. Then there are all kinds of tricks with preallocation and prefetching. Staying on the same CPU we're already quite good at in the kernel; that's part of what gives us our scalability. And of course reducing the locking, which I discussed before, and there are more tricks you can come and talk to me about afterwards.

Then there's cache optimization: getting structures aligned correctly. We've done a lot of that work in the kernel already; it's something we like to do. I also want to go into instruction cache misses, which is something we haven't optimized a lot; mostly it's something the compiler solves for us. We've tried a little in the kernel: we know some hot functions, and we inline them to make the instruction cache prefetching more efficient. But we haven't really optimized it much, because it's sort of invisible; you really have to use the perf tool to measure this stuff, and instruction cache misses are very difficult to profile.

But let's look at the fundamental tool, which is batching. The main challenge we have here is a per-packet processing cost, and we only have this very small amount of time for each packet. If we use batching, or bulking, we actually get some opportunities. But a word of caution: we should only do it where it makes sense. The bypass solutions just always bulk, and you can actually introduce latency if you always bulk and wait for opportunities to bulk. I want to live in the area where we try to avoid introducing latency. That, I think, is the value the kernel can bring: a solution that scales all the way up without just introducing latency the way the bypass solutions do now.
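(Here's the promised sketch of the IRQ-disable substitution, in kernel style. local_irq_save()/local_irq_restore() and local_irq_disable()/local_irq_enable() are the real kernel primitives; the surrounding functions and the per-CPU work are hypothetical.)

```c
#include <linux/irqflags.h>

/* Hypothetical hot-path helper, to illustrate the trade-off. */
static void refill_percpu_cache(void)
{
    unsigned long flags;

    /* Safe anywhere: saves the CPU flags so the previous IRQ state is
     * restored afterwards -- but saving the flags is what makes this
     * variant cost more than a whole spinlock lock/unlock pair. */
    local_irq_save(flags);
    /* ... touch per-CPU state ... */
    local_irq_restore(flags);
}

static void refill_percpu_cache_fast(void)
{
    /* Only valid if we KNOW interrupts are enabled in this context:
     * then a plain disable/enable saves roughly 30 cycles per pair. */
    local_irq_disable();
    /* ... touch per-CPU state ... */
    local_irq_enable();
}
```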
So, here's a really simplified explanation of how we can increase our time budget: we amortize the per-packet overhead. Say, for example, that I always process 10 packets per bundle, or batch. Then I can basically multiply the time I have: now I have 672 nanoseconds to handle each bundle of 10 packets before the next bundle comes in. It's oversimplified, because it's not that easy in practice, but that's basically what DPDK does: they always take 32 packets, wait for them, and then they have a large time budget because they've amortized away all the per-packet overheads. In practice it's a little more difficult.

So what have we actually done recently? I'm quite happy that this section of my slides keeps getting bigger and bigger. The theme of the last year was figuring out how to, what I call, unlock the true potential of the hardware on the transmit side, at the lowest level of the driver. We have this testing tool called pktgen, and we can now demonstrate that we can drive a Linux transmit device at 10 gigabit wire speed, which with the smallest frames is 14.8 million packets per second. It is cheating, because we're just sending the same packet over and over, but the primary trick going on is that we are bulking packet descriptors down to the hardware.

I had a really hard time figuring this one out. When we started, pktgen could only send 4 million packets per second, even though we were reusing the same SKB. Then I optimized everything I could, got to 7 million packets per second, and then I couldn't move any further no matter what I did. And something looked strange: there was a lock costing a lot, but this lock was not contended. That was strange; the only thing I could see in my perf profiles was this transmit queue lock being expensive. I'm running a single-CPU test, we're supposed to scale perfectly, and I'm sure nobody else is taking this lock. By my calculations it should cost about 8 nanoseconds, but it cost way more, several hundred nanoseconds. Something must be wrong.

So I started experimenting: I'll just remove the lock and compile the kernel without it. Of course, if any other traffic came in it would crash, but this is my test box; I can do that. And then all of a sudden one single assembler instruction accounted for 80% of the time. That doesn't make sense. I changed something else, and the hot instruction just moved. What? Where it finally ended up, I figured out, is that it's because we're writing the tail pointer of the transmit ring down to the hardware. That is a PCIe write, and the CPU cannot see what it costs. The CPU tries to be smart, and the cost gets charged to the next atomic operation. Some Intel guys later told me that's what's happening: the next atomic operation basically gets the blame, because the CPU cannot account for the I/O write at that point in time. That was quite interesting. So I made the change of delaying the tail write: put several packet descriptors into the ring, and only then write the tail. Then I jumped from 7 million to something like 13 million, and I thought: this cannot be true, something must be wrong.
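(As an aside, a hand-wavy driver-style sketch of that trick. The ring structure and fill_tx_desc() are hypothetical, but writel() on a tail/doorbell register is the usual idiom for kicking the NIC.)

```c
#include <linux/io.h>
#include <linux/skbuff.h>

struct tx_ring {
    void __iomem *tail_reg;   /* MMIO tail/doorbell register */
    u16 next_to_use;
    u16 size;
};

static void fill_tx_desc(struct tx_ring *ring, struct sk_buff *skb)
{
    /* ... write DMA address/length into the next_to_use slot ... */
    ring->next_to_use = (ring->next_to_use + 1) % ring->size;
}

/* Naive: one PCIe (MMIO) write per packet -- the hidden cost that
 * capped pktgen at ~7 Mpps. */
static void xmit_one(struct tx_ring *ring, struct sk_buff *skb)
{
    fill_tx_desc(ring, skb);
    writel(ring->next_to_use, ring->tail_reg);
}

/* Bulked: fill a burst of descriptors, then write the tail once.
 * Deferring this single write is the 7 -> 13 Mpps jump. */
static void xmit_burst(struct tx_ring *ring, struct sk_buff **skbs, int n)
{
    for (int i = 0; i < n; i++)
        fill_tx_desc(ring, skbs[i]);
    writel(ring->next_to_use, ring->tail_reg);  /* one doorbell per burst */
}
```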
I had to double-check all my results. It was true. And then we took it from 13 million and optimized it to full wire speed. Quite exciting to see that jump from 7 to 13.

But, as always, I don't want to introduce latency. I could just wait for packets, bulk them, and it would look really good, but I wanted to find out how we can integrate this into the kernel without introducing more latency. So we actually have implemented it; it's called the xmit_more API. The way it works is that we extended the SKB with an xmit_more indicator, which is just a bit. The stack uses this bit to tell the driver that another packet will be arriving immediately after this one; it's a promise the stack gives the driver: "I'll have one more packet for you right away." So if the driver sees the bit, and its transmit ring isn't filled, it can simply add the packet to the ring and defer the expensive tail write to the hardware; if the ring is full, it flushes anyway. There's a driver-side sketch of this at the end of this section.

So that's nice, now we have this feature. But when should we actually activate it? That's the hard part: using it without introducing latency, which is one of my goals. The decision to bulk should be based on some solid indication from the network stack that more packets really are coming. We could just speculatively delay packets a little at the transmit layer, and it's really hard to resist, because the numbers look so good. But no, I'm not going to fake this; I want the correct technical solution for the kernel.

What we ended up doing is changing the transmit layer so that we can send SKB lists down to it. When you send an SKB list down, skb->next is set, and the driver's transmit loop can use that as the indication that more packets are coming. This also helps with what we call the TXQ lock, the lock down to the driver: while it's held, nobody else can call into the same driver on the same transmit queue. By sending an entire SKB list down, we amortize that lock, and then we can amortize the tail write.

And we actually already have packet aggregation going on in the network stack today: on the receive side there's GRO, generic receive offload, and we have the related segmentation offload features, GSO and TSO, which help us by working with a bigger packet. At least for TCP this works quite well: GRO on the receive side builds a super-packet consisting of several smaller SKBs. So it was fairly easy to allow those segments to be bulk-transmitted down towards the qdisc, which was really nice, and it doesn't introduce any added latency. Beforehand, this had actually introduced a little latency for TCP, but that's not really a problem; we could also move some of the validation to a step where we don't hold the lock, but that's another optimization. So that was the existing aggregation.

So, what did we do for the qdisc layer? We looked at the qdisc and realized: if there is a queue in the qdisc, then that is the most solid indication you can get that you can start bulking. So that's what we implemented.
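(The promised driver-side sketch of honoring xmit_more. Kernels of this era exposed the bit as skb->xmit_more; newer kernels use netdev_xmit_more() instead. my_get_ring(), my_ring_full() and the ring details are hypothetical; it reuses the tx_ring sketch from before.)

```c
/* Sketch: defer the MMIO doorbell while the stack promises more
 * packets, flushing when the promise ends or the ring fills up. */
static netdev_tx_t my_ndo_start_xmit(struct sk_buff *skb,
                                     struct net_device *dev)
{
    struct tx_ring *ring = my_get_ring(dev);   /* hypothetical */

    fill_tx_desc(ring, skb);                   /* post descriptor */

    /* skb->xmit_more set means another packet follows immediately,
     * so we may skip the expensive PCIe tail write for this one. */
    if (!skb->xmit_more || my_ring_full(ring))
        writel(ring->next_to_use, ring->tail_reg);

    return NETDEV_TX_OK;
}
```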
These packets have already been delayed, since they were sitting in the queue, so it's easy to construct this bulk list without adding latency: the latency has already happened. It's actually one of those really rare cases where we lower the latency for the packets. We have a bunch of packets queued up, and instead of paying the full cost per packet, we now take out a big chunk of them and send them off together. We also amortize the lock down to the transmit queue. It's only the dequeue side we have implemented now; we don't have an enqueue side yet, I'll be working on that. But on the dequeue side we have amortized the cost, because we take the qdisc lock once and pull out several packets. There are some extra locking costs, which I'll come back to later.

Another thing we chose to do: one problem when we start bulking down towards the hardware is that we can't always know what the hardware is actually capable of accepting. If the driver returns a busy status, what should we do with these packets? Should we drop them? No. Can we requeue them into the qdisc? The qdisc is a complex structure, and right now it can only hang a single requeued packet off to the side as the next packet to transmit; supporting the hardware stopping in the middle of a bulk would require real surgery. So we chose to enable this only for drivers that implement something called BQL, Byte Queue Limits, which is a way of avoiding bufferbloat, if you know that term: it keeps drivers from having too many outstanding packets in the hardware. With BQL we know the dequeued bulk will fit, and the experimental results were so good that I'm not going to implement requeueing into the individual qdiscs just to support drivers without it. A sketch of the BQL driver hooks follows at the end of this section.

We've also done a lot of other optimizations. One of the most prominent is the optimization of the FIB lookup, the IPv4 routing table lookup; Alex improved that a lot, and together with a lot of other optimizations we basically doubled the IP forwarding performance of the kernel. That's been really cool.

I only have 10 minutes left, so here is what we've achieved over the past two years. At the lowest transmit layer I started at 4 million packets per second, as I talked about, and we can now send at full wire speed. At the lowest receive layer we still have some work to do; these are all single-core tests, and this is still a very experimental result: we started at 6 million packets per second and I've optimized it to 12 million packets per second, but that's only the lowest layer, where I drop the packets early in the driver instead of delivering them further up, and there are still some issues there. And for IP forwarding, we've optimized it from 1 million packets per second to 2 million packets per second per core. With multiple cores we have quite good scalability, so multi-core we actually do 12 million packets per second today, forwarding the smallest packet size, and that's a real benchmark on 7.2.
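(Here's that sketch. netdev_tx_sent_queue() and netdev_tx_completed_queue() are the real BQL accounting hooks in the kernel; the transmit and completion functions around them, and the single-queue assumption, are hypothetical.)

```c
/* Sketch of the BQL hooks that make bulk dequeue safe: a driver that
 * reports bytes sent and completed lets the stack bound how much sits
 * in the NIC, so the qdisc knows a dequeued bulk will fit. */
#include <linux/netdevice.h>

static netdev_tx_t my_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

    /* ... post skb to the hardware ring ... */

    /* Tell BQL these bytes are now outstanding in the NIC. */
    netdev_tx_sent_queue(txq, skb->len);
    return NETDEV_TX_OK;
}

static void my_tx_completion(struct net_device *dev,
                             unsigned int pkts, unsigned int bytes)
{
    struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

    /* Completion path: BQL adapts its limit and may wake the queue. */
    netdev_tx_completed_queue(txq, pkts, bytes);
}
```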
So, what do we still need to work on? Well, I have a talk next week that is only about what we should work on, so here you just get the short version. We are still not taking full advantage of the hardware's transmit capabilities: I still see stalls on the tail pointer write, indirectly, by looking at the lock from the qdisc layer. We also have some limitations in the queueing layer we have to deal with, and the baseline overhead of the qdisc itself. And the memory allocator is hitting its slow path; I've been optimizing the memory allocator, and we have some cache misses we have to fix. Let's see if we have time for all the slides.

So, transmit looks really good; how does receive perform? Remember, we have to reach about 8 million packets per second to handle 100 gigabit with big frames. So I tried it, with a driver test where I drop the packets early, and I was disappointed to see something like 6 million packets per second. I've optimized that to 12 million, using a lot of tricks: I hide the cache miss, I make the instruction cache use more efficient, and since most of the overhead there is memory allocation, I optimized the memory allocator to do bulk alloc and free, again to amortize; bulking is the general theme going on here. For this use case I actually also had to tune the slab allocator, and I'll try to get more automatic tuning of this into the kernel for different use cases. I've extrapolated where I believe I can get to: I believe I can optimize that lowest layer to something like 19 million packets per second.

Okay, only 5 minutes. Back to my instruction cache misses: one thing I noticed is that if you compile with GCC 5 instead of GCC 4, it's much better at inlining the right stuff and laying out the code correctly, something like a factor-of-10 reduction in the instruction cache misses, with performance going up significantly. There's a lot still to do here; I want to split the processing into stages so we process a whole bundle of packets at each stage. I think I'm going to rush now.

So, I identified the memory allocator as a bottleneck in the network stack. In my opinion that work is almost done, but let me explain what's going on. In the artificial receive benchmark where we drop the packets early, we don't see the memory allocator problem. But in real use of the network stack there are too many outstanding packets: the receive layer pulls out up to 64 packets and puts them into the transmit ring, then the packets go on the wire, and we have to wait for the DMA transfers to complete before we can clean up and free them. We end up with at least 256 outstanding SKBs, and that causes the SLUB allocator to hit its slow path. I identified this and spent quite some time optimizing the allocator. First I implemented my own allocator on top, just as a show-me-the-code example that I could do it faster than the memory allocator. People objected to that, because it's a cache on top of a cache, so I had to fix the real memory allocator. The numbers are actually quite impressive: the normal fast path is something like 42 cycles, but we are actually hitting the slow path, which is over 100 cycles. So yes, I've optimized this too: what I do is bulking instead, and that reduces the cost quite a lot.
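(The bulk interface this work introduces looks roughly like the sketch below; at the time of the talk the patches were still on the mailing list. kmem_cache_alloc_bulk()/kmem_cache_free_bulk() are the real API names; the object type, cache and batch size here are hypothetical.)

```c
/* Bulk slab alloc/free: take the allocator's locks once per batch
 * instead of once per object, amortizing the per-object cost. */
#include <linux/slab.h>
#include <linux/errno.h>

#define BULK 16

struct my_obj { int data; };   /* hypothetical object */

static int demo_bulk(struct kmem_cache *cachep)
{
    void *objs[BULK];

    /* All-or-nothing: returns nonzero on success, 0 on failure. */
    if (!kmem_cache_alloc_bulk(cachep, GFP_KERNEL, BULK, objs))
        return -ENOMEM;

    /* ... use the objects ... */

    /* Free the whole batch in one call. */
    kmem_cache_free_bulk(cachep, BULK, objs);
    return 0;
}
```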
When you're bulking like this, with enough elements, I'm down at around 37 cycles per object, and that's faster than the fast path. I'm really happy about that work, because we get a huge performance benefit from it; the patches are on the mailing list and will hopefully go in soon. Then we have some work on the baseline overhead of the qdisc. I'm not going to go into that, because I don't have time, but I estimate that something like 70% of the time there is spent in locking operations, so we need to do a lot less locking in the qdisc. John Fastabend from Intel has actually proposed a solution for a lockless qdisc, based on a lockless queue I implemented.

And a little bit of a commercial: most of these changes have been backported, so on 7.2 we see almost the same performance boost. You can look at Alex's talk, where we show that we actually doubled the forwarding performance with all this work; that's quite impressive. And then I have the future-work talk about all the other things we should do. So, does anyone have questions?

[Audience question, partly inaudible, about whether all this bulking adds latency.] I'm actually very satisfied with what I see when we start to do bulking. The way I'm doing the measurements now is averaged over one second, so there can be small spikes in between, and the first packet can be delayed a little bit. But think of the airplane analogy: if more people happen to arrive and they all board together, each individual person can actually see less latency than everyone being handled one at a time, which is what we do today. So by turning this on I'm actually removing a lot of latency, and it seems we're not introducing too many latency spikes.

[Audience question, partly inaudible, about testing the 100G NICs and whether the problems found are the vendor's fault.] Some of it, yes, and I'm going to help them out, that's no problem. But some of it is not their fault: it's the network stack that needs to adapt to this new challenge of such fast speeds. So I'm going to cooperate with them, and I get to play with the new hardware.

Okay, I think I'm out of time, so thank you. I'll upload the slides; can someone give a copy to the organizers? [recording ends]