So I'm Drew Gallatin. I do FreeBSD performance work for our CDN servers at Netflix. Usually I give talks about how to make things better, but today I'm going to give a talk about what breaks when we disable certain optimizations — basically, how do I make things worse?

Just to give you a bit of background: there's 800 gigs in the title, and we will get to 800 gigs. We got to 800 gigs last fall, just after last year's EuroBSDCon, and that kind of motivated me to look back and see how much the various optimizations that we and the FreeBSD community have made over the years have helped us. I want to emphasize the FreeBSD community, because most of the things I'm talking about today weren't done at Netflix. Most of them were done by the great people in the community, some of whom are here at this conference. The other thing you're going to notice is that a lot of the time there's only one optimization standing between you and chaos.

To give some background on what we do on our CDN servers: we run FreeBSD-CURRENT, we run the Nginx web server, and we have the easiest job in the world, because we just serve static files via sendfile. This is the hardware I'm going to be talking about for most of the talk. There's 800 gig stuff at the end, but it's a lot easier to abuse a 400 gig machine — there's a lot less of a blast radius if I crash something. So I do most of my experiments on a 400 gig machine, which is an older Rome-based AMD with sufficient networking and memory bandwidth.

I came up with a metric to compare these different optimizations, which I call gigabits per second per CPU. That way we can compare optimizations even where the max bandwidth is a lot lower than it could be.

So the first thing I'm going to talk about — and I'm going to start with the best setup and just keep making things worse — is how we actually run things: NIC kernel TLS with sendfile, and all of our optimizations enabled. We get about 375 gigs at about 53% CPU, and with my new metric we can call that 7.1. As I go through the optimizations, I'm going to talk a little bit about the performance and why the performance sucks, and I'm going to show you flame graphs. So this is a flame graph, and the thing to take away — if I can get to my laser pointer, which is stuck in my pocket — is that what you're really looking for is these plateaus. Basically, everywhere there's a plateau you're spending time, and it's an opportunity to optimize something. You can see there are not a lot of great big plateaus in this one. This is what we look like in our optimal setup.

So first I'm going to talk about sendfile and kernel TLS. This is my favorite picture. What this picture depicts is the data flow path — in this case, the data flow path if we weren't using sendfile and kernel TLS. To serve 400 gigs, we have to read 50 gigabytes a second — 400 gigabits a second — from storage. Then we need to read it out of the kernel with a copyout, so it gets read and written on its way to user space. Then in user space, it gets read and written to encrypt it. Then to put it back in the kernel, it gets read and written again. And then the NIC sends it.
And so that means we need one, two, three, four, five, six, seven, eight of these 50-gigabyte-per-second flows — 400 gigabytes a second of memory bandwidth. And like I said a few slides back, the memory bandwidth on this box is about 150 gigabytes a second. So that's quite a difference. What I'm going to talk about now is how we get rid of these lines.

So first, what's sendfile? Basically it's something that came about in the 90s or so, where essentially you give the kernel two file descriptors, a file and a socket, and you tell the kernel to send this much of that file out on that socket. The nice thing is that no data goes to user space, and the web server never has to even look at the data it's sending.

The first problem we ran into with this, way back in the early 2000s, before we were even doing TLS, is that when an Nginx worker is blocked, it can't service other requests. You can imagine that if you're going to a spinning disk, it could be 10 milliseconds, 100 milliseconds before the disk I/O comes back, and you don't want your web server stuck, because a whole bunch of requests could come in during that time. There are a bunch of different solutions to prevent Nginx from blocking like this; the one we used at the time was AIO, and since then they've added thread pools.

In reaction to that, some folks from Nginx, FreeBSD, and Netflix came up with this thing called async sendfile, where essentially you don't need to cover up the latency at all. What happens is that sendfile becomes fire and forget. When Nginx calls sendfile, the kernel allocates pages to store the data, attaches them to the socket buffer, and then initiates the disk read request. At the top of this diagram we have a very primitive depiction of a socket buffer, and TCP is sort of the gatekeeper here for what's allowed to go out to the internet. Here's the web server, making a request to the disks to read some stuff and allocating pages in the socket buffer. Those pages have been allocated, but they're marked not ready — that's what those little stop signs mean. Once the disk read completes, in the context of the interrupt handler, we go ahead and mark all those pages in the socket buffer ready. And finally, when they're all there, TCP is allowed to send them, and they go out on the wire. The nice thing about this is that it's all done with existing kernel threads, since the marking of pages ready and the triggering of the TCP stack is done out of the disk interrupt handler. That was a really clever solution, and it's tremendously useful.

So now, what's kernel TLS? I've talked about this at every conference for the last three or four years: we moved bulk crypto into the kernel, and the whole reason we did this is to preserve the sendfile pipeline. So here's the original diagram we started with. With sendfile and kernel TLS, we can get rid of a bunch of these arrows, and now things become a whole lot more possible. We still have four of the arrows, which means we're still using 200 gigabytes a second of memory bandwidth. So we still can't get the full 400 gigs, but we're in a lot better shape than we used to be. Now, what's NIC kernel TLS? Basically, that allows us to remove the remaining two lines, because now the NIC is going to do the crypto.
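To make that concrete, here's a minimal sketch of what a sendfile call looks like on FreeBSD. This is toy code, not what Nginx actually does — Nginx drives this from its event loop and handles short writes, ranges, and so on — and the kernel TLS setup happens separately, through the TLS library, before data starts flowing.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>		/* sendfile(2) and struct sf_hdtr live here */

/*
 * Toy example: push `len` bytes of an open file out a connected socket.
 * The file data never passes through user space; with kernel TLS (or
 * NIC kTLS) set up on the socket, it gets encrypted on the way out.
 * With async sendfile this call is fire-and-forget: it returns without
 * waiting for the disk reads to complete.
 */
static int
send_chunk(int filefd, int sockfd, off_t off, size_t len)
{
	off_t sbytes = 0;

	if (sendfile(filefd, sockfd, off, len, NULL, &sbytes, 0) == -1)
		return (-1);
	return (0);
}
```

In the kernel TLS case, the call typically goes through OpenSSL's SSL_sendfile(), which boils down to the same sendfile(2) once kTLS has been negotiated on the socket.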
With NIC kernel TLS, we're down to needing 100 gigabytes a second of memory bandwidth, which is less than 150, and so now 400 gigs is possible on this system.

And here's where I start making things worse. So what happens if we disable kernel TLS and sendfile? I was expecting elevated CPU and memory bandwidth usage, but what I actually found was lock contention in AIO. We got to about 40 gigs and we were just stuck spinning on locks. So you can see the performance metric drops from the ideal 7.1 gigabits per second per CPU to way less than 1 with AIO. And in the flame graph — I was talking about how plateaus are bad — whenever you see these plateaus in lock_delay, you know you've got lock contention problems, and you're spending all your time just spinning on locks.

So let's move away from that. Let's not bother fixing a problem in AIO, because we're not really going to use it. Let's try the solution more commonly found in Linux, which is using the async thread pools. This did a little bit better; it got to about 90 gigs or so. And this is more what I was expecting, because we have the additional data copies, so I was expecting to see a lot of time spent accessing memory, and that's what we see. You notice it's way better than AIO, but we're still nowhere near as good as the optimal configuration. And now you see these plateaus are copyin and copyout, there's a memcpy in Nginx, and it's doing the crypto. A lot of these plateaus are just basically accessing memory. So that's kind of what I expected.

Now for the next trick, we disable sendfile, but we still use kernel TLS. And it's somewhat surprising that this actually got worse. The reason it got worse — or at least my assumption of the reason — is that when we do the crypto in user space rather than the kernel, the data being encrypted is hot in the cache from just having been read out of the kernel. So this is, again, a little bit worse, and you can still see lots of the same plateaus, but you also see we're doing kernel TLS — that's what this KTLS stuff means.

So then let's try to make things better: keep sendfile disabled, but use NIC kernel TLS. Now we're back to about 95 gigs, and in addition to some lock contention, we're still accessing memory more than I thought we would. There's some extra memcpy in Nginx for SSL, which I don't understand and haven't tried to hunt down. If I were actually using this, I would hunt it down, but for the purposes of this talk I just recorded it. You can see it's a little bit better than it was, but still not great. It just goes to show that you can have all these great things like kernel TLS or NIC kernel TLS, and they won't help you at all if you don't use them right. And again, lots of plateaus for memory copies and copyin and copyout, and for some reason Nginx is doing a huge memcpy, which I don't understand — the same thing I would hunt down if we were actually using this mode.

Now I'm going to talk a little bit about how we run software kernel TLS. We don't run software kernel TLS just out of the box from a stock FreeBSD install; we use the ISA-L kernel module. What's ISA-L? Basically, Intel wrote some really hand-coded assembly that can do a bunch of things like compression for storage. It was originally intended for storage, but it also does bulk crypto.
And the thing we like about ISA-L is that it uses non-temporal stores. What that lets you avoid is the read-modify-write situation: if you're trying to write, say, eight bytes into a 64-byte cache line, the way a CPU normally works is that it reads the cache line in from memory, inserts the eight bytes somewhere in the middle of it, and then writes it back. The problem is that it's reading all of that data for no reason, because we're going to replace the whole cache line anyway. Non-temporal stores allow us to just write it, and that saves us memory bandwidth.

So this next one is from before we discovered ISA-L: we're running with sendfile and kernel TLS, but no ISA-L. You can see it's less than ideal, but what we really need is the thing to compare it with, which is running with ISA-L: we go from about 180 to about 240 gigabits. And you can see in the flame graph the difference is the AES-GCM code that's in the kernel, and there are some extra memcpys — again, something I would hunt down if we were actually using this, but for now I'm just recording them. With ISA-L it's a different crypto routine, and there are no memcpys.

So now we're kind of getting out of my comfort zone, and I'm going to talk about some virtual memory optimizations. One of the most important optimizations, at least for Netflix's workload, is the UMA per-CPU free page cache. This was, I think, originally conceived of at Netflix, and it was probably independently conceived of elsewhere. Randall Stewart, and I think Scott Long, wrote a sort of pre-UMA hand-coded version of it; I don't really remember exactly where it came from. But the idea was to have a per-CPU pool of free pages. As you'll see later, one of the biggest sources of lock contention we have is accessing the page queues, and if you can bypass those queues — not take any locks, just use a per-CPU pool — that should give you better performance. More recently it was upstreamed; I think Mark Johnston did the work, and it's now managed by UMA, because UMA is really good at having per-CPU caches. The critical thing is that it only works for free pages. It doesn't work for things in the inactive queue or the active queue, so it's limited to free stuff.

For the next experiment, I disable that page cache. Everything else is configured optimally — NIC kernel TLS, sendfile, all the stuff we were talking about in the previous section is enabled — and you wind up with some really bad performance, even worse than some of the things in the first section. That's because you've got all this lock contention when we're freeing pages and allocating them.

Another incredibly useful optimization from Mark Johnston is VM batch queues. Batch queues are essentially a way to relieve some of the lock contention on all the different page queues. Instead of taking a lock, processing one page, adding it to a queue, and then releasing the lock, you store up a number of pages in a per-CPU area, and once you finally reach the maximum number of pages, you take the lock and push them all to the page queues as a group. When you disable this, at least with our workload, you lose quite a bit of performance: you go from 375 to about 280 or so, with the CPU maxed out.
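Just to make the batching idea concrete, here's a rough sketch in pseudo-C. This is illustrative only — the names are made up, and the real FreeBSD vm_batchqueue code is more involved (it keeps separate batches per page queue and has to worry about preemption):

```c
#define	BATCH_MAX	32		/* hypothetical batch size */

struct page_batch {
	struct vm_page	*pages[BATCH_MAX];
	int		 count;
};

/* One batch per CPU, so adding a page to it needs no lock. */
static struct page_batch pcpu_batch[MAXCPU];

static void
enqueue_page_batched(struct vm_page *m)
{
	struct page_batch *b = &pcpu_batch[curcpu];

	b->pages[b->count++] = m;
	if (b->count < BATCH_MAX)
		return;			/* common case: no lock taken */

	/* Batch is full: take the page queue lock once for the whole group. */
	page_queue_lock();
	for (int i = 0; i < b->count; i++)
		page_queue_insert(b->pages[i]);
	page_queue_unlock();
	b->count = 0;
}
```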
In that flame graph, things look better than before, but you still see lots of lock contention when we're freeing things — I believe it's to the inactive queue here. You still see the page queue batch code; that's just because the way I disabled this was to hack the batch size to zero, so it was always submitting things as soon as it got them.

Another really important optimization for us is SF_NOCACHE, which is a flag to sendfile. The idea is that it causes sendfile to free the page directly to the free list, or the per-CPU free pool, rather than going through the inactive queue. We do this because we have a gauge of how popular various content is. If somebody is watching something and we don't think anybody else is going to watch that same chunk of that same video, in that same encoding, in the next couple of minutes, then when the client requests it, we have the web server mark it with SF_NOCACHE. That means, like I said, that we just throw it away. And that relieves a lot of contention on the inactive queue by making better use of this per-CPU free pool.

If we disable that, again, we wreck performance. We go down to 120 gigs at 55% CPU. One of the interesting things here is that in most cases, when we have bad performance, the CPU is maxed out, but it's not maxed out in this case. The problem is that the clients are just running away in terror because Nginx is not answering their requests in a timely fashion. So we get to about 120 gigs, Nginx keeps getting delayed by lock contention, and the clients keep going somewhere else.

And here's kind of a running table of how we can break things: with no page cache it's really terrible; with no batch queues it's not great, but not horrible; and with no SF_NOCACHE it's pretty bad. And again, with no SF_NOCACHE you end up seeing lots of contention on the inactive queue, even with batch queues, and you wind up spinning on locks again and spending all your kernel time on that.

So I'm going to segue to a different architecture. In the last year or two we've been playing with arm64, and if you look at a lot of our flame graphs, we spend a lot of our time in page management. One of the things that was enabled on arm64 just in the last few months is 16K pages — I want to thank Andrew Turner for that, and also Warner and Chuck and Kirk for figuring out some weird implications of 16K pages for the file systems and the disk drivers. By using 16K pages, we see a huge performance improvement on our Ampere box: we go from about 345 gigs to about 368 gigs, and if you notice, the CPU goes from being basically maxed out to having substantial idle time. I didn't write out all the details of this machine, but it's essentially identical to the AMD, except it's an Ampere: 80 cores, 3 gigahertz, 128 gigs of RAM, with enough TLS-offload NICs to push 400 gigs. Just to compare with the same metric — this is the first time we've seen it on ARM — we go from a little over 4 to around 5 when we enable 16K pages. And just to show you the difference in the flame graphs: this is 4K pages, and you see time spent in the vm_page functions, some big plateaus. Then if you go to 16K pages, notice how everything got a little bit narrower. That's us recovering CPU time and doing more work with less effort, which is always a good thing.
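Before moving on to the network stack, here's a tiny sketch of how the SF_NOCACHE decision described above might look from the web server's side — purely illustrative, with a made-up popularity flag standing in for Netflix's actual heuristics:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <stdbool.h>

/*
 * If we don't expect anyone else to request this chunk again soon, pass
 * SF_NOCACHE so its pages go straight back to the free pool instead of
 * sitting on the inactive queue. `expect_rerequest` is a stand-in for
 * whatever popularity signal the server keeps; it is not a real API.
 */
static int
serve_chunk(int filefd, int sockfd, off_t off, size_t len,
    bool expect_rerequest)
{
	int flags = expect_rerequest ? 0 : SF_NOCACHE;
	off_t sbytes = 0;

	return (sendfile(filefd, sockfd, off, len, NULL, &sbytes, flags));
}
```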
All right, so now I'm going to talk about some network stack optimizations that are pretty important to us, and some of these are pretty basic things. One of the most basic is LRO. TCP is pretty expensive to run: when a packet comes in off the wire, there's a lot of processing that needs to happen. I did the first LRO in FreeBSD, and it was very simplistic; it was later moved out of the driver I maintained and into the TCP stack itself. I did the initial LRO because I was working for a 10 gig NIC company, and with a standard 1500-byte MTU in 2006, you couldn't receive anywhere close to 10 gigs. LRO basically batches up packets as they're being received. It works on, say, eight connections at a time: it finds a packet for the first connection and starts a chain, and when the next packet for that connection comes in, it gets appended to the chain. You keep building up these chains for those eight connections, and when a chain gets to the maximum size, or you start seeing other connections, you flush the chains. That enabled us, on some horribly weak 2006 single-core AMD box, to go from, I think, 2.5 gigs a second to 10 gigs a second with some idle time. There have been a lot of advancements made to it since then, but the whole advantage is that you avoid running the TCP stack eight times, 16 times, 40 times, whatever your maximum aggregation size is, and that saves a lot of CPU time.

If we disable LRO, we see a bit of a drop in performance: about 330 gigs at 65% CPU. One of the things I noticed is that when we do this, the NIC starts to drop some packets, and that impacts our health — I haven't talked about that; it's kind of a Netflix-specific thing. Health is basically a signal to the load generator that we're not healthy and it should send traffic elsewhere. Lots of things will impact health: high CPU usage, dropping packets, high disk latency, high Nginx response latency, whatever. But basically, by dropping health, we're telling clients to go somewhere else. So with LRO disabled, we reach a max of about 330 gigs before the clients start going somewhere else. You can see that's probably one of the optimizations with the least effect, because it's pretty much the highest bar of anything we've seen so far. And here's the flame graph: it shows a lot of time spent in the TCP stack out of the network interrupt handler.

The next thing we're going to talk about is RSS-assisted LRO, which Hans Petter Selasky came up with. If you remember, I said LRO keeps track of a limited number of connections at once — like eight. Well, in a workload like Netflix's, we might have a NIC with, say, 32 queues, and we might have, say, 100,000 connections, so that's roughly 3,000 connections per queue. The odds that a whole bunch of packets on the same connection are going to arrive close enough together to be processed together are pretty minuscule. So Hans came up with this super clever idea: you gather all the packets up in a big bunch before you submit them to LRO, and then before you submit them, you sort them by the RSS hash that came off the network card and by their arrival time.
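Here's a sketch of that gather-and-sort step, just to illustrate the idea — a toy comparator over a hypothetical packet record, not the actual tcp_lro code:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for a received packet waiting to be fed to LRO. */
struct rx_pkt {
	uint32_t rss_hash;	/* flow hash computed by the NIC */
	uint64_t arrival;	/* arrival order / timestamp */
	void	*mbuf;
};

/* Sort by RSS hash first, then by arrival time within a flow. */
static int
rx_pkt_cmp(const void *a, const void *b)
{
	const struct rx_pkt *pa = a, *pb = b;

	if (pa->rss_hash != pb->rss_hash)
		return (pa->rss_hash < pb->rss_hash ? -1 : 1);
	if (pa->arrival != pb->arrival)
		return (pa->arrival < pb->arrival ? -1 : 1);
	return (0);
}

/* Gather a big batch of packets, sort them, then hand each run to LRO. */
static void
sort_before_lro(struct rx_pkt *pkts, size_t n)
{
	qsort(pkts, n, sizeof(pkts[0]), rx_pkt_cmp);
	/* ... walk the array and feed each same-hash run to the LRO engine ... */
}
```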
What happens is you transform this mishmash of packets that are all from different connections into runs of packets from the same connection, and when you do that, LRO can actually process them. If we disable this — the numbers are a little bit different — we get to 70% CPU. And here's one of the reasons I've got this metric: you can see that if we disable all of LRO, we're here, and if we disable just the RSS assist, that makes LRO pretty ineffective, and it's almost the same as disabling it entirely. You can see this big gap between the two, which is bigger than the bandwidth gap would indicate. And again, we see lots of activity in the TCP stack coming out of the network interrupt handler.

Another thing which is super important is TSO. The principle is the same; the difference is that this requires cooperation from the network card, and pretty much all network cards since the early 2000s have supported it, even the cheap ones. The idea is that, like LRO, we just reduce the number of trips through the network stack. The TCP stack decides it's got the space to send, say, 64K to a client, and rather than looping through and sending one 1500-byte packet at a time, 40-some times, it just sends one huge, gigantic 64K packet down through the lower parts of the network stack to the network driver. The network card itself is then responsible for replicating the IP header, replicating the TCP headers, and splitting that 64K TCP chunk up into a bunch of valid 1500-byte Ethernet frames. On the host side, that avoids a lot of trips through the network stack, and it also avoids allocating a lot of mbuf headers and a lot of queue manipulation in the socket buffers.

If we disable TSO, we get to about 180 gigs at 85% CPU. This number took a little while to get, because I needed to disable interrupt coalescing, just because we're putting so much more pressure on the transmit descriptor rings. And as you can see — there it is — a lot more time is spent in the network driver. I don't think we've ever seen ether_output and the Mellanox transmit chains showing up this prominently. That's just because, like I said, we're calling them probably 10 or 20 times more frequently. You don't always have 64K to send, and the network guys hate actually sending 64K at once anyway — they like to send at most, say, eight packets at a time, because they worry about buffers overflowing on the client side.

So now let's disable both TSO and LRO, to reduce all of our batching to zero. That takes it down a little more: we get down to 170 gigs, again with the same interrupt coalescing caveat. This is no TSO, and this is no TSO and no LRO, and you can again see lots of network stack stuff, lots of prominent ether_output and Mellanox driver calls. Everything is much worse just because we're running everything so much more often and doing so much more work for no reason.

Now I'm going to get to the interesting part of the talk, the headline part. Last year I really wanted to give some results on an 800 gig box, and I couldn't, because it got stuck in some kind of shipping snafu. But that was a year ago, so I've had a year to play with the thing. It's a Dell R7525 with two Milan 64-core CPUs, and the important thing is that it has three links between the sockets — that will become important later.
It has half a terabyte of RAM, enough Mellanox ConnectX-6 NICs to drive 800 gigs, and enough storage to feed it. The first time we fired the thing up, we got 420 gigabits out of it, which is not very impressive. If you remember my NUMA talks from a few years ago, there are a bunch of different ways you can configure NUMA machines to run the Netflix workload, and we're running in what I'm calling network siloing mode, where connections are hashed by the incoming NICs to a particular NUMA node and we try to do all the work we can local to that node. During this test the CPUs were mostly idle, but we were dropping packets like crazy. What AMD guessed was that the links between the CPUs were down-training to x2 width just to save power — it took the low CPU load as a signal that it should be saving power. You can force them not to downclock or downtrain; the feature is called dynamic link width management, and if you turn it off, we get up to 500 gigs. That's again in network siloing mode, and the important thing to know is that it means the data from the NVMe drives is getting DMA'd directly from the NUMA node where the NVMe drives are to a different NUMA node where the NICs are.

One thing we noticed, using AMD's profiling tools — which are available on their website, by the way, so anybody can use them — is that the xGMI links were very, very unevenly utilized: 15 gigabytes a second on one link, 4 gigabytes a second on another, and 2 gigabytes a second on a third. What AMD eventually told us was that when you're doing DMA across the NUMA fabric, which we are because we're doing network siloing, the xGMI link is chosen based on the location, on the originating CPU socket, of the device doing the DMA. There are essentially four I/O quadrants in an AMD CPU, and it hashes by which quadrant the I/O device is in. Unfortunately, with this Dell, things are very uneven: most of the NVMe drives are in the same I/O quadrant, which is why we were seeing such uneven use of the xGMI links.

One of the things we can do to improve this — remember how I said we're DMA'ing to the NIC's NUMA node — is to DMA to the NVMe drives' NUMA node instead. Then the problem goes away, because now it's the NICs doing the cross-socket DMA, the NICs are much more evenly distributed across the quadrants, and when we hash based on the NIC's location, the xGMI links get used more evenly. So that's what I did: I flipped things so that now we're DMA'ing directly to the node the NVMe is on. That helped things a little bit; it got us up to about 670 gigs, and like I'd hoped, the xGMI usage was much more even, at 10, 10, and 7.

Now, doing this is problematic for a lot of reasons, and I'm going to make things even more problematic in a second, and then I'm going to go over all the problems. At this point I decided to turn on full disk-centric siloing, which, if you saw my talk from a couple of years ago, basically means that in addition to DMA'ing stuff to the NUMA node where the NVMe drives live, I actually migrate the TCP connections to that node. We essentially tear down NIC kTLS on the original node, move the TLS session to software, and then move it back to hardware on the new node. So now, no matter what node the TCP packets come in on, they're always going to go out of the node that has the NVMe storage for that connection.
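Stripped of all the details, the policy change is just a question of which NUMA domain the bulk-data pages land in. Here's a hypothetical sketch — `nic_domain` and `nvme_domain` stand in for however those get looked up, and the real decision lives in Netflix's kernel changes, not in anything this simple:

```c
/*
 * Where should the pages backing a sendfile transfer live?
 *
 * NIC-centric siloing: DMA the disk data across the fabric into the
 * domain where the NIC lives, so transmit stays local but the NVMe
 * drive does the cross-socket DMA.
 *
 * Disk-centric siloing: keep the pages on the NVMe drive's domain and
 * migrate the connection (and its kTLS state) there instead, so the
 * bulk data never crosses the fabric at all.
 */
enum silo_policy { NIC_CENTRIC, DISK_CENTRIC };

static int
choose_page_domain(int nic_domain, int nvme_domain, enum silo_policy policy)
{
	return (policy == NIC_CENTRIC ? nic_domain : nvme_domain);
}
```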
The idea is that there will be essentially zero NUMA crossings for bulk data, which helps with the xGMI situation. Now, there are a lot of problems with doing it this way, which is why I would never actually run it this way in production. When you're running in network siloing mode, the traffic is distributed, or sharded, by the hash that LACP does across the NIC ports, and that's a hash over the IP addresses and port ranges. That's a hash over hundreds of thousands or millions of possible values, so the hashes end up being really, really evenly distributed, and your load is very evenly distributed between NUMA nodes. But with disk-centric siloing, it depends on where your content is, and now you're choosing between, say, 8 or 16 NVMe drives — and 8 or 16 is a lot less entropy than hundreds of thousands or millions. So if you have a title that's really popular, all of the traffic for it is going to end up on one NUMA node with disk-centric siloing, whereas with network-centric siloing it would be distributed evenly across the NUMA nodes. The problem is that a really popular title, or just a hot disk, translates into a hot NUMA node, and you can wind up with one NUMA node that's hitting the bandwidth limits of its NICs while the other NUMA node is just loafing along.

I also mentioned that we move TLS sessions; that's kind of expensive, and it's something I'd rather not do. The other problem is that once we do this, the affinities for a lot of things are now wrong. In particular, we have TCP ACKs coming in on the wrong NUMA node, and they get processed across the NUMA fabric, which adds latency to the processing. The TCP pacing is being done on the wrong node — that's something I could probably fix if I had to. And I think I already talked about the uneven sharding.

But we got some good results from it, just as a science experiment: we can get to 731 gigabits in this configuration. That's because disk-centric siloing takes almost all the load off the xGMI links, except for just CPU cross traffic. At this point we're limited by network output drops, not CPU, and the cause of the drops is basically those disk-centric siloing problems I was mentioning. In some cases one node is just being pushed too hard with respect to how much memory is being used. The other thing I didn't mention is that if there's a hot node, our software that decides what content is popular isn't NUMA-aware, so it makes the cache-or-no-cache decision globally, based on how much memory is in the system, not on how much memory is on each NUMA node. You wind up with the page daemon going kind of crazy on one node because not enough stuff is being directly freed. And the other problem, like I said, is that with hot content you wind up with one node doing more and one node doing less — all the NICs on one node in this experiment were doing 94 gigs and all the NICs on the other node were doing 89 gigs.

So I think that's about it, and I have time for questions now — way more time than I thought. I must have rushed through this because I was worried about the whole thing fitting. Yes — I can repeat the question if I can hear you.

That's actually one of the reasons why I give this talk. Let me go back to the slides; I probably should have made an explicit slide for this. Let's see, where's my QUIC... QUIC is basically the dumb way, because
there's no sendfile for it: everything happens in user space, but we're still reading things through the kernel. So QUIC is essentially the TLS-in-user-space case, and that destroys the performance too. And the other bad thing about QUIC is that it's the combination of that with, from the networking section, TSO and LRO disabled. Now, there are some hardware vendors working on QUIC offload. I was actually talking to a small company — I think the last time I talked to them was last winter, and I think they've actually gone out of business, unfortunately — but they had a QUIC offload solution that did essentially kernel TLS and something like sendfile for QUIC. Until solutions like that get popular, though, I have really huge concerns about QUIC for performance reasons.

"Is there something preventing you from duplicating hot data to other NUMA zones?"

I'm sorry, I missed the critical part of the question — is there something preventing me from what? Duplicating data to other NUMA zones? That's hot data, hot content? Can you take your mask off just for a second? Is there something preventing you from copying hot content to other NUMA zones — no, there's not. It's basically development work, and the way we view these NUMA platforms right now is essentially as research vehicles, to see what problems we're going to hit internally, with our infrastructure and with our other software, when we get to these bandwidths. For example, this machine is essentially a prototype of a future-looking machine based around next-generation DDR5, which more than doubles the memory bandwidth, and PCI Express Gen 5. We were hoping that I would have results from a machine like that to share, because I really didn't want to talk about NUMA — and unfortunately a lot of this became about NUMA — but we just don't have those machines yet. We don't have the PCI Express Gen 5 400 gig NICs yet; we don't have the platforms based on DDR5 yet. But the interesting thing we found with this science experiment is that what was really killing us was two things: logging, and some local socket buffer resource management we have to do. We give each Nginx worker a quota of how much socket buffer space it can use, and that code was getting a little bit in the way, along with just the way we do logging.

"What I wanted to ask is: you said that with the disk-based siloing you have the problem that the incoming connection might be on the wrong NIC, and you have costs with moving it, and probably stuff going out on the wrong NIC. So have you thought about exposing that with multiple IPs, and making the network aware — with BGP or something else — of which NIC it should be interrogated on for specific content?"
That's one solution, but the IPv4 address space is limited, and these are all internet-facing addresses, so we'd rather not burn more IP addresses. I mean, there are a lot of solutions we could use if we really had to, but like I said, this is just our research vehicle; it's not intended for production. In fact, another thing we could do: after my first talk about NUMA in 2019, somebody came up to me and made me aware that there are extensions to LACP where you can notify the router that you want to move a connection. It's not implemented in all routers' firmware, at least as far as our network team was aware, so it's not a solution I could use today. But there are a lot of things that could be done to make disk-centric siloing real; given its current status as a research vehicle, it's just not worth the extra effort. That's one of the reasons I've never upstreamed any of this stuff — I upstreamed some research stuff for zero-copy sockets in 1999, which ended up haunting me for years, so no more upstreaming research stuff.

Anything else? Final chance? All right — thank you.