So, my name is Drew Gallatin, and I'm from Netflix, and I focus basically on CPU efficiency and trying to serve as many customers as we can from as little hardware as we can, both to save the environment and to save ourselves some costs. So for the last little bit, I've been trying to push the envelope with serving as much as we can from a single box, and I'm here to talk to you about getting as close as we can to serving 400 gigs from a single server on FreeBSD. So I'm going to talk about, you know, why we want to do this, and I'm going to describe a little bit about our production platform and talk about the workload that we have, and then talk a little bit about whether or not we need NUMA to achieve these goals. Talk a little bit about hardware inline TLS, which in FreeBSD we call NIC TLS. And then if I have time at the end, I'll loop around and look at some alternate platforms in addition to the production platform that I'll be talking about for most of the talk. All right, since 2020, we've been rolling out servers that have been able to serve 200 gigs of TLS-encrypted video. So that's really cool. The thing is, these servers actually have four 100 gig network ports, and we've always thought, well, wouldn't it be nice if we could actually light them all up and serve 400 gigs from these machines rather than 200? So let's look at our workload a little bit. So we run FreeBSD current. We're generally, you know, a few weeks behind head, not too far. We run the NGINX web server. And we serve our video almost entirely via sendfile, and that means that we love kernel TLS, because that means that the data is encrypted in the kernel as part of the sendfile pipeline, which means we don't have any extra boundary crossings where data is pulled off a disk, encrypted in user space, and then written back to the kernel, none of those detours. And just by itself, software kernel TLS saves us roughly 60% CPU.
The production platform I keep referring to is an AMD EPYC 7502P, otherwise known as Rome. It has 32 cores at two and a half gigahertz. And the more important thing is it has eight channels of DDR4-3200. And that's enough for 150 gigabytes a second of memory bandwidth. Now, through this talk, I'm going to be talking about networking stuff, which is gigabits, and storage slash memory bandwidth stuff, which is typically gigabytes. I'm going to try to remember to go back and forth by that factor of eight, and if I forget, I apologize. But I'm just going to try to do both things here and mention that that's about 1.2 terabits of bandwidth in terms of memory bandwidth. We have plenty of IO. We have basically 128 lanes of PCIe Gen 4, which is about two terabits in networking units. So all that sounds good. And like I mentioned before, we have two Mellanox ConnectX-6 Dx NICs. They connect at Gen 4 x16, and they have two full-speed ports each. So that's four 100 gig ports. And hence our motivation: we want to use all four of those ports. And we have a whole bunch of NVMe drives, and these machines were spec'd out way before we could find a full-speed Gen 4 drive, so we have just a lot of Gen 3 drives. So our initial performance results on this platform are about 240 gigabits a second, and this is with software kernel TLS. And we're limited primarily by memory bandwidth. And the way that I know that, I'm not just mystical: basically AMD has given us tools, and I think you can get this tool yourself if you Google for it, AMD uProf's PCM. It reads the system counters and it tells you what the memory bandwidth is. So the gist of it is that we reach the rated system memory bandwidth at about 240 gigabits a second, and at that point, things start to collapse. So to understand where the bandwidth is going, I'm gonna bring you back to this slide that should look familiar if you saw my talk two years ago.
In fact, a lot of this next section is gonna look familiar if you saw my talk two years ago, so I apologize for that. Basically, now I wish I had a laser pointer; I was trying to play with my mouse and it just doesn't work well enough to see. But essentially the data flow is that a Netflix client, you're watching a movie, has an algorithm where it says it wants to fetch so many megabytes ahead. So it says it wants to fetch the next two megabytes. And Nginx gets a request for that, and it launches a sendfile request to satisfy it. And what happens there is that the data is brought in from those NVMe drives in the lower left-hand corner. And in aggregate, it's read at 50 gigabytes a second. And it's read into system memory. Now, once it's read into system memory, we have to encrypt it to send it on the wire. So the CPU then reads it out of memory again, encrypts it, and then allocates a separate destination buffer, which is kind of an important aside: basically, most crypto does crypto in place. But because we're talking about files here that live in the page cache, we cannot encrypt in place, because if we did, then if I was streaming something and you were streaming something and we encrypted it with my TLS keys, you would get something that was doubly encrypted, and it would just be garbage and it would be a bad day. So we have to encrypt to a separate TLS crypto buffer. So once we do that, we write it back out at 50 gigabytes a second. And then the network card comes along and DMAs it and sends it out on the wire. And that's another 50 gigabytes a second. So if you add all those things up, that 50 gigabytes a second gets multiplied by four. So you wind up with 200 gigabytes a second, or 1.6 terabits a second, of memory bandwidth that you need to achieve your goal. And obviously we don't have that much memory bandwidth. So one of the things I was wondering is: can NUMA get us closer?
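That multiplication is worth making concrete. Here's a quick back-of-the-envelope sketch in Python; it's just a model of the arithmetic from the talk, not a measurement:

```python
# Memory-bandwidth budget for software kernel TLS at a 400 Gb/s
# serving target, using the numbers from the talk.
TARGET_GBITS = 400
wire_gbytes = TARGET_GBITS / 8        # 50 GB/s of payload on the wire

# Every byte served touches memory four times with software crypto:
#   1. NVMe DMA writes plaintext into the page cache
#   2. the CPU reads the plaintext to encrypt it
#   3. the CPU writes ciphertext into a separate TLS buffer
#   4. the NIC DMAs the ciphertext back out
touches = 4
needed = wire_gbytes * touches        # 200 GB/s, i.e. 1.6 Tb/s
available = 150                       # GB/s from 8 channels of DDR4-3200

print(f"need {needed:.0f} GB/s, have {available} GB/s")
```

With four touches, 150 GB/s of memory bandwidth caps you somewhere around 300 Gb/s even in theory, which is roughly consistent with the system actually saturating near 240 Gb/s.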
So one of the things NUMA does, when you run AMD machines in NUMA mode, and I'm gonna explain what NUMA is in a second, is let you make more efficient use of their memory controllers. And so if you look at STREAM results on the web, you see that for the system that we have, people generally get about 150 gigabytes a second in flat mode. But if you run the AMD system in four nodes per socket, you get about 175 gigabytes a second. So what is NUMA? NUMA is non-uniform memory access. Now, these slides are basically dupes of what I did two years ago. They're background; if you already know this, go get a cup of coffee. Basically, it means that memory or devices can be closer to each other. So if you go back to the early days when we built big multiprocessors, you could just plug in CPUs and memory and disk wherever you wanted to. Everything was equally close to everything else. Software people just didn't have to care about where things were, but it didn't scale very well. So hardware people eventually realized that the best thing to do is to divide and conquer. And what they came up with was this NUMA system where, essentially, you can tie multiple computers together with a NUMA bus. So if you look at the system on the left and the system on the right inside those blue circles, essentially each one of those systems is almost a complete computer. It has a network card, it has memory, it has disks, and it has a processor. And ideally, you want to stay within that blue circle. If you cross that red dotted line, then you wind up crossing the NUMA bus, and your access to things on that side of the computer ends up being slower. Different NUMA buses have different efficiencies. The penalty could be almost nothing or it could be really severe. It just depends on what you're talking about.
So it gets even more complicated with the AMD EPYCs, because the most efficient way to run them is in four-nodes-per-socket mode. And so now we've just multiplied the problem: instead of having two NUMA domains, now we have four. Luckily, they're all interconnected. So you get a picture of how complicated things get. And the latency penalties across these dotted lines, depending on which ones you cross and what you're doing, can be anywhere from 12 to 28 nanoseconds. So it's not free. And the bandwidth limit for the system that we have is about 47 gigabytes a second per link, or 280 gigabytes a second total. So basically, the strategy that I came up with a few years ago is to keep as much of our bulk data off the NUMA fabric as we can. You can think of it like a highway: you don't want to fill it up with big semi-trucks and block the commuter cars. If your highway is full of this bulk data, like those big semi-trucks, then the commuter cars can't get through on their urgent missions, and you wind up blocking the CPU while you wait for small things to get through. And I'm talking about even simple things like updating a counter that lives in a different NUMA domain, or a process needing to read something from some other NUMA domain. You want to keep those links as idle as you can. So I'm gonna go through and talk about the data flow and the worst things that could possibly happen. So in the worst case, a request comes in, and let's say it comes into the lower-left NUMA node. And I really wish I had a laser pointer, because that would make this a lot better; maybe you can see my mouse moving. The request comes in here, and we happen to have the content in a different place. So we ask to read the content, and the content goes into the node on the upper right-hand side, because we allocated the memory in the wrong place.
Since the request was on our local node, we probably should have allocated memory there, but we were just too dumb to do that. So now we've crossed the NUMA bus. And now once it's in memory, the CPU needs to read it to encrypt it. Ideally we'd pick a CPU up there in the same node, but since we're being stupid, we pick one on the lower right-hand side, and we cross the NUMA bus again. Now after we encrypt it, we need to put it somewhere, so we have to allocate a destination crypto buffer. And if we were being smart, we'd allocate it right where we are, but we're not, so we allocate one on the upper left-hand side. And after all is said and done, we need to send it out, and that network card is in the lower left-hand side. So we've crossed the NUMA bus basically four times, and that's pretty close to the aggregate fabric bandwidth, and that causes the fabric to congest, and that's bad. So we don't wanna do that. We have to do something a little bit smarter. So in the best case, the ideal case, we just stay on that lower left-hand side there and never leave it. And that would lead to zero NUMA crossings and nothing on the fabric. That's where we wanna get, but how close can we get? Basically, we're constrained to using one IP address per machine, and we're using LACP for bonding. So I came up with two ideas back in the day. One is to focus on where the content lives; I call that disk-centric siloing. The other is to focus on where the connection came in; I call that network-centric siloing. And as for what the LACP partner chooses: essentially, when you're using LACP, the link partners will hash on their own criteria, typically based on TCP ports and IP addresses, to get a good sharding of traffic across all the link members. And as long as the lag isn't flapping, with members coming and going, that hashing is consistent. So you wind up with the same connection going to the same NIC all the time.
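The LACP sharding can be illustrated with a toy layer-3+4 hash. Real link partners use their own vendor-specific hashes, so the function below is purely illustrative; the point is only that the flow-to-member mapping is deterministic while the lag is stable:

```python
# Illustrative LACP-style layer-3+4 hash: shard flows across lag
# members.  The same TCP 4-tuple always maps to the same member, so
# a given connection keeps arriving on the same NIC.
import ipaddress
from zlib import crc32

def lag_member(src_ip, dst_ip, src_port, dst_port, n_members=4):
    key = (ipaddress.ip_address(src_ip).packed
           + ipaddress.ip_address(dst_ip).packed
           + src_port.to_bytes(2, "big")
           + dst_port.to_bytes(2, "big"))
    return crc32(key) % n_members

# Same 4-tuple, same member, every time.
a = lag_member("198.51.100.7", "203.0.113.9", 49152, 443)
b = lag_member("198.51.100.7", "203.0.113.9", 49152, 443)
assert a == b
```

This is also why the siloing can key off the ingress NIC: once the partner's hash has picked a port for a connection, that choice is stable for the connection's lifetime.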
So what I moved forward with was the network-centric siloing. I actually fleshed out disk siloing too, and I have some backup slides at the end if I get questions about that. I ran that to completion, and I believe that network-centric siloing is by far the best way to go. So basically you allocate the network connections with affinity to the NUMA nodes. You allocate the memory that backs the media files locally when they're DMAed from disk. You allocate local memory for TLS crypto, and you run TLS workers, TCP pacers, TCP timers, and all that kind of stuff with domain affinity. And you choose a local NIC to send the results back to the client. All of this is actually upstream. I think it might even be in 13; we run current, and I don't track 13 very well, but I know all of this stuff is upstream. So now let's look at the worst case when we're making these better decisions. The same scenario as before: the network connection comes in on the lower left side. This time, though, the content lives on the right side, just so I have something to show you. So what happens is we allocate a local buffer on the lower left-hand side to store the content when it comes in from disk. Now when we go to encrypt it, we choose that same local CPU. We've allocated a crypto buffer that's local, so we write it back there, and then we send it back out on the network card that it came in on. And boom, we've only crossed the NUMA bus once rather than four times. So basically the worst case is we cross it once. We only put 50 gigabytes on the fabric rather than 200 gigabytes, and that helps a lot. The real problem, though, is that real life is messy. Remember I said those NICs were two by 100?
So that means you can only put NICs in two NUMA domains, which leaves two NUMA domains that don't have NICs. There's no need for more, because two NICs already give us our four 100 gig ports, and NICs are expensive and we don't want to replicate them all around. So right now I have hacks to sort of pretend that one of the ports is in a different domain, and this impacts the worst and the average cases in the following way. Basically, that NIC really isn't down on the left-hand side anymore; actually it's up on the upper left-hand side. And so rather than going right out that same NUMA node at the end, you actually have to go across NUMA again, and that impacts the worst case: now instead of putting 50 gigs on the NUMA bus, we're putting 100 gigs. Now if you look at the average case, it gets a little bit better. If you think about it, with four nodes there's basically a one-in-four chance that the content's gonna be local, and a three-in-four chance that it's not. So you have a 75% chance that you're gonna cross the NUMA bus to find content that's not local. And then, since we have two domains with NICs and two domains without, you have a 50% chance that you're gonna need to cross the NUMA bus to go out on a NIC that actually exists. So that leads us to about 62 and a half gigabytes of data on the NUMA fabric, which is still a whole lot better than the 200 gigabytes we would get if we weren't being smart. So did it help? Yeah, it helped a little. It was no panacea, but it gets us from about 240 to about 280, which is better than nothing but still not anywhere near our goals. That might make it worth lighting up a third link, but definitely not lighting up all four links. So can we do better? And when I'm asking that question, basically I'm looking at NIC KTLS. If you remember that diagram back at the beginning of the presentation, these vertical green lines have always bugged the hell out of me.
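Before moving on to NIC KTLS, the average-case siloing arithmetic can be recapped in one place as an expected-value sketch (numbers from the talk):

```python
# Expected NUMA-fabric traffic with network-centric siloing on a
# four-domain Rome where only two domains have NICs.
wire = 50.0                 # GB/s of payload at 400 Gb/s

p_content_remote = 3 / 4    # content equally likely in any of 4 domains
p_nic_remote = 1 / 2        # only 2 of 4 domains have a NIC

crossings = p_content_remote + p_nic_remote   # expected 1.25 crossings
fabric = wire * crossings                     # 62.5 GB/s
naive = wire * 4                              # 200 GB/s without siloing
print(f"{fabric} GB/s on the fabric, versus {naive} with no siloing")
```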
So if you can see my mouse moving around, these lines here and here are where essentially the CPU has to read the plaintext data out of memory, encrypt it, and then push it back. And it doesn't even need to be the CPU; it could be a QAT or a Chelsio CCR or a look-aside card. Whatever it is, it's got to read the data out of memory and write it back. And that's the part we want to get rid of. If we could get rid of this extra detour through the CPU, or through an accelerator card, we could essentially cut our memory bandwidth requirements in half, and things drop down to the case where it's almost like things are unencrypted. Data flows in from the disks to memory and out the NIC, and the CPU just stays uninvolved. So let's talk about this magic thing. What is NIC KTLS? Basically it's what the rest of the world calls hardware inline TLS. Normally with KTLS, the session is established in user space; that's still the same, but now, instead of the crypto moving to the kernel, the kernel turns around, does the hot potato, and hands the crypto off to the NIC. And so the TLS records are encrypted by the NIC as the data flows through it on transmit. The positive thing is there's no more detour through the CPU for crypto, and we've reduced our memory bandwidth requirements. The negative thing, for some people, is that your data, which used to be encrypted in user space so it was always encrypted in the kernel, and which with KTLS is in the clear in the kernel, now goes even further: it's in the clear going across PCI Express. I realize some people may have problems with that. For this workload I don't, but it's worth putting out there. So the NICs that we're using for this are the Mellanox ConnectX-6 Dx. They offload TLS 1.2 and 1.3 for AES-GCM; that's by far our most popular cipher, and one of the most popular ciphers on the web.
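The "cut in half" claim follows directly from the same touch-counting model as before; a two-line sketch, not a measurement:

```python
# Memory traffic per byte served: software KTLS touches memory four
# times; inline NIC TLS only twice, because the crypto detour through
# the CPU (read plaintext, write ciphertext) disappears.
wire = 50.0                    # GB/s of payload at 400 Gb/s
software_ktls = wire * 4       # disk in, CPU read, CPU write, NIC out
nic_ktls = wire * 2            # disk in, NIC out
print(software_ktls, nic_ktls)
```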
So the thing to remember about these NICs is that they retain crypto state within a TLS record. That means TCP can send the first couple of segments of a TLS record and stop for a while because it's waiting for an ACK, and then a couple milliseconds, a couple seconds, whenever, the ACK can come in, and once it does, TCP can pick right back up in the middle of that TLS record, and the NIC will have remembered the crypto state, and things will just keep on going as normal. The bad case is if a packet is sent out of order, a TCP retransmission, because at that point the NIC no longer has the state from some arbitrary point in the TCP stream, and it needs to do some extra work to recover that state. So I'm gonna go through a little diagram which describes how transmits work with the ConnectX-6 Dx for TLS. All right, so in this eye chart, the upper rectangle is a representation of host memory, with a TLS record sitting unencrypted in a socket buffer. The numbers you see are essentially TCP segment boundaries. So you notice it starts at zero, 1448, 2896 and so on. And then you see the PCI Express bus, a representation of the NIC, and then the 100 gig network. So when we go to send something, TCP will say, hey, I wanna send the first four, five, six segments of the record, and the NIC will go and DMA them. It may do something smarter than this; I don't know how their NIC works. It may DMA 4K at a time, it may DMA the individual segments, I have no idea, that's all hidden from us. But the NIC DMAs the plaintext data down, encrypts it in the NIC, and then sends it out on the wire. And obviously it sends it as packets, not just one big burst, because we don't have a giant MTU, but my artistic abilities are limited. So that's what this diagram looks like. Now, TCP comes along later and says, hey, I wanna send the next five segments from this record.
And so then the NIC, as you would expect, DMAs them down and encrypts them and sends them on the wire. And then time passes, we get some more ACKs, more window space is freed up. TCP says, hey, I wanna send that last segment. And so the NIC DMAs it down and encrypts it and sends it on the wire. That's all well and good. And you can see that the NIC doesn't have to do any extra work when it picks up where it left off in the middle. It's all straightforward. TCP could send a whole segment or just part of a segment; it doesn't matter. But for a retransmit, let's say your cell phone was flaky and you lost one chunk out of the video stream, and you send a SACK for it, and TCP needs to resend that segment at 14K. What you would expect is something like this, where the NIC just DMAs it down, encrypts it, and sends it on the wire. But no. Because the NIC no longer has the state for this arbitrary position in the TCP stream, what has to happen is that the NIC has to DMA the entire TLS record, all the way down to the NIC. It has to encrypt all of that, up to the point of the current segment, in order to recover the crypto state. Once it has the crypto state, it can encrypt that one segment we wanna send, and it can send it on the wire. So that's a little bit painful, and it means we have to watch out for TCP retransmissions, because they could easily run you out of PCI Express bandwidth. I'm gonna talk a little bit more about that. So first I'm gonna talk about some results. I mean, this is basically a war-stories talk; there are no new tales to tell here. Basically, we got a very early beta firmware from Mellanox, I think it was before the pandemic, and we had pretty good peak bandwidth. We got about 250 gigabits a second total out of the system. The problem is that the sustained bandwidth was pretty terrible.
So what happened was, basically, the way our systems work is that we attract load via this thing we call health, which runs off a PID controller with various inputs. A lot of the inputs are things like: is the network card close to capacity? Is the CPU saturated? Are the disks saturated? And whatnot. So what happened was, as we kept adding more and more sessions, the CPU was very low, the network card was nowhere near its limits, and so we said, we're healthy, give me more traffic, give me more traffic. And as more and more connections arrived, the NIC slowed down more and more. So we wound up at a floor of about 75 gigs per NIC, or 150 gigs per system. And we just kept attracting more and more and more clients, and it sort of turned into a tar pit, because this kind of limit is invisible to our PID controller. So we really didn't like it for that reason, and it was not a big enough advantage for us to move forward with inline TLS. Now, what exactly was happening? All of this is my supposition; it's not anything I know from talking to Mellanox, or NDA stuff, or anything like that. But basically we know that the NIC stores TLS state per session. We know that because it can resume in the middle of a TLS record. And we know that we at Netflix have a ginormous number of TLS sessions active. Each client has multiple video connections, anywhere between one, typically two to four to six; buggy clients have had tens or hundreds. We have one audio and one subtitle connection. So by rule of thumb, we estimate that we need 400,000 sessions for 400 gigabits. And we noticed that the performance gets worse the more sessions we have. My supposition is that there's a limited amount of memory on the NIC. And I know from reading the Mellanox driver that it will allocate host memory on behalf of the NIC and let the NIC do whatever it wants with that memory.
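The scale involved can be ballparked. The per-session state size on the NIC is not public, so the 512 bytes below is purely an assumed figure, chosen only to show the order of magnitude:

```python
# Rough size of NIC-resident TLS session state at Netflix scale.
# 512 bytes per session is an ASSUMED number, not a Mellanox spec.
sessions = 400_000                    # rule of thumb for 400 Gb/s
assumed_state_bytes = 512
total_mb = sessions * assumed_state_bytes / 2**20
print(f"about {total_mb:.0f} MB of session state")
```

Even at a few hundred bytes per session, the total lands in the hundreds of megabytes, which is plausibly more than the NIC has on board.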
So my supposition is that the memory on the NIC is exceeded once we have too many sessions, and it's just paging in and out of the host, and things get really bad once that memory is thrashed. I don't know that for sure; I never got confirmation from Mellanox, but that's what I suspect. So AMD suggested we enable PCIe relaxed ordering. What PCIe relaxed ordering does is basically allow transactions to pass each other. So the theory was that it would help with paging in this TLS connection state. So we enabled it, and it didn't help. But later we found out that in the ConnectX-6 firmware, even though you could toggle that bit in PCI config space, it wasn't actually connected to anything, and nothing actually happened, because it was hard-coded off. Time passes and we get a new firmware, and this one's much better. This one actually enabled relaxed ordering, and they may have done some other things; I don't have access to the firmware, but all I know is that with this one, we got 160 gigabits per second per NIC, or 320 gigabits a second total. And at this point, we were much happier, because the peak and the sustained were essentially the same. The peak was a little bit higher, but basically our machine wasn't turning into a tar pit. This was a new record for us; 320 gigs is a lot better than the 200-some gigs we were getting from software TLS. And on a per-NIC basis, it's nearly as fast as software TLS: 160 versus 190 gigabits per NIC. And while I'm talking about per-NIC numbers: on one of these machines, if you just disable one of the NICs and run it with software TLS, you can see that a NIC can push 190 gigabits. You can't get 190 gigabits from two, because of the memory bandwidth limits. So anyway, we thought that was about as good as it was gonna get. A few months later, they surprised us with a new production firmware which had a knob called TLS optimized.
I still don't know what that does, except it makes things much better. And now we get 190 gigabits per second per NIC, or 380 gigabits a second total. So that's a pretty good number. Let's start using it. Well, wait a second, we're Netflix; we can't actually start using things yet. We have to do QoE testing. QoE is basically the quality of experience for the customer. We measure things like the rebuffer rate, like how many times you get that little spinny when you're watching Netflix video; how long it takes between pressing play and you seeing the video show up; and how long it takes between when you start playing and when the video looks decent, or excellent really. And the initial results from this NIC were good. Contrast that with a NIC we tried in the past which did not have this feature of being able to pick up in the middle of a TLS record. With that NIC, in order to make it work, Randall Stewart had to essentially hack the TCP code to become aware of TLS record boundaries and to try super, super hard to only send data on TLS record boundaries. And that caused really miserable QoE, so we couldn't move forward with that other NIC. Because this NIC can resume where it left off, the QoE is fine. So we did an initial study, and everything was fine. Now that we have more of these deployed, we have more experience, and we're gonna do a larger, more complete study before we enable it. The other thing we need to do before we enable it is come up with a way to defend against TCP retransmits. Remember those slides way, way, way back, where I was talking about what happens when we do a retransmit? What happens there is you wind up with essentially an amplification attack. Whether it's intentional or not, a lossy TCP connection can run you out of PCI Express bandwidth fast. So you wanna make sure that this almost never happens.
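The amplification can be modeled in a couple of lines. This is my reading of the retransmit behaviour described above, using typical segment and record sizes rather than measured ones:

```python
# DMA cost of sending one TCP segment from a TLS record with inline
# NIC TLS.  An in-order send costs just the segment; a retransmit
# makes the NIC re-DMA and re-encrypt the record from its start up to
# the segment, to rebuild the AES-GCM state.
MSS = 1448                    # typical TCP segment payload

def dma_bytes(record_len, seg_offset, in_order):
    seg = min(MSS, record_len - seg_offset)
    if in_order:
        return seg
    return seg_offset + seg   # replay everything up to the segment

record = 16384                # a full 16 KB TLS record
print(dma_bytes(record, 14480, in_order=True))    # 1448 bytes
print(dma_bytes(record, 14480, in_order=False))   # 15928 bytes, ~11x
```

So retransmitting one segment near the end of a record costs roughly a whole record of PCIe read bandwidth, which is why lossy connections are dangerous here.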
So the way we do that is to monitor the bytes retransmitted on lossy networks and to move connections from hardware back into software once they exceed a certain threshold of retransmits. And we also monitor the segments that are retransmitted, in case some hacker gets clever and decides, well, I'm not gonna ask for very many bytes retransmitted, but I might ask for a lot of segments, and that's still gonna make the NIC re-DMA all that stuff. In that case, we'll probably kill the connections rather than moving them to software, since it's pretty clear it's an attack. One interesting note here is the mixed hardware-and-software performance. I've noticed that if I set the threshold to 1%, meaning that if a TCP connection has more than 1% retransmitted bytes it gets moved to software, then that moves roughly 25 to 33% of connections to software. And that moves our max stable bandwidth down from 380 to 350 gigabits a second. That's a larger impact than I would expect, and it's something I need to look into, because just a third of the connections really shouldn't be putting that much stress on the system. Memory bandwidth looks fine, the DMA latency looks fine; I just don't understand why it's as bad as it is. All right, so now we're gonna talk about what happens when we combine these two techniques, when we combine kernel TLS and NUMA. This is the you-put-your-chocolate-in-my-peanut-butter kind of moment, if you remember those ads from the 80s. So this is the same diagram as we had before, except things get a lot simpler. The TLS connection comes in on the lower left-hand side. And again, the data lives up in the upper right-hand side, and we bring it to the lower left-hand side. So now, instead of doing any crypto or anything, boom, we just send it out on the NIC. So everything's well and good, much simpler.
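The retransmit safety belt described a moment ago can be sketched as a simple classifier. The thresholds and the function are illustrative, not Netflix's actual code:

```python
# Demote lossy connections from inline NIC crypto to software KTLS,
# and kill connections whose segment-retransmit behaviour looks like
# a deliberate amplification attack.  Thresholds are made up.
def classify(bytes_sent, bytes_rexmit, segs_rexmit,
             byte_thresh=0.01, seg_limit=1000):
    if segs_rexmit > seg_limit:
        return "kill"          # clearly abusive: drop the connection
    if bytes_sent and bytes_rexmit / bytes_sent > byte_thresh:
        return "software"      # lossy: stop burning PCIe bandwidth
    return "hardware"

print(classify(10_000_000, 50_000, 10))     # 0.5% loss: stays hardware
print(classify(10_000_000, 200_000, 10))    # 2% loss: moved to software
print(classify(10_000_000, 0, 5_000))       # segment flood: killed
```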
Now in real life, if you remember, we don't have NICs on every node. So in this case, we might need to send it out on the NIC on the upper left-hand side. Essentially, the worst case is the same worst case that we had before; nothing changes. And the average case is the same average case in terms of NUMA crossings; nothing changes. The only thing that really changes is that the detour through memory is completely avoided, and we save some memory bandwidth. So here's a big eye chart for you of all the performance numbers I've talked about so far. First we start off with software TLS, where we get about 240 gigabits a second, and we're at about 80% CPU when the memory system saturates. Then with software TLS and NUMA, we get to about 280, again at 80% CPU. Now if we run the AMDs in flat mode, just one node per socket, and use hardware kernel TLS, we wind up with 380 gigs at about 60-ish percent CPU. And NUMA with hardware TLS doesn't change the bandwidth at all, it's still about 380, but it drops the CPU to just a hair over 50%. And then if we go back to the configuration that we're actually probably going to deploy, essentially hardware TLS in a flat topology with some safety belts to keep lossy connections from gumming things up, we wind up in the upper-60s to lower-70s percent CPU at about 350 gigs. So, what do I have left? I've got about 10 minutes left. In the end here, I'm going to talk about some alternate platforms that we looked at. One of the interesting alternate platforms was the Ampere Altra, which is a three gigahertz, 80-core ARM Neoverse CPU. And if you kind of squint, it looks very similar to the AMD. It's got the same number of memory channels, same size memory, same number of PCIe lanes, roughly the same storage, and the same networking.
So really the only important thing that gets swapped out here is the CPU, which is a completely different architecture. And because it's a different architecture, at least on FreeBSD, I don't currently have a way to do all the fun stuff I do on Intel and AMD, like looking at the memory bandwidth being used, running fancy profilers, and looking at the IO bandwidth being used. So it kind of makes me feel like I'm driving blind. When we initially started playing with it, we got some really poor performance with software kernel TLS; we got like 120 or 140 gigabits a second. Mark Johnston came up with an idea, which is brilliant and which I kicked myself for not thinking of, but all brilliant ideas are obvious in hindsight. He came up with the idea of using a UMA cache zone to cache 16K contiguous TLS crypto destination buffers. And once we did that, we got up to about 180 gigabits a second. So that's pretty good. It's still not AMD level, but it's pretty good. And then we tried NIC TLS, the hardware TLS, and we found that we were PCIe limited at about 240 gigs. The CPU utilization was shockingly low, like in the 20% range, but the NICs were saturated and started dropping on output. When I was looking at this, I noticed that one difference between the AMD systems and the Ampere systems is that extended tags were not enabled. I'm not sure who's responsible for enabling them, but on Ampere with FreeBSD, they just didn't get enabled. So, the importance of extended tags: PCI Express is almost more like a network than a bus. And just like increasing the window size in TCP, if you can increase the PCI Express tag space, you can have more balls in the air at the same time, which gives you a bigger pipeline and leads to bigger bandwidth. By default, the PCI Express tag space is just five bits, or 32 tags. If you enable extended tags, then you get eight bits and 256 tags.
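The window analogy can be made quantitative: with split-transaction reads, throughput is capped by tags times request size divided by round-trip latency. The request size and latency below are assumed illustrative values, not measurements from this hardware:

```python
# PCIe read throughput limited by outstanding-request tags, by
# analogy with a TCP window.  512-byte read completions and 900 ns
# of round-trip latency are assumed values, for illustration only.
def max_gbps(tags, req_bytes=512, latency_ns=900):
    return tags * req_bytes * 8 / latency_ns     # gigabits per second

print(f"{max_gbps(32):.0f} Gb/s with 5-bit tags (32 in flight)")
print(f"{max_gbps(256):.0f} Gb/s with 8-bit tags (256 in flight)")
```

Whatever the exact constants, the ceiling scales linearly with the tag count, so going from 32 to 256 tags raises it eightfold, which is consistent with extended tags unsticking the Ampere box.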
And once we did that, we got the bandwidth up to 320 gigabits a second. Still not AMD performance, but I don't have as much experience on this platform, and I don't have the tools I have everywhere else, so it's possible we could do better. I don't want to sell Ampere short, but this is where we are right now. The last platform I want to talk about is one I literally got into a data center sometime last week: the Intel Ice Lake Xeon. It took us forever to get this working because of storage. The difference with this system, besides the CPU, is that it's only got 64 lanes of Gen 4, which means we can't use an insane number of Gen 3 drives. We have to use Gen 4 drives, or a PCI Express switch, and the switches we were using had weird problems. So, real-life messy details. This is the first time we were able to build something close to a full-speed system. It has 20 Kioxia four-terabyte NVMe drives and, again, the same two Mellanox NICs. One important difference here is that we have eight channels of DDR4-3200, but this SKU of the Intel Xeon runs them at 2933. I personally hate that. I wish Intel ran all their memory at the same speed across their product lines like everybody else does, but it is what it is. So the Intel results are limited by memory bandwidth, at about 230 gigabits a second. That makes sense because the memory is running at a slower speed than on the AMD, so you would expect the performance to match AMD in flat mode if the memory ran at the same speed. I wanted to try hardware TLS, but I'm running a BIOS that locks out everything. I got a fix on Friday to enable relaxed ordering to see if we could get some better performance from KTLS, but unfortunately the fix didn't work, so I don't have any results I'm comfortable presenting.
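A quick sanity check (my arithmetic, not from the talk) that the slower DIMM speed is consistent with the Intel number. These are theoretical peaks, so they come out higher than the ~150 GB/s achievable figure quoted for the AMD box:

```python
# Theoretical peak memory bandwidth: channels x transfer rate x 8 bytes.
channels, bytes_per_transfer = 8, 8

def peak_gb_per_s(megatransfers):
    """Theoretical peak memory bandwidth in GB/s for 8 DDR4 channels."""
    return channels * megatransfers * 1e6 * bytes_per_transfer / 1e9

amd_peak = peak_gb_per_s(3200)    # DDR4-3200 on the AMD box
intel_peak = peak_gb_per_s(2933)  # same DIMMs clocked down to 2933

# Scale AMD's ~280 Gb/s software-TLS result by the DIMM-speed ratio:
predicted_intel_gbps = 280 * 2933 / 3200
```

Scaling AMD's ~280 Gb/s by the 2933/3200 ratio predicts around 257 Gb/s; the observed ~230 is in the same ballpark, with the gap plausibly down to platform overheads.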
But here's a little summary of where we are with the alternate platforms. For software TLS bandwidth: on AMD we have about 280, on Ampere about 180, and on Intel about 230. For max hardware TLS bandwidth: on AMD about 380, and on Ampere about 320. And I was going to end this with something really impressive, because we built an 800-gig machine. Unfortunately, we had some storms in the USA which delayed shipping, which meant the machine didn't get to the data center in time to get racked. So maybe I'll be back next year to talk about 800 gigs or something along those lines. In the five minutes left, first I want to thank everybody who helped, especially, on the Netflix side, Warren Haroop. He's one of the people who works on new hardware at Netflix and is in charge of building all these prototypes, so he's like my Santa Claus who brings me all my new toys. I also want to thank the FreeBSD developers I've worked with over the years on these things, especially Mark Johnston for all of his work on NUMA, the VM system, and KTLS; John Baldwin for doing the bulk of the hard work of upstreaming KTLS for us and working with different vendors to hammer out a hardware KTLS interface; and also Jeff Roberson for all of his hard work on NUMA. I'm sure I'm forgetting people, and I don't intend to slight anybody, but I just love working in the community, and everything we've done is a collaboration. My slides will be up after the talk. And so now I have some time for questions, I think about four minutes, so I'll just start reading off the panel here if I can get my mouse over there. If you switch to the shared notes tab at the top, we've organized the questions for you. Oh, okay, I see how it's bubbling up. There we go. All right: is NIC KTLS available in current? Yes, it is, it's available in current. I actually think it's available on 13. But it only works with certain NICs.
In FreeBSD, I believe the only NICs it'll work with are the Mellanox ConnectX-6 Dx and the Chelsio T6. How fast is NFS over TLS? I don't know, that's something we'd have to talk to Rick about. Are we using PLX chips to connect Gen 3 to Gen 4? We were using Broadcom. Did Broadcom buy PLX? I don't remember. We were having some interesting problems with Broadcom PCIe switches where they would run at seemingly Gen 3 bandwidths. It was, I think, a Gen 4 by-eight uplink to four Gen 3 by-four downlinks, and with two Gen 3 by-fours active everything was good, but if you got over two, things were running at Gen 3 speeds. That's why we had all kinds of problems building a machine with just 64 PCIe lanes. Would faster DDR memory, like what's available in client SKUs, help bump the fabric? I'm not sure what you mean by client. Oh, do you mean like gamer machines? Yeah, in terms of fabric bandwidth I think the Infinity Fabric is limited, but it would definitely help with memory bandwidth. I used to overclock memory in servers a while ago. And no, we don't use full TCP offload. We have essentially a whole group at Netflix, parallel to our group, that focuses solely on protocol improvements. That's people like Randall Stewart and Lawrence Stewart, no relation, who do a lot of the TCP work. One of the things we depend on is controlling every aspect of the TCP stack, and that's what allows us to give such good QoE to our customers. So how many flows can the NICs handle, and what happens on overflow? I kind of talked about that a little bit. I don't even know what the limit is; I thought it was 128,000 per port, but I'm not sure. What happens on overflow is that the hardware TLS session is just not set up and it falls back to software. Is the data encrypted in storage too?
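On paper that switch topology shouldn't have been a bottleneck. A quick sketch (my approximate lane-rate figures) shows that even four Gen 3 x4 drives only match, not exceed, a Gen 4 x8 uplink, which is why everything dropping to Gen 3 speeds past two active drives looked like a switch quirk rather than bandwidth math:

```python
# Nominal per-lane PCIe throughput in GB/s (128b/130b encoding);
# Gen 4 is exactly double Gen 3.
GEN3_LANE = 0.985
GEN4_LANE = 2 * GEN3_LANE

uplink = 8 * GEN4_LANE   # Gen 4 x8 uplink from the switch to the host
drive = 4 * GEN3_LANE    # each Gen 3 x4 NVMe downlink

def nominally_oversubscribed(active_drives):
    """True if the active drives could exceed the uplink on paper."""
    return active_drives * drive > uplink
```

Two active drives leave half the uplink free, and four active drives exactly match it, so the observed slowdown past two wasn't explained by the nominal numbers.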
No. Our assets are DRM encrypted, which covers everything between the encoding in the cloud and your set-top box or your phone or whatever, but there's no extra layer of encryption at rest. Do you perform traffic steering to the optimal NIC? No, we don't; we live with the decision our link partner gives us. Somebody grabbed me a couple of years ago after my last talk and said there is a protocol to tell LACP to redirect things, but we don't have that implemented. So we just, whoops, all the questions just appeared. Oh, there we go. So we just use whatever LACP gives us. How do we measure percent CPU? Basically I use the same numbers you would see with vmstat. Actually, I don't use vmstat; I have a tool I wrote, it's in ports, called nstat. It shows you various things: memory bandwidth if you hook it to Intel PCM, CPU use, number of TCP connections, network bandwidth, context switches, interrupts, all that kind of stuff, all in one line. I don't have any thoughts about the Xilinx smart NICs; I'm not aware of them. Xilinx makes me think FPGA, and typically we try to go for ASICs, just for power reasons. We were engaging with a NIC vendor that was building a different kind of TLS offload NIC, where they had something like 40 or 64 gigs of memory and an FPGA on the NIC. That changed the TCP retransmit story, because they were able to cache everything in their NIC, which was nice, except I think the NIC burned as much power as the rest of the system, so it was just not a really good fit for us. We're always willing to try NICs, and we've engaged with other vendors. I actually think FreeBSD is ahead of the curve on NIC TLS.
I know that Linux uses it, but I think we are a little bit ahead of the curve, because one of the things I meant to mention is that when I was first working on kernel TLS, I had this idea of NIC TLS in my head, and I kept it in mind when designing the way mbufs work with kernel TLS. The nice thing about the way mbufs work in FreeBSD is that a single mbuf describes an entire TLS record. That makes retransmits really easy for the driver, because the driver can just look around in that same mbuf and it has everything it needs to do a retransmit. Let's see, this question is still being written. Oh, okay, that answer's still being written. I guess that concludes my talk if there are no more questions. Thank you very much. All right, thank you.