Good afternoon. This is a talk about QUIC. QUIC is a new, next-generation internet transport protocol. It's designed to replace the TCP/TLS stack and sit where you would use HTTP. It offers a number of technical improvements over TCP: it has always-on encryption and authentication, it has congestion control, and it allows the layering of new protocols and services on top — QUIC has the ability to expand and grow. More importantly, it has political will and enthusiastic backing from developers, and QUIC has sucked the air out of the TCP room: all the current work looking at internet transport protocols is now looking to QUIC first. TCP is dead, in some ways. I can't say whether QUIC is going to take over the world, but I think it would be unfair to ignore the place it has already established for itself on the internet, and in the communities that continue to develop the internet fifty years into its lifetime.

Linux has been the platform of choice for QUIC development, and QUIC has been tuned and improved to compete with TCP and TLS there — but all of this has happened on Linux. The implementation developers of the QUICs do care about portability, but Linux developers don't care much about us. QUIC is painfully full of puns and jokes about speed and about using the letters Q, U, I and C, and I guess I've contributed to this too with this talk: making FreeBSD quick.

I'm Tom Jones. I'm now calling myself a recovering internet engineer. I love the film Brazil, so for a while I was calling myself a militant internet engineer, but people always thought I meant military and it wasn't the right vibe. I worked for about eight years doing internet protocol design, development, research and implementation in the IETF, and I'm to blame for a couple of RFCs. Super relevant here is the one on the services offered by UDP, and I wrote something called Datagram Packetization Layer Path MTU Discovery — which I can say faster than anyone else can — which will hopefully grow into use on the internet. I like to hack on the FreeBSD network stack, and for a long time I've wanted to make the internet better; I still do. Working in the IETF enabled some of that, because the running-code mantra allowed us to try stuff in FreeBSD really quickly and see if ideas were any good. A lot of the time they weren't, and that saved people a load of effort, so it was great fun. I say I'm one eighth of the BSDNow hosting team — you can figure out the maths for that. At the start of the year I left academia and entered the real world, and I now hack on FreeBSD with the team at Klara, and it's great.

I spent my last few years in academia working on making sure QUIC could work well over satellite networks. Satellite networks are special in that they disassemble all the traffic and accelerate how it gets through, and QUIC breaks all of that. I spent a lot of time building testbeds and designing experiments to see how QUIC would perform, and I ran a ton of iperf3 tests to make sure my testbeds actually worked. So this talk is a reasonably gentle introduction to doing network performance engineering and evaluation, so we can answer questions we actually, frequently want to answer, of the form "is QUIC on Linux faster than on FreeBSD?". And I have another question, which is the motivator for all of the methodology here, even if it doesn't always come through: what can we take from the Linux development of QUIC and UDP and bring back to FreeBSD?
Because they must have done some stuff to make it go fast.

So let's go all the way back to the beginning: what is a transport protocol? Transport protocols are how we move stuff around on the internet. They sit on top of network protocols — a network protocol is IP or IPv6, and that's the end of the list — and then we have a wider selection of transport protocols. The most common transport protocol on the internet, and really the core of the internet until quite recently, is TCP. TCP offers a reliable, in-order byte stream: the bytes you put into a TCP socket come out the other side, unless something goes wrong — and if you think they've gone missing, you're probably wrong. It provides congestion control to transfer data fairly, which has really been the main area of research in TCP for the last few years. It's a single stream, and it's very simple, and so TCP is a building block for the internet. If TCP had been much more complicated we might not have gotten the internet; but because you can write clients and servers as if you're reading and writing files, with some complexities, we were able to build prototypes of new stuff really quickly, and we got an explosion of activity. Simplicity enabled this. Encryption and authentication weren't really possible when TCP was developed, so they were layered in — this is why it was the Secure Sockets Layer, now TLS: it had to sit on top, and it sat in the middle. Reliable means the data gets there; in-order means that when anything gets lost, everything behind it stalls. That has been one of the drawbacks of TCP: when you have packet loss you lose the ability to be real-time, because you have to resend the data and recover. So that's a bit about TCP.

UDP is the other internet transport protocol. You might know it from such famous films as DNS, DHCP, RTP, TFTP and NTP — simple stuff — and UDP is used as a substrate for a ton of other things; it just sits underneath them. UDP offers a very minimal service. The UDP RFC is two pages long — I put it on slides once and people left. It offers transmission and reception of packets, and port-number multiplexing, and that's it; that really is the meat of what's there. The other 25 pages — RFC 8304 — are the realities of the protocol, but it is very minimal. Multiplexing is important because without port numbers you'd only get one service on your machine: you'd just get UDP packets and that would be it. UDP makes it possible to have very simple protocols for very small devices, and it gives us the ability to bootstrap things, which is why we see it used for DHCP. It is stateless, it's very straightforward to use, it's very easy to implement, you can fit it into very small things, and it's low latency — though low latency is a hilarious claim, really. UDP is unreliable: it doesn't offer any service for the retransmission or reliable delivery of data, and this is really handy when you don't care whether your data gets there. You might wonder why you wouldn't want data to get there, but retransmissions frequently cause delays, so if you're sending live audio or video, your codec can recover a lot of loss and you'd rather the lost data just go away. There's even an extension to UDP called UDP-Lite which allows broken frames to come through, which is really cool if your application can handle it, but it's not used very much — mostly in telcos.
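Just to underline how minimal that service is, here is roughly everything a UDP application sees from the stack — a port to bind, a datagram out, a datagram in. A toy sketch; the addresses are placeholders.

```python
# Roughly the whole UDP service from an application's point of view:
# bind to a port (multiplexing), send a datagram, receive a datagram.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9000))                   # port-number multiplexing
sock.sendto(b"hello", ("192.0.2.1", 9000))     # transmission, no handshake
data, peer = sock.recvfrom(2048)               # reception; no ordering, no retransmission
```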
There are other transport protocols, and the big one in the 2000s was SCTP, the Stream Control Transmission Protocol. It offers reliable, in-order byte streams, it has multi-streaming, it has multi-homing, it has multi-path, and it gets really heavy use in telephony networks — but you probably haven't had much interaction with it directly. It is the core of the original data channel model for WebRTC, and so it enabled some of the cool services you can do there, but it's had little direct use on the internet. That's because it's been considered hard to deploy: when it came around in the 2000s the network was very fragile and there were lots of strange boxes, and the IETF sort of bungled the deployment of SCTP, because for a lot of its history it was going to be tunnelled over UDP — which it can be — and then it became its own protocol number, which made it harder to deploy. And because it's considered difficult to deploy, it's just not really used. There's no NAT standard for SCTP — there is a draft, and there is one paper with a FreeBSD implementation, but there's nothing else. That's the sort of barrier you're up against.

OK, QUIC. QUIC is a reliable, in-order stream protocol. It offers multiple streams and multi-homing, very similar to SCTP. It has always-on authentication and encryption — the authentication and encryption are tightly integrated into the transport protocol — which is something for the world we live in now. It added native features that assist with congestion control, which were missing from TCP, although it took eight years to develop QUIC v1 and over that time some of these were added into TCP too. But it's also extensible. There's a general understanding that you can't get stuff into the Linux kernel TCP stack — it's just horrific — so people wrote that off as impossible. QUIC runs in user space, because it runs on UDP, and people can iterate on it really quickly; that's why there are countless QUIC implementations and very few TCP implementations. QUIC is a framed protocol: it sends packets, the packets are encrypted, and inside them are individual frames. It's very easy to add frame types, and so through the evolution of QUIC — getting to QUIC v1 and beyond — new protocols have been added. On top of QUIC's reliable, in-order byte streams they've added unreliable datagrams, so we have both reliable and unreliable delivery, and these extensions enable new protocols to pop up. This is really exciting, because we can now get protocols deployed on the internet that will actually work.

This is a performance talk. Performance is the measure of a system's ability to do useful work, and when we talk about network performance we're talking about the network's ability to do useful work. When we analyse the performance of a system, we're normally trying to saturate the most expensive component, or something we can't improve — and, for reasons completely beyond me, network bandwidth is limited.
I don't know why; we could have more. So if you're doing network tests, you want to be able to saturate the network bandwidth, because normally you can scale everything else out. It's the bottleneck in the system, and the bottleneck sets the pure capacity of what we can do — it gives us the most that is available. All networks have bottlenecks. A really common one is at home, where your ISP sells you a limited service, or multiplexes you in so more people can use the line and you get more contention. We have tons of algorithms that do bottleneck detection, and in testing, bottlenecks are exactly what we need to watch out for: we get them from lots of different parts of the system. There's the bottleneck at the network interface, which today is the one we're trying to hit, but we also need to be aware that there are bottlenecks further into the system — and if we accidentally hit the wrong bottleneck, all of our tests are wrong. We need to know what is going on and interrogate our results, so that we've done a performance measurement of the right thing and not of the wrong thing. We see bottlenecks around the PCI bus and memory bandwidth — the ability of the CPU to move data around. We can become thermally throttled, which I have no idea how to evaluate on FreeBSD, because it's not helpful: it's just some messages on the console. We can be limited by disks, but thankfully that's not going to happen today, because I don't have any disks involved.

But one we need to be really aware of when looking at a transport protocol is the transport protocol's own mechanisms to protect the network. Transport protocols have two different ways of controlling send rates. The traditional one is flow control: flow control is the receiver saying how much data it's able to take. This developed in TCP as the TCP receive window — a TCP receiver in the '80s might only be able to handle five packets, really small numbers of packets, so it would take some packets and say "leave me alone for a bit", and the windows would be very small. It's the receiver's ability to say "back off". Towards the end of the '80s we had two instances of what's called congestion collapse, where senders overloaded the network, and then debugging the network overloaded the network some more, and everything just fell over. This has become a big boogeyman on the internet and people are very worried about it. I'd love someone to try a global test to see if you can still knock the network over now —
I'm not sure you can. But this led to the development of congestion control. Congestion control is a sender-side way of limiting the traffic you send: it guesses what the capacity of the network is and sends based on that capacity. If we're doing transport protocol evaluation, we want to make sure we're not becoming limited by either of these, and you can easily become flow-control limited. TCP grows its receive window dynamically, in RTT-based steps, and if you run this over a satellite network you get completely receive-limited, because you grow by 16 KB at a time. You get interesting plots from that, but you don't end up very happy — so we need to be aware that this bottleneck can be there. There are other barriers to performance that can come from the OS, but another thing you need to be careful about in performance tests is that you don't become application limited: you don't want your test application to be the thing that isn't sending. Application-limited traffic is very common. Streaming video has a DASH-style workload, where it sends a big chunk of data and then waits, because it doesn't need a sustained rate, so it spends a lot of time application limited. If you run a test like that and you're not saturating the network, that might be why.

We have loads of different units for measuring network performance. Today I'm going to look primarily at throughput tests: a throughput test tells you the number of bits you can pass over a network in an amount of time. But you will also see other values reported, derived in different ways. If you use scp to do a performance test, it's giving you the goodput of the connection — the data you can move over the transport protocol without any of the headers — so you inherently get a lower rate. You also get different units, so you can get really confused when one number is several times smaller than another and they get mixed up. We also get packets-per-second measures, which we normally talk about in megapackets, or maybe kilopackets on some operating systems. And then there are other measures which are relevant but which I'm not going to touch on today: concurrent connections and requests per second. Those two are really important for scaling out a transport protocol, but as we get to later, my testbed doesn't really scale out, so we can't look at them.

We have great tooling for all of these. I always recommend people use iperf3. It's different, but it allows you to flexibly choose between different protocols with a single server — you just run the server and pick a protocol — and it will also give you data in JSON, which makes it really easy to integrate into further processing pipelines, which other tools don't do. For packets per second we have pkt-gen, and for concurrent connections and HTTP-style workloads there are tools like wrk. If you pick a new tool, you need to make sure it's good.
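That JSON output is what makes it easy to script around. Something like this sketch runs a ten-second TCP test and pulls the receiver-side throughput out of the report; the server hostname is a placeholder.

```python
# Sketch: run a 10-second iperf3 TCP test and pull the received throughput
# out of the JSON report. The server hostname is a placeholder.
import json, subprocess

out = subprocess.check_output(
    ["iperf3", "-c", "right.testbed", "-t", "10", "-J"], text=True)
report = json.loads(out)

bps = report["end"]["sum_received"]["bits_per_second"]
print(f"{bps / 1e9:.2f} Gbit/s")
```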
OK — so when we want to do performance measurement, we want to use existing tools, because we're trying to measure something other than the tool itself. If you invent your own tooling for this you start to get different results, because you need to validate that the tooling actually has a valid workload. If somebody comes to you with a brand-new tool and says the network performance is terrible, you then have to think about all of the other places it could be terrible, because the tool isn't battle-hardened. Academia is great for this — "I've never heard of this tool: reject" — so you need to be careful. Network tooling helps us understand what the network can do, and iperf3 is great for this. I don't know how many times I'll say it's a great tool — it's very unfriendly, but it's a great tool, and you can get a lot from it.

iperf3 does throughput tests, and a throughput test is basically just trying to send as much data as you can. You can run these in different ways: you can try to send as much as possible for a fixed amount of time, which is what iperf3 defaults to, or you can measure a fixed amount of data, which can be more useful if you're getting a lot of variation. It sends up to the maximum you can do on a single core — iperf3 cannot scale out to more cores, which is one reason people still use iperf2. You can scale it out to more connections, so you can get interesting variation that way, but you're always going to be running up to a hundred percent of one CPU core, and that will be the top of what you can get. iperf3 runs as a client and a server. The server listens for a control connection; over the control connection it negotiates the transport protocol to use — it will run TCP, UDP or SCTP — and it shares test parameters, which I think I've duplicated on these slides somewhere, and configuration. iperf3 can use different congestion control algorithms; if you run between FreeBSD and Linux and pick a congestion control algorithm whose name is a different string on each side, it breaks, which is fun — someone should fix that. You can run different MTUs. If you run a UDP test with iperf3, it paces the traffic on Linux but not on FreeBSD, and you might see weird artefacts if you're not aware of that, depending on the rates you're running at. By default the client sends, which confuses the hell out of everyone because they expect to measure the other way, but you can send back the other way too. The connection reports, once a second by default, what it's experiencing, and at the end you get results from both the client and the server.

OK, so QUIC has been designed to be used in all the places you would use TCP. Originally it was just a replacement for HTTP/2, but it has scaled up. It has this core design where there are a lot of similarities to TCP, and a lot of the language in the standards calls back to TCP, but it's meant to fit everywhere. And QUIC really does care about throughput and requests per second and concurrent connections, because the people developing QUIC skew almost entirely towards CDN vendors, and they want to be able to replace H2 in their load balancers with QUIC, because they think they're going to get a lot of benefit for those sorts of workloads. During the development of QUIC there was some performance work done, primarily driven by Microsoft, who have a QUIC stack that can also run inside the kernel, and they wrote a QUIC performance draft which defined mechanisms for looking at the performance of QUIC. I'm not going to use any of that today. There's also some work done by Fastly — Fastly have a QUIC implementation called quicly, because, you know, spelling is overrated — and they wrote a blog post, which is my preferred way of disseminating information, looking at the computational efficiency of TCP and QUIC. I think this came out around the middle of 2019, which was near the end of QUIC v1, but there were a lot of voices saying that QUIC was just not going to scale and was never going to be able to match TCP's performance.
There's a lot of fear around QUIC from network operators, and there was a lot of pushback, because the thing you get with QUIC is encrypted traffic: network operators lose the ability to monitor the traffic, and that upsets them. So Fastly wrote this article looking at the computational performance of TCP and QUIC. They ran very similar tests to what I'm going to do today, but they were looking for something else: they measured single-core maximum throughput, but they clocked their core down to 400 MHz on some architecture they had, and that made it really difficult to take their work and reproduce it. I don't want to clock my machine down to 400 MHz, and I've got a different microarchitecture — everything is going to fall out differently — so that's why this work diverges from theirs. Do you perform science, or does it occur? I don't know. Naturally occurring science.

OK, so we're going to build a testbed and we're going to try to answer some questions. Today my question is: is QUIC on Linux faster than on FreeBSD? Or really, the question that motivated this: what can we take from the development of QUIC on Linux, steal, and have back for FreeBSD? Can we look at the development history of a QUIC implementation, figure out how and why it got faster, and see if we can follow? So how do we do this? We're going to measure the throughput of a QUIC connection, we're going to do it while ensuring that QUIC is the bottleneck, and we're going to use CPU saturation as a proxy for QUIC being the bottleneck: if we can completely saturate a core, then we're busy, we know there's nothing more there, and we can start looking at it; and if we're not completely busy, then we need to evaluate the testbed and keep going. To do this we need a representative testbed which is reasonably reproducible — the picture on the first slide was half my testbed. We need a way to measure system load during the tests, because we're using CPU utilisation as a proxy. We need baseline measurements, so that we can check our results against reality. And we need a QUIC to measure and a methodology to follow. So, hardly anything.

All right — the testbed, as promised. Logically, this is the layout: it's made of two machines, left and right — one's on the left, one's on the right as you look at them, not the other way around, because they're behind me. One machine is dual-booting off two different drives, FreeBSD and Linux, and the other machine is running Linux, so that we reduce our variable space: the main thing we're changing is just Linux or FreeBSD on the machine we're testing. We have a control node which runs FreeBSD; they're connected on a 10 gigabit switch, and it does all the marshalling and control of the network. So this is a really simple, small network, and hopefully something you could take home and reproduce yourself. The testbed machines are a matching pair of desktop machines — this was an idea I had, that maybe we could use commodity gaming hardware to build testbeds for a lot less money. It's made of two AMD Ryzen 3900X systems with 32 GB of RAM; they have identical hard drives, apart from one
I couldn't replace — but there's no disk activity here, so I'm happy to say that's not a factor. They have Intel X520 dual-port 10 gigabit NICs, the only ones I could buy because of the shortage, and on the PCI bus there's also some old graphics card, which was the cheapest graphics card I could buy and which still cost a lot of money for something 15 years old. Just the way the world is. The control node is an i3 NUC. The FreeBSD systems are running FreeBSD 13.1-RELEASE — I don't think I've patched anything — and the Linux sides are running Ubuntu 22.04 LTS. I tried not to update anything, but I ran some updates because I had to install software and it wouldn't let me otherwise, so I can't really be super certain about the stability of the versions here. We could go and dump it all out and get an idea of how it looks today — but if the system then updated itself I'd just be furious.

The testbed machines are small and few in number, and their size does limit the sort of investigations we can do. There's a great quote from an IRC channel: you can't reproduce a third of the internet's traffic in a lab. Now, I know Drew can, but I'm not Drew. What we can do is look at the single-thread performance of a QUIC implementation, and this is good, because it's how the QUIC library was developed. When they were working on QUIC they were just trying to get the protocol to go at all; they weren't trying to scale out across loads of cores on a big machine, they were running stuff on a laptop. So we can follow a methodology similar to the developers' test environment, and it will help us interrogate what we're getting — which helps, because we want to see how the system evolved.

OK, so: measuring load, because that's the thing we have to do next. I don't know how you people measure system load — I don't know how anyone does it. The quicly article just said "100% CPU", with no evidence. I think the answer is eyeballing top: I looked at Activity Monitor, it said it's fine. That doesn't scale to automated measurements. I must have run, I don't know, ten thousand measurements through the development of the tooling, setting up the testbed, and getting enough runs that I can fill things out statistically and there's no variation in the timings — and I can't just sit staring at top all day. I want to read Twitter.
So we need something else — but maybe we can look at what top does. On FreeBSD, top reads two sysctls: kern.cp_time and kern.cp_times. kern.cp_time is five values for the whole system — the scheduler's breakdown of user, nice, system, interrupt and idle time — and top basically just prints this out. kern.cp_times gives you those five values for each CPU, so you just read them out in a chain. They're counters since system boot, so reading these sysctls tells you what the system has done in its life so far, and you take multiple measurements to see any variation in performance. It does something like this: it reads the sysctl twice, takes the delta between the two readings, adds them up, and works out the breakdown as percentages of the total. That's all top is doing. So let's just take this methodology. We don't want really intrusive information about what's going on in the CPU, because that will be very difficult to compare between operating systems — we'd get a lot more noise. Instead we can take a coarse-grained metric like this, which is similar to what people really do, which is just looking at top. Thankfully Linux does the same thing: it reads /proc/stat and gets a list of values out — ten of them; there are only seven on this slide because the OpenWRT box it came from doesn't have any guests, I guess. It gives you more values, but it's the same theory.

(My next slide hasn't loaded, so the demo isn't going to work — I'll show it to anybody who wants to see it later.) That was a plot from the tool I wrote to plot out kern.cp_times, and it was a nice little graph. I wrote a shell script that I run via inetd, and it delivers a web page; the web page requests a data URL, the data URL is the same shell script, and it gives you kern.cp_times and a list of the processes. So you get everything you get in top, plus plots of CPU activity — and I have half an extension for network activity, but the scaling on the graphs wasn't quite right because of the orders of magnitude.
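As a rough sketch of that methodology — read kern.cp_times twice, delta the counters, turn it into per-CPU busy percentages — it's only a few lines. The sysctl name is the real FreeBSD one; the rest is illustrative.

```python
# Minimal sketch of the "what top does" measurement on FreeBSD:
# read kern.cp_times twice, delta the counters, report per-CPU busy %.
import subprocess, time

def cp_times():
    out = subprocess.check_output(["sysctl", "-n", "kern.cp_times"], text=True).split()
    vals = list(map(int, out))
    # five counters per CPU: user, nice, system, interrupt, idle
    return [vals[i:i + 5] for i in range(0, len(vals), 5)]

before = cp_times()
time.sleep(1.0)                       # measurement interval
after = cp_times()

for cpu, (b, a) in enumerate(zip(before, after)):
    delta = [x - y for x, y in zip(a, b)]
    total = sum(delta) or 1
    idle = delta[4]
    print(f"cpu{cpu}: {100 * (total - idle) / total:.1f}% busy")
```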
What I found from running these plots is that on FreeBSD the scheduler will just bounce things around. If you run openssl speed — a really simple test — on the four-core machine that's the controller of the testbed, you see it hectically going everywhere: you get an even balance across all of the cores, but that's going to make it very difficult to take measurements. If we run the same tests on Linux we don't get this — we get quite tight binding — and from that I realised that we need to cpuset things so we can actually record a measurement, because we're trying to measure what's happening on one core.

OK, so the next thing we need for our testbed is some baselines, so we can understand what it can do. This gives us a reasonable level of scaling for what we can actually expect, because we're not going to get 400 gigabit out of our 10 gigabit testbed; but we do need to know whether we can saturate the link, or whether we have to be more creative. This is actually the third version of my testbed — the last two versions couldn't saturate the link with UDP, and this one still can't with UDP. So, if we use iperf3 and cpuset it — so we're measuring something closer to what we'll actually be running — we see that with TCP we can hit 9.5 gigabits per second, which I think is about the theoretical maximum for a 10 gigabit link, and that's good. Then I eyeballed top, being lazy, because I haven't integrated iperf3 into my measurement tooling properly, and that's at about 22% of a core. So if we were able to scale that up, then at 100% of one core we might be able to do something like 50 gigabit, which would be pretty cool. UDP is a bit sadder: for UDP, at 100% of a core, we can do 6.3 gigabits per second. Which is quite nice — it's quite a lot of traffic, and it's a really nice number: scale it across a 64-core machine and it's 400 gigabit, which was Drew's last talk. So we'd be there in terms of transport protocol, but not there in terms of actual useful throughput — but I'm not here for useful stuff, I'm just here for transport protocols.

We do need to wonder why TCP is so much faster than UDP, and a big part of it is that TCP has been the focus of the internet for such a long time. One of the main features enabling TCP's throughput is TCP segmentation offload (TSO). It lets us reduce the number of sends through the user-space-to-kernel API and reduce the trips to the card, by sending big chunks of data plus a description of where the headers should go, and then letting the card do the segmentation. The hardware acceleration really helps here: if we turn TSO off we see a drop of about a gigabit and a half of traffic, but CPU use shoots up to about 97% of a core. That's actually quite good to see — it's reassuring that we can turn some of these features off. There's obviously more in TCP helping it go fast, and it would be interesting to see what we can get at; Gleb Smirnoff suggested we go to a ring for the socket buffers, which is work that's being enabled, so we'll see what happens.
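As a back-of-the-envelope check on those baseline numbers — assuming a standard 1500-byte MTU — the arithmetic works out roughly like this:

```python
# Back-of-the-envelope numbers behind the baseline claims (1500-byte MTU assumed).
line_rate = 10e9                      # 10 GbE

# On the wire each full frame costs 1538 bytes:
# 1500 payload + 14 Ethernet header + 4 FCS + 8 preamble + 12 inter-frame gap.
wire_bytes = 1538
tcp_payload = 1460                    # 1500 - 20 IP - 20 TCP (no options)

max_goodput = line_rate * tcp_payload / wire_bytes
print(f"max TCP goodput on 10GbE: {max_goodput / 1e9:.2f} Gbit/s")   # ~9.49

# If 9.5 Gbit/s costs ~22% of a core, naive linear scaling (which it won't be) gives:
print(f"one saturated core: ~{9.5 / 0.22:.0f} Gbit/s")               # ~43
```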
OK, QUIC. There are many QUIC implementations — and there have been a lot that are now dead. All of the results here come from Fastly's QUIC, quicly. There's a 50-page report you can read if you really want to know why we picked quicly; I picked it out of familiarity, because I wrote that 50-page report about it. In the end it comes down to this: it's written in C, which gives us a more predictable idea of what it's doing compared to Rust or Go or Python; the developers were very active in the QUIC working group — Jana Iyengar was one of the original people in QUIC, one of the editors of the specs, and works at Fastly on this — and because Kazuho, the main developer, is always around, it tracked the QUIC drafts very closely. In the satellite project it was very helpful to be able to track draft releases and see how our changes were making QUIC better. It also has its own test tooling, which is good, because we want to run its test tooling, and as a bonus they've done some performance work that we can copy a bit.

So now we have a QUIC; now we need to figure out how to run QUIC a lot. I wrote a tool called commitrate — because we're going to iterate through commits, and I think I'm funny. It's a distributed tool for comparing software over its development time: it configures and coordinates the start of an experimental run, it runs the tests, and it collects data from each host. It's written in Python using fabric, because I couldn't figure out how to do this quickly enough with anything else. Fabric's awful — don't use it — but there doesn't seem to be anything else; George Neville-Neil suggested something else that he had written, but it seemed much worse, and it's very old. I don't really have a recommendation for how else you'd do this. Maybe don't. Fabric basically just establishes an SSH connection and then runs commands over it, so this is high-speed automation of shell scripts. It works through a list of commits, connects to the client and server machines, does the work of building and running the tests, and gives us a total time — and from the total time we can determine throughput.

We can build and run quicly quite simply; it's a few small instructions, and they're in its Makefile. From August 2017 it has been CMake-based, which is going to help when we're trying to go through its history. We need to pull down a repo and make a worktree — I found worktrees really helpful for having artifacts around when debugging the test infrastructure; it might not be a necessary step, but it's good to have. quicly depends on a couple of git submodules, so we need to pull those in, and then it's cmake, make, generate keys, run tests. Running the tests is quite straightforward — this is the Python I wrote for fabric to run them. The server side has been static pretty much all the way through; on the client side they harmonised some test URLs at one point, so before a certain commit we need a .txt file to be there, but it's quite straightforward. We run one tool, called cli, which runs as both a client and a server; it was the primary test tool during the development of quicly, which helps, because we're trying to see what the developers saw.
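The fabric side is nothing clever. A minimal sketch of the shape of it — with made-up hostnames, paths and cli arguments, since the real harness does more bookkeeping — looks something like this:

```python
# Minimal sketch of the shape of the test driver: fabric opens SSH connections
# and runs commands on the two testbed machines. Hostnames, paths and the cli
# arguments are illustrative, not the real harness.
import time
from fabric import Connection

server = Connection("right.testbed")   # machine under test
client = Connection("left.testbed")    # traffic source

def run_commit(workdir):
    # start the quicly test server in the background on the machine under test
    server.run(f"cd {workdir} && nohup ./cli -c cert.pem -k key.pem "
               f"0.0.0.0 4433 >/dev/null 2>&1 &", hide=True)
    start = time.time()
    # fetch a large object; the total transfer time gives us throughput
    client.run(f"cd {workdir} && ./cli -p /large.txt right.testbed 4433 >/dev/null",
               hide=True)
    return time.time() - start

print(run_commit("/home/tom/quicly-worktrees/abc1234"))
```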
To run thousands of these quickly, we need the whole process itself to run quickly, and that was very hard to do. Builds run really well in parallel — I could do a thousand builds of quicly in half an hour on my testbed machine, so that was fine. What I couldn't do efficiently was make the worktrees: a git worktree seems to take a lock for some reason, and that would just block everything if you tried to do a hundred of them at once. So if you want to build thousands of revisions of a git tree, you might want to figure out some way to avoid that part of the workload.

So how do we figure out which commits we should test? We want to look at the development and the performance improvements of quicly over time. quicly started development in April 2017 and there have been about 2,100 commits since then, which is a ton of activity for one person. They've happened in bursts around IETF meetings, so it's really not fair to ask "how was it in January, February, March?", because there might not be anything that happened in that window and you'd just get the same result. So instead I looked at steps through the commit tree. It has followed basically the entire evolution of QUIC — well, maybe the entire evolution of QUIC, but before a certain point it wasn't usable; it would fall over. It became clear that I needed a couple of things to be able to run tests: I needed CMake to be able to do the build, which takes us to August 2017, and I needed the cli tool, which was there from the very beginning — but the cli tool needed to be stable, and I had some trouble figuring that out, because the errors and failure states of the tool changed over time as it was developed and got more developer-friendly. Thankfully, somebody wrote a tool for iterating over the builds of a program to see how it behaves, so I used that to figure out what I could run. (It was me. I did this.) The first stable commit that I found — through mostly guesswork and pointing at different parts of the tree — is here, with the excellent commit message "typecasting NULL is undefined behaviour" and a Stack Overflow URL. That came in around September 2019, which is actually reasonably recent: it's quite far into the development of QUIC, and QUIC v1 was released in 2021, but it still allowed us to look at 1,100 commits and the last two years of development, and I still thought that was a good place to look. So I wrote another tool, which is just a sort of fancy git rev-parse that lets me pull steps out, and it gave me these 30 commits — because I asked for 30 commits. I don't really expect you to do anything with 30 commit hashes. Then I ran the tool, iterated on it for a while, and pulled out some data.
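That fancy git rev-parse amounts to picking evenly spaced commits between the first stable one and the head of the tree — roughly this shape (the starting hash and the count are placeholders):

```python
# Sketch of the "fancy git rev-parse": pick N roughly evenly spaced commits
# between the first stable commit and HEAD. FIRST_STABLE is a placeholder hash.
import subprocess

FIRST_STABLE = "0123abc"   # placeholder, not the real commit
N = 30

out = subprocess.check_output(
    ["git", "rev-list", "--reverse", f"{FIRST_STABLE}..HEAD"], text=True)
commits = out.split()

step = max(len(commits) // N, 1)
samples = commits[::step][:N]
print("\n".join(samples))
```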
And here's the result. If we look at how quicly's performance changed on Ubuntu 22.04 and FreeBSD 13 over time, we can see that, other than a small aberration at the start of 2020, FreeBSD is doing better than Linux: we're edging out very slightly better throughput, and at some points we're getting a lot better throughput. But this isn't really matching the thesis — interrogating the tool — that we set out with, and I actually find it quite difficult to look at, because the lines are so close together. So — I'm not going to say I invented this metric, because I think it's in the Jain book — I came up with a metric of megabits per second per percentage of CPU: if you're at 100% CPU you divide your measurement by that and get one number, and if you're not at 100% CPU you get a different number. And we're not really seeing a difference in CPU utilisation here either; we're basically pegged at 100% for all of these results. So we're a bit stuck. What should we do?

OK, so I'm lying — I'd found this earlier, on my previous testbed. The results there are a bit slower, and there were other commits that were broken, but there I saw that FreeBSD was faster all the way through. I was really happy about this — I said on the podcast that I had a surprise — and then I ran it on my new testbed and it wasn't as big a surprise. FreeBSD was doing a lot better on the older Opteron hardware; on the newer Ryzen hardware, and I think a newer Ubuntu, it was fairly evenly matched, so you can see the progress in the Linux kernel. I didn't interrogate this, but somebody could. So I asked Kazuho, the developer, about it — Slack has since eaten the conversation, so you're just going to have to believe me that I had it — and he said I should look at GSO: GSO is where all the work is. But don't use the other stuff, that's just for his laptop, because he develops on macOS. So it should be fine.

So what's GSO? GSO is a generalised version of TCP segmentation offload — one of the things that allows TCP to be very fast. It was introduced into the Linux kernel around 2006 — I think that's when the LWN article I found is from, so it might have been a bit later — so it's been around for a long time. Socket data is passed to the kernel in a batch — I think it's passed with ancillary data saying where the headers should go — but it's really quite straightforward. I think you need fixed-size packets: you say "here's a chunk of data" — 64-ish KB is about 44 packets at 1,460 bytes — and it sends it for you, and lets you drop loads of the time you spend sending. NICs can support this — I'm not quite sure; John didn't seem super eager when I mentioned it — and the comment on the quicly commit says "depending on NIC availability", so it might be that Fastly have something we don't. I've not looked. So if we run with GSO, we get to add another line to our plot — now we have a dark purple line to go with the lighter purple line — and GSO gives us a lot. In the older results, where FreeBSD and Linux are fairly matched, we're almost peaking into five gigabits a second, which is quite healthy considering we're doing TLS on top of UDP and the machines can do 6.3 gigabits per second. With GSO we're almost running into six gigabits a second — we're losing, you know, 400 megabits of capacity to the TLS, but we're doing really well. And if we look at the CPU plot, we're doing great there too.
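For reference, the sender-side API this leans on in Linux is the UDP_SEGMENT socket option (Linux 4.18 and later); Python doesn't expose the constant, so it's hard-coded here from linux/udp.h. This is a sketch of the mechanism, not quicly's actual code:

```python
# Sketch of UDP GSO on the Linux sender side: set a segment size on the socket,
# then hand the kernel one big buffer and let it (or the NIC) cut it into
# datagrams. Not quicly's code; the address and sizes are illustrative.
import socket

UDP_SEGMENT = 103   # from linux/udp.h; not exported by Python's socket module

seg = 1460                                   # bytes of payload per datagram
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("192.0.2.1", 4433))
s.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, seg)

payload = b"\x00" * (seg * 44)               # ~64 KB handed over in one call
s.send(payload)                              # kernel emits 44 datagrams of 1460 bytes
```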
On the metric I mentioned, 100 megabits per second per percent of CPU is 10 gigabit at one full CPU — so we're basically there: we're hitting the limits of what the card will give us, and the CPU. We're doing really well, and this could scale. So I think GSO is great and something we should do, and no one is working on it. This has given us a shopping list — it's one item, so it's a quick trip — and if somebody wants to go and do this, it would be really cool. I definitely think it's the area of interest if anybody wants to run QUIC on FreeBSD. There's some other stuff coming too. Linux has generic receive offload, which is basically the same thing in the opposite direction — you get big chunks of data on receive; it's a little bit different, but you get big chunks of data. John Baldwin has work in progress on something called packet batching, which is an optimisation in a similar vein — it's different, but it's an optimisation for UDP traffic, it will extend to other protocols, and it allows the receiver to go faster. TLS for TCP can be offloaded to the kernel, and this is where Netflix get tons of their benefit: offloading it into hardware and cutting the trips to RAM and into the NIC. No one seems to have a product that does this for QUIC and UDP yet, so we're maybe a bit far away from that, but it would be interesting to see if it ever happens. The reason offload is hard is that authentication is tied really deeply into QUIC — or at least that's what I learned from Intel complaining at the IETF.

So, that's the end of my talk. Please go — no, okay, I'd love to take some questions. That's better.

So the question is: did you compare TCP with LRO against QUIC? I did a baseline for the testbed with TSO disabled, but only on FreeBSD — I wanted to be able to answer that question, but I didn't want to spend tons of time interrogating it, because it was one of the things I only started doing when I got to Vienna last week, and the testbed's a bit flaky. You lose the CPU benefits when you turn off TSO, but you still get most of the throughput: you see a reduction — I'm not going to try to quote it exactly, something like a 20% reduction in throughput — but from top, about five times the CPU. So it really is helping for TCP.

How did I account for CPU boosting and clocking behaviour between Linux and FreeBSD? I didn't look at it. It is definitely a factor you would want to control for, but you can get lost in controlling for factors. A better approach is to have a testbed that is documented well enough to be reproducible, because then other people can come, look at this, and build on the work — and it might be found that clocking was a big impact. Great. But it's not one of the things I wanted to do here, because you have to finish at some point.

OK, so the question is: with QUIC enabled, do you think there will still be a need for TCP? TCP is not going anywhere. I mean, if you saw Drew's talk, you can see — it's at the bottom of the ocean right now, it needs to come up, and that's a long way off. And the second part was: do you still see a need for SCTP?
I was on a European project called NEAT for two and a half years, where we looked at making it easier to deploy transport protocols on the internet. It had one major result, which was Apple's Network.framework — well, we happened to coincide with their development timeline. But it's really hard to deploy protocols on the internet. SCTP won't go away — I mean, that's like asking whether people will stop running 386s in factories: yeah, but who knows when. It will just be there, because it's very core to the internet we use. Your phone is using it today, you just don't know — it's hard to be aware of what's there.

Yes? Are you testing with connected sockets or unconnected sockets? So, quicly's architecture is to use a single socket on the client and a single socket on the server, and then it multiplexes out to the handlers for each connection, so for that it cannot be connected. There's room there — with a connected socket you get to skip the connection lookup — but they're not doing it right now, and it would be quite a disruptive change to the code used for running these tests.

Yes? Do you think FreeBSD needs send-multiple and receive-multiple system calls for further improvements? Right now we have them in libc. This would be part of the GSO and GRO work — it's probably the final part, because we have the library and the API support, if not all the way integrated — and I think it's something someone would have to pick up. So yeah, it would be good to have. I'm not saying no — we can do all of this, someone just needs to pay for it. Nobody wants more. Yes, Dave?

How would I see QUIC being deployed inside FreeBSD if it became part of the core OS in future? bapt and I spoke about pulling quicly into libfetch, but I think we're just going to end-of-life libfetch; it might come through that. It's a complicated protocol — the phrase is basically "it's no more complicated than it needed to be", but it's really complicated — so it's quite a lot of code to bring in, and right now it would be of dubious benefit. It would be really cool to have a kernel implementation, especially as there was one written that could be imported into FreeBSD, but no one seems to want this. We might just be behind on this, but no one's running a kernel implementation on Linux that I'm aware of, outside of a lab.

I made my flame graphs and my slides clickable just for this — I spent so much time on it, and then I didn't show a single flame graph. Yeah, these are from Linux. This should be GSO. Anyway, I think that's all my time.