Hey everyone, thanks for joining this session after Marcus's one on HTTP/3. We are going to be attacking the problem one layer down here, talking more specifically about QUIC and how we implemented a QUIC stack in VPP, the Vector Packet Processor. So I'm Aloys, I'm here with Nathan, we are both from Cisco, and Nathan will now start introducing QUIC. Right?

Okay. So first, what is QUIC? I'm not sure everybody's familiar with it. We've been dealing with UDP and TCP for a while, and QUIC is supposed to be the successor of TCP. Some might call it TCP 2, but not everyone agrees. More seriously, how does it play with the stack that we have for now? We've been having HTTP living on top of TLS when it's HTTPS, on top of TCP and the IP layer, or UDP. The idea of the new protocol is to replace the TCP plus TLS layers with a new layer called QUIC, which is hopefully going to live in user space and have some nice features. The purpose of it is also to switch from HTTP/1 and HTTP/2 to HTTP/3, to introduce new features and to address a few of the concerns that we had with HTTP.

So, some nice properties it has: it provides encryption by default, reusing the TLS 1.3 handshake. It is designed to prevent ossification in the network, so pretty much every bit of every packet is encrypted and encoded so that middleboxes cannot take decisions on the packet path and do things with it. There is built-in multiplexing, because that's a very common application requirement and an issue that we had with HTTP: it provides independent streams in each connection to address this. It addresses head-of-line blocking, which was an issue with HTTP/1 that was supposed to be addressed by HTTP/2, and it supports some kind of stream prioritization. So basically it adds a bunch of new features to the older transport layer. Also a very nice thing: it supports mobility.
So when you establish a connection between two peers, on top of the five-tuple you get a connection ID, which you are then able to move across IPs and ports, and that should be fairly seamless for the application.

Just to recap how things work: when a server and a client want to talk, they can open connections, and in every connection they can open different streams, and each stream should be fairly independent from the others.

To recap the pros and cons a bit: the cool thing about it is that it runs on UDP, so it can be implemented out of the kernel and can evolve quite fast, which is a nice thing because it's not yet an IETF standard, and that gets us into the cons. Again, it addresses head-of-line blocking, provides mobility, and provides encryption by default, which is nice as we try to move everything to HTTPS. But on the cons side, it has some complexity, which we've tried to address by implementing it, and for now we don't have a very standardized northbound API, something that we want to address in this talk.

So let's now take a closer look at the code, at the building blocks that we had for this project. We basically need something that a client application can consume, providing an API that's hopefully familiar and easy to use; we need something that can send and receive packets, UDP packets preferably; and we need a QUIC implementation. We are not re-implementing QUIC in this project, we will be using a library that does that. So what did we choose? For the QUIC implementation we used quicly, which is developed by Fastly, makes very few assumptions about how the memory is managed or where the packets come from, and is very modular.
So this was very pleasant to use. For the packet processing we use FD.io VPP, which is the project we are working on at Cisco, and which comes with fast layer 2, 3 and 4 networking, a pluggable session system, and also client libraries that expose the session layer to applications.

Not all of you may know what VPP is, so very quickly: it's an open source, fast, user-space networking data plane. It's very much focused on performance: it uses vector instructions, and we are very careful about cache efficiency in VPP. It is extensible, you can relatively simply write plugins for VPP. It comes with all you would expect from a software data plane: layer 2, layer 3, tunneling protocols, etc. And, more importantly for today, what we call the host stack, which is the layer 4 protocol implementation. So this is going to be our platform for the QUIC stack.

More precisely, the VPP host stack is a generic session layer that exposes layer 4 protocols. The API is socket-like, meaning that it tries to reproduce the Berkeley sockets API, just to make it easier to consume for external applications, even though the internal API is more efficient; in particular, it doesn't require copies to pass data around. Instead we are using FIFOs, where applications can write data and the protocols can consume it, and conversely. So it has an internal API that you can consume in plugins; the external API is exposed through a message queue, but it's pretty much the same API, and it's designed for high performance: we can almost saturate a 40 gig link with one TCP flow (almost, because congestion control is hard), and we can fully saturate a 40 gig link with one UDP flow. And it's built to scale linearly with the number of threads.
So different sessions are always assigned to a thread, and they are completely independent from each other.

For a more visual overview of the host stack: we see the session layer in the middle here, which exposes a standardized API over layer 4 protocols; the specific protocol implementations below it; and then the layer 2 and 3 networking, which we won't focus on too much today. This session layer is consumable either internally, with plugins that live inside the VPP process, or externally, with control events going through the message queue. Control events include connection events like new connection, connection closed, data available in one of the FIFOs, etc.

So now, what are the requirements for an application that uses QUIC? As Nathan said, QUIC is a bit more complicated than your regular layer 4 protocol because it includes multiplexing. That means we have a new type of object that we don't really have in a traditional socket API, which is the connection. So a QUIC app needs to manage listeners; connections, which are basically just shells for streams and which take care of encryption; and the streams themselves, where we can send and receive data.

One of the first challenges we had is that we needed to integrate into a socket-like API, and that didn't fully map to QUIC. So what we did was: first, for the listener, nothing changes, you just listen on a UDP port, which is what you always do. Now, things get trickier when you connect. When the client connects, it connects to an endpoint, an IP and port, and it receives a new connection object, conceptually. The server accepts that connection and also receives a connection object. With those connection objects, both the client and the server can open streams. In order to open a stream, we modified the connect call a bit, to be able to pass a reference to a connection. And both the client and the server also need to accept streams.
On the client, just like you would accept on a listener, you can accept on a connection. This is a bit weird, but we think it allows us to closely map the QUIC concepts to what you would expect from a socket API. The connection sockets are only used to manage streams, to accept and connect them; they cannot send and receive data. For that you need actual streams. Streams can also be unidirectional or bidirectional in QUIC, meaning that you won't be able to send data on certain streams you accept, because unidirectional streams are always opened by the peer who sends data on them. But this doesn't change much: once you have a stream, you can just call send and receive on it.

Now, back to Nathan for how we actually built that in VPP. Yeah, so now that we have the structure and the sockets that we want to expose, the idea is to leverage VPP, because that's the project we work on, to add this QUIC stack. Coming back to this session layer that lives on top of UDP that we're going to use: the first thing that we did is actually build a second layer, that is, replicate that session layer and add a QUIC protocol on top of the UDP and session layer.
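To make the mapping concrete, here is a minimal sketch of that call flow from an application's point of view. This is illustrative Python with made-up names, not the actual VPP host-stack API; it only shows the shape of the socket-like mapping described above: listeners accept connections, connections accept (and open) streams, and only streams carry data.

```python
# Illustrative sketch of the socket-like QUIC API mapping described in the
# talk. All names are hypothetical; the real VPP host-stack API differs.

class Stream:
    """Only streams can send and receive data."""
    def __init__(self):
        self.buf = []
    def send(self, data):
        self.buf.append(data)
    def recv(self):
        return self.buf.pop(0)

class Connection:
    """A QUIC connection is just a shell for streams: it handles
    encryption and multiplexing, but carries no data itself."""
    def __init__(self):
        self.pending_streams = []
    def connect_stream(self):
        # Opening a stream reuses the connect() idiom, but takes a
        # reference to a connection instead of an IP/port endpoint.
        s = Stream()
        self.pending_streams.append(s)   # the peer will accept it
        return s
    def accept(self):
        # Accepting on a *connection* yields a stream, the same way
        # accepting on a listener yields a connection.
        return self.pending_streams.pop(0)

class Listener:
    """Listening does not change: bind a UDP port as usual."""
    def __init__(self, port):
        self.port = port
        self.pending = []
    def accept(self):
        return self.pending.pop(0)

# Client side: connect to an endpoint, get a connection, open a stream.
listener = Listener(4433)
conn = Connection()
listener.pending.append(conn)        # stand-in for the QUIC handshake

server_conn = listener.accept()      # server accepts the connection
stream = conn.connect_stream()       # client opens a stream on it
peer_stream = server_conn.accept()   # server accepts the stream

stream.send(b"hello")
print(peer_stream.recv())            # b'hello'
```

In this toy loopback model the client and server share the same objects; the point is only the call shapes: accept on a listener returns a connection, accept (or connect) on a connection returns a stream.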
So basically we built a plugin inside of VPP that acts as an internal application and consumes the session layer as a normal UDP listener would, and then those sessions can be exposed to an external application through the message queue and control events. We replicate the FIFOs, we replicate the data structures, but at first that's for the sake of easiness and pluggability inside VPP.

Zooming in a bit inside this QUIC brick: what we are going to play with is a northbound interface that takes the stream data from the FIFOs that are exposed to the client application and passes it to the quicly library from Fastly that we are using internally. The picotls brick inside of it is the crypto backend; as we mentioned earlier, quicly is very nice because it exposes a lot of callbacks and pluggable interfaces that allow us to modify and change things, and that's the crypto API we're going to talk about later. It also provides southbound callbacks that copy the encrypted buffers into the UDP sessions' buffers in VPP, which are then forwarded to the wire.

This model is quite interesting because it provides three different consumption models for a client application. You can write an external app that's independent, using the socket API we described. You can also write an internal app, because VPP provides a large range of plugins that integrate directly with the internal API, provided as standard C functions inside the code base. But you can also integrate directly with the interface provided by quicly. That allows us to be very modular and to adapt to different use cases. For example, if you want to build, let's say, a forwarder or a VPN termination point, you're maybe going to write an internal application that just does the forwarding and the mapping between, let's say, TCP and QUIC. But if you want to adapt iperf to use QUIC as its transport, you're maybe going to stick
with an external application using the regular sockets we defined. And we can also adapt if you have an application using the quicly library directly, via those callbacks.

Now, if we follow a packet coming in, to dive a bit more inside the architecture: what happens is that the packet gets copied from the wire into the UDP session's RX FIFO. That triggers a session event on which quicly gets called. The packet is then partially decoded; because of the ossification protections, everything is encrypted, so we match it against the connection table and get a stream, we decrypt the packet and do the crypto processing, and then deliver it either directly to quicly or to the QUIC stream FIFO, which acts as the application buffer. This decrypted payload can then be consumed either by an internal client, or by an external client via the message queue.

This brings a couple of issues. As those FIFOs are fixed size, what happens if we exceed the amount of data we are able to store? That doesn't matter that much for the UDP session, because that buffer is merely temporary.
We don't really store data there; it's just to pass data from the wire to the packet decoding node. But the stream FIFO can make things break. That's due to the nature of the QUIC protocol: the issue is that before the packet is decrypted, we have no way to know which stream it's going to contain data for. In QUIC, data can be coalesced between packets, and packets can also contain control frames; everything is encrypted to prevent ossification. And once the packet is decrypted, the quicly library does not allow us to drop it, because you would have to revert all the control frames that you processed earlier. So if you drop the data, it will never be retransmitted, which is quite sad for that type of protocol.

Fortunately, we have a connection parameter called max_stream_data, which communicates between the peers the maximum amount of data that can be sent unacknowledged on a stream. That allows us to control this per stream, setting it to the maximum FIFO size. And that shows a bit the maturity of QUIC, because it has several other connection-level settings in that vein, to control the maximum number of streams, the maximum number of unidirectional streams, and the total amount of unacknowledged data for the connection. That allows us to handle those kinds of tricky cases on the RX side of the implementation.
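The max_stream_data mechanism above can be shown with a toy model (illustrative Python, not quicly's actual API): the receiver advertises a window equal to the free space in its stream FIFO, so the sender can never have more bytes in flight than the FIFO can absorb, and the FIFO can never overflow.

```python
# Toy model of QUIC stream-level flow control (max_stream_data) used to
# protect a fixed-size stream FIFO. Helper names are hypothetical.

FIFO_SIZE = 16  # bytes of stream FIFO owned by the receiver

class Receiver:
    def __init__(self):
        self.fifo = bytearray()
    def window(self):
        # Advertise exactly the free FIFO space: the peer may not have
        # more than this many unacknowledged bytes outstanding.
        return FIFO_SIZE - len(self.fifo)
    def deliver(self, data):
        # With the window set to the free space, this can never overflow.
        assert len(self.fifo) + len(data) <= FIFO_SIZE
        self.fifo.extend(data)
    def app_read(self, n):
        taken = bytes(self.fifo[:n])
        del self.fifo[:n]
        return taken

def send(receiver, payload):
    """Send as much of payload as the advertised window allows."""
    allowed = payload[:receiver.window()]
    receiver.deliver(allowed)
    return len(allowed)

rx = Receiver()
print(send(rx, b"x" * 100))   # only 16 bytes fit into the FIFO
rx.app_read(10)               # the application drains 10 bytes
print(send(rx, b"y" * 100))   # the window reopens: 10 more bytes go out
```

Setting the advertised window to the FIFO size is exactly the trick described in the talk: the sender is throttled before the packet is ever built, so nothing decrypted ever has to be dropped.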
That's the implementation Sets on our way So after or X then we want after having received the packets we know when the TX one And so you probably noticed the TX pass is really similar to the RX pass is just the same graph But in the opposite direction basically so the data can come either from external apps or internal apps In which case it will be pushed into the quick stream session TX 5.0 Which will trigger an event which will called quickly or it can come from quickly app using directly quickly and Well telling it to generate new packets and send the data So just one thing to note here is that The six the 5.0s are by stream, but quickly generates packets for the entire connection So you may have data pushed from different streams You just need to call quickly wants to generate packets for the entire connection Because quickly will just well it will call its internal scheduler to select data from the different streams according to our priority Etc. Etc. And generate packets from them and creep them Put them into the UDP session TX 5.0 where they will be sent on their way by VPP There are also a couple issues on the TX pass that we need to take care of In particular the back pressure to how we make sure we don't Encrypt packets that we actually don't have space for in the UDP TX 5.0 to send them Which would be a bit sad a big waste of cycles Or how the apps know that they need to wait before sending more data so fortunately on the on the lower side it's pretty simple for UDP because when we tell quickly to send packets we can actually tell it to send a certain amount of packets and Based on the available space in the UDP 5.0 and the MTU We can just limit the amount of packets generated to make sure they always fit And then from the application side the back pressure just goes comes from the 5.0s Where quickly will stop taking data from the stream 5.0s and the applications need to check the space available in the 5.0s and They will see that they don't have space 
left to send data, so they will naturally stop sending data. Another question is how the applications know that quicly has sent more packets and they are now able to send more data. There is a host stack feature to trigger notifications to an application when space becomes available in one of its FIFOs, so that works pretty well.

Another important thing that we have only touched on a bit so far is the threading model. First, generally, VPP runs either with one thread, or with one main thread plus several worker threads, and as we said earlier, sessions are always pinned to one specific thread. The thing is, we don't get to choose which thread a session gets pinned to; that depends on RSS, meaning the NIC will receive a packet, assign it to a queue depending on its five-tuple hash, and VPP will get it on the specific thread which manages that queue. That's a bit of a problem for UDP, because when we open a connection, we will first send a packet from one thread, and the reply may come on another thread; we can't know in advance. In order to handle that, the host stack has a session migration concept, where the application managing UDP connections, in that case the QUIC protocol implementation, gets a notification when the migration happens and can handle things correctly.

For now, the QUIC sessions are opened only when the QUIC handshake completes, which means that packets have been exchanged in both directions, so the UDP session is already pinned on the right thread and we just open the QUIC session on the same thread. And that works fine as long as there are no mobility events, because mobility events change the five-tuple, so they can result in the connection conceptually migrating to a new thread.
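Going back to the TX-side back pressure for a moment, the packet budget handed to the QUIC library is simple arithmetic. This sketch (illustrative names, not the actual VPP code) caps the number of generated packets at the free space in the UDP TX FIFO divided by the MTU, so no packet is ever encrypted that cannot be sent.

```python
# Sketch of the TX-side packet budget described in the talk: before asking
# the QUIC library to generate packets, cap the count so every encrypted
# packet is guaranteed to fit in the UDP TX FIFO. Names are illustrative.

def packet_budget(udp_tx_free_bytes, mtu):
    """Max packets we may let the QUIC library generate right now."""
    return udp_tx_free_bytes // mtu

def generate_packets(pending_bytes, udp_tx_free_bytes, mtu):
    """Generate at most `budget` packets; returns the packet sizes."""
    budget = packet_budget(udp_tx_free_bytes, mtu)
    packets = []
    while pending_bytes > 0 and len(packets) < budget:
        size = min(mtu, pending_bytes)
        packets.append(size)
        pending_bytes -= size
    return packets

# 10 KB of free FIFO space and a 1500-byte MTU allow at most 6 packets,
# even though 20 KB of stream data is pending; the rest waits for the
# FIFO to drain, so no encryption cycles are ever wasted.
pkts = generate_packets(pending_bytes=20000, udp_tx_free_bytes=10000, mtu=1500)
print(len(pkts), sum(pkts))   # 6 9000
```

The real quicly call takes a maximum packet count in the same spirit; the point is that the limit is computed from FIFO space and MTU before any encryption happens.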
This will basically be a new UDP session for VPP, which we need to match to an existing QUIC connection. That's not yet supported; it will be soon, though. And then just a note on QUIC streams: the stream sessions are always placed on the thread on which their connection lives.

But now the important question: yes, how quick is it? We did some benchmarks, obviously, to see how well we were performing. The issue is that for now there is no canonical QUIC performance assessment tool; there are a bunch of tools being developed, but nothing standard. So what we did is develop a custom iperf-like client and server tool that consumes the VPP message queue through shared memory, so it attaches to VPP as an application. The basic setup is: the client opens N connections, on each of those connections it opens N streams, and then it sends a certain amount of bytes on each and every one of those streams. Once the data has been received, a close notification is sent back on the streams, and when everything is closed in a connection, a connection close event is sent, and everything should shut down nicely. So that allows us to test correctness, but also speed.

The setup we've been playing with consists of two different machines connected back-to-back with a 40 gig link on Intel XL710 NICs. On each one there is a VPP running, with the test application attached to it. We do pin the cores, run VPP and the test app on the same NUMA node, and use a 1500-byte MTU; those are 3.2 GHz CPUs. The first result we got, for the first benchmark we did, with one worker, 10 connections and ten streams each, so 100 streams, is approximately 3.5 gigabits per second for one worker. And we're seeing that this scales quite linearly with the number of workers.
So we're getting up to 14 gigabits per second on four workers. We didn't do much more exploration there, but we're seeing some kind of scaling. We did some testing going up to 100k streams per core, with variations in the number of connections and streams per connection, and the handshake rate is about 1500 connections per second, limited mostly by the message queue on the connect interface.

But that wasn't very satisfying at first; we said to ourselves, maybe we could improve that, because there were a bunch of issues and things that we could improve in our implementation. First, quicly in general uses a lib called picotls, which is also developed by the same team at Fastly, to do the TLS handshake and the packet encryption and decryption. As the API for this is pluggable, and as VPP now has a native crypto API, we tried to use the VPP crypto API inside of quicly in order to yield better results. We also tried to do some batching of the crypto operations in order to improve speed, and that proved to work a bit. Basically, what we do is, instead of receiving one packet, decrypting it and forwarding it, we stack N packets, decrypt them at the same time, and then pass them to the protocol processing linearly. That's basically the same idea as VPP itself: VPP stands for Vector Packet Processing, and we basically build vectors inside of this session layer. We applied the same idea to the TX path, and that proved to improve performance as well.

And the last thing that we've been working on is the congestion control. The default one is Reno, which doesn't really
We tweaked it a bit by changing the beta factor Because we didn't have time to implement new ones, but fortunately, it's also a pluggable so we We can we'll be able to to extend that in the future and the results we're getting with this is an improvement of 30% Approximately going up to 4.5 gigabit per second on one worker with the batching and the native crypto And we're seeing that most of the cycles are spent in the in the CPU doing in the in the crypto path and We hope that we'll accelerate with with the newer instruction in Intel Intel is like CPUs So to the to the future. Yeah, so what's next? Of course, as you've seen this is still a work in progress as I still some work to do on performance optimization one thing we want to work on is hardware offloads because the crypto batching that Nathan just mentioned Actually uses the quickly offload API meaning that We process it we can't well currently we process it with VPPs crypto API in batches, but it will be also possible to send the packets to Crypto card for further processing and to free CPU resources We won't support mobility. We'll have continuous performance benchmarking available soon publicly and the the sysit platform Which is the generic VPP continuous performance benchmarking And of course all of this is open source. So if you want to try it out or get involved feel free to to check out the code The use cases for this quick stack now because we don't do this just for fun. 
Although it is fun. Of course, HTTP/3 servers are probably the most common use case for QUIC, but there is also gRPC over QUIC: gRPC can absolutely use a QUIC transport, and it's really well suited because of the built-in multiplexing. And there are also more network-oriented use cases, like QUIC VPNs, which are similar to SSL VPNs but better, because QUIC supports mobility, and using one stream per flow allows you to get rid of the head-of-line blocking that you get in an SSL VPN in case of packet loss. And it's not harder to deploy, because you just need a certificate for your server and an authentication mechanism for your clients.

A funny trick we were experimenting with for the QUIC VPNs was to transparently terminate the TCP connections at the VPN gateway, meaning that the VPN gateway will actually hijack the connection and only send the TCP payload, the TCP data stream, into the QUIC stream, instead of encapsulating the entire packets. This has the nice property of getting rid of the nested congestion control issues, but we are also getting into dangerous middlebox territory here.

Now for the main takeaways of this project. First, we really want to thank the quicly guys, because we had a great experience with the library; it proved to be easy to use, pluggable, and very flexible.
So I highly recommend it if you want to play with QUIC. We also now have an easy-to-use, socket-like API for QUIC in VPP, and the VPP host stack proved to be extensible enough for new protocols which don't really have the same concepts as the existing ones, which was nice. Using the VPP framework also allowed us to quite easily get a good performance boost for QUIC: we got a 30% performance improvement just by applying the native crypto and the vector processing concepts to our QUIC implementation. But of course, there's still a bit more work required to reach the maximum level of performance; we are still a bit slower than TLS can be, but we hope we can reach similar levels of performance in the future. Thank you very much for listening. Do you have any questions?

So the question was whether there are other libraries similar to quicly. Yes, there are quite a few open source QUIC implementations on GitHub. There is in particular ngtcp2, from the nghttp2 developers. There is the Chromium implementation, which implements Google QUIC, a slightly different protocol from the IETF QUIC that is currently being standardized. And there are quite a few others. But we found that most of them either were not implemented in plain C, which was a problem for integration with VPP, or made some assumptions about what the network looked like, how they gathered packets, and so on. That's why we chose quicly.

So the question was how a QUIC VPN compares with WireGuard. I am not familiar enough with WireGuard to answer that question, I'm afraid. Do you want to? Well, do you want to explain what WireGuard is? Okay.

Right, so the question was: given that most of the time is spent in the crypto code, what is the performance benefit gained from using VPP?
So it's quite simple, actually: VPP has a really efficient crypto API which, compared to the default picotls OpenSSL layer, gave us a big performance boost by minimizing copies and increasing the batching abilities, the cache efficiency, and so on. That's mostly it; the UDP stack also helps, as it's a bit faster in VPP than in Linux, but it's mostly the crypto.

[audience question, partly inaudible] Great point. I think it's up to us. Well, sorry, for the viewers on the camera: the question was whether QUIC has a concept similar to the kernel's TCP small queues to limit end-to-end latency, and I think it would be up to us to implement that, but that's a great point. Yeah, so in order to limit that buffering we would need to explicitly limit the FIFO sizes or the amount of data queued.

So the question was which cards we used for the testing, and whether we used cards that have offload capabilities for QUIC. We used Intel XL710 cards for the tests. We know that these cards have some of those properties, but we didn't use any of them; we focused only on software processing. Offloading is an interesting topic: there are some offloads that are possible inside the card, or with dedicated devices further in the processing pipeline, meaning once a packet has been received we can also offload the crypto, for instance. There is both the inline offload, where the card does the processing and then passes the pre-processed packet to the stack, and what we would call the out-of-band offload, where we do some processing in software and then send the packets to a card for the crypto. Both these models exist and both are interesting; we think the out-of-band one is more flexible and easier to use, but the inline one may provide better performance.

So the question was whether we chose
quicly because it was in C, and whether we needed that for compatibility, if I understood correctly. And the answer is: it was part of the criteria. It was not the main one, but ease of compilation and interaction between VPP and quicly was a concern, for ease of coding. It's not a hard requirement, but it makes lots of things simpler. Thank you.