My name is Ray Kinsella. I've been working on data plane technologies at Intel for the last eight or ten years: DPDK, VPP, and now TLDK, to name a few. Today I'm going to talk about TLDK and why we're interested in accelerating TCP.

So why does TCP performance matter? TCP performance is very heavily a part of user experience. In the wireless network, TCP dominates: it makes up about 95% of traffic in the cellular network, about 4% is UDP, and the rest is in the noise. TCP flows tend to be small, which is problematic. Because TCP is stateful, you go to all the effort of setting up a flow, only for it to transfer about four kilobytes (that's literally the size of the vast majority of flows) and then get torn down again. Connection setup and teardown are relatively expensive, yet most flows actually transfer very, very little data.

There are also other artifacts in TCP, parts of the spec, that make small flows even more problematic. Things like TCP slow start: you may not be familiar with this, but TCP starts off slowly. It sends one packet and waits for an ACK, then sends two packets and waits for ACKs, and so on. That's fine if you have a very long-lived connection, but on a short connection it makes the flow even more expensive.

There's a similar problem in the wireline network, and I found this one pretty interesting: peer-to-peer is used on a relatively small number of broadband connections. About 97% of connections don't use peer-to-peer; only about 3% do. Maybe because all my friends use peer-to-peer I assumed everybody did, but apparently not. And on the 97% of connections that don't use peer-to-peer, 70-plus percent of the traffic is, again, HTTP over TCP. So TCP dominates in the wireline network as well. Okay, so it's a huge part of the equation in both wireless and wireline networks.

This is where the problems start to creep in. Wireless is a relatively recent advent compared to TCP, and wireless is relatively lossy: your cell phone will be connected, then not connected, and that's perfectly normal. But from TCP's point of view it looks like lost packets, so things like congestion control and congestion avoidance kick in, even though it's perfectly normal for your cell phone to go away for a moment. In the wireline network you have things like impedance mismatch: you might have 100 gig coming into a box but only a 100-meg DSL line going out, and then you run into problems like bufferbloat, where your bulk downloads crowd out your web browsing. Say you're video streaming at the same time as you're browsing the web; it's common these days for people to be flicking around on Twitter while watching a TV show on Netflix. How well Twitter responds is a key part of UX, and without intervention the bulk download will cause a very bad user experience for the person flicking around on Twitter. So this has given rise to the advent of user-space TCP stacks.
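To make the slow-start cost mentioned above concrete, here is a minimal sketch (my own illustration, not anything from the talk or from TLDK) that counts how many round trips classic slow start needs to move a small flow, assuming a 1460-byte MSS and no losses:

```c
#include <stdio.h>

/* Rough model of TCP slow start: the congestion window starts at one
 * segment and doubles every round trip until the flow completes.
 * Assumed parameters (not from the talk): MSS = 1460 bytes, no loss. */
int main(void)
{
    const long mss = 1460;
    const long flow_bytes = 4 * 1024;  /* the ~4 KB flow from the talk */
    long sent = 0;
    long cwnd = 1;                     /* congestion window, in segments */
    int rtts = 0;

    while (sent < flow_bytes) {
        sent += cwnd * mss;            /* one window sent per round trip */
        cwnd *= 2;                     /* ACKs double the window */
        rtts++;
    }

    /* The three-way handshake costs at least one more RTT before any
     * data moves, plus the teardown afterwards. */
    printf("%ld bytes take %d data RTTs, plus setup and teardown\n",
           flow_bytes, rtts);
    return 0;
}
```

For a 4 KB flow this prints two data RTTs; with the handshake and teardown added, most of the connection's lifetime is overhead rather than data transfer, which is the point being made.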
What you typically see is that people developing applications in the data center often develop TCP stacks to go alongside their applications. One paper recently published by Jerry Chu from Google commented that they had something like half a dozen to a dozen individual TCP stacks inside Google, simply because people writing an application want access to a new TCP option, so they write their own TCP stack to go alongside the application. This has become a normal pattern, and there are lots of examples of it now.

A very typical design pattern you'll see in this area is to take an operating system stack (the NetBSD stack, the FreeBSD stack, or, in Jerry Chu's case, the Linux kernel) and run it on top of a fast packet I/O technology like netmap, DPDK, or OpenFastPath. So you have your fast I/O, and you run a well-known kernel stack on top of it.

As a design approach, that has some nice advantages. One is broad RFC compliance: these kernels typically implement a fairly broad set of RFCs, which is pretty nice. Another is that you get a BSD sockets API, so if I'm taking an off-the-shelf application like nginx, everybody's favorite example application, the kernel stack gives me a BSD sockets API and it's fairly easy to glue the two together. There's also total cost of ownership: if I'm using a FreeBSD or NetBSD or Linux stack on top of DPDK or netmap, then as security problems emerge upstream I get CVE patches, I get performance fixes, the stack is maintained. So it's a relatively attractive way to implement a user-space TCP stack.

That approach also comes with costs. Those kernel stacks assume they're actually running in the kernel, so you have to jump through some hoops to get them to run and behave themselves in user space. There's a lot of work being done in this area, but kernel threads make assumptions, for example that they're running in the kernel and won't be running in the same context as the user-space thread that made the system call they're reacting to. So you have to jump through hoops to get these stacks to behave in user space.

Going back a step, what has really given rise to user-space TCP stacks is the prioritization of performance and optimization. In user space you have the ability to create an optimized TCP stack, tuned for exactly the things people typically evaluate these stacks on: connections per second, throughput, request-response latency. You can do things like optimize for core locality, so the user-space application that processes the byte stream runs on the same core on which the byte stream was received, which lets you do things like eliminate context switching.
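To make the BSD sockets compatibility point concrete, this is the familiar lifecycle those ported kernel stacks preserve, in plain POSIX sockets (error handling omitted for brevity; the port number is arbitrary):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* The classic BSD sockets lifecycle: socket, bind, listen, accept.
     * A kernel stack ported to user space keeps exactly this API, which
     * is why gluing on an off-the-shelf application is straightforward. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);

    int conn = accept(fd, NULL, NULL);   /* blocks until a client arrives */
    char buf[4096];
    read(conn, buf, sizeof(buf));        /* blocks again: one context
                                          * switch per wakeup, as the
                                          * next section discusses */
    close(conn);
    close(fd);
    return 0;
}
```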
On that context switching: if I'm using the kernel, I'll typically do a select and wait for a packet to arrive (one context switch), then do a read when the packet does arrive (another context switch), and then, if I find I haven't actually received enough data to do useful work, I do yet another context switch to select again and wait for more information to arrive. That's a lot of context switching.

So what is TLDK? TLDK is a high-performance layer 4 implementation built on top of DPDK. It's a complete ground-up design. So the bottom three points I talked about on the last slide, the API compatibility and the reuse of a well-known stack with broad RFC compliance, are benefits we don't get straight off. What we are doing is a complete ground-up design aiming for the highest possible performance we can achieve on the platform. That means we leave some things on the table: you won't get a BSD sockets-compatible API, and you won't get the same level of broad RFC compliance. But what you will get is the fastest possible TCP implementation that you can build on a general-purpose CPU. That's really what we're aiming to create.

In creating TLDK we reuse DPDK design concepts. We do batch processing, which improves IPC. We do things like loop unrolling, and we use vector instructions where we can. Batch processing keeps the instruction cache warm: instead of pushing one packet through the stack, we process multiple packets at the same time. We aim for cache coherency, so that there's an affinity between a given network device, the stack instance that processes that device, and the application that reads the TCP stream received on that device; an end-to-end affinity between the network device and the application that processes the TCP stream. By not transferring the TCP stream between cores, we eliminate things like cache lines bouncing and snooping between cores, so it's much more efficient. And we eliminate mode switching, obviously, because everything is in user space at this point anyway.

What we're aiming to support: UDP and TCP at this point, with passive and active connections; that's basically the client/server model and the bump-in-the-wire model I'll talk about in a while. One of the use cases I'm very interested in proving TLDK out on is bump-in-the-wire performance. We support the common TCP options (MSS, timestamps, selective acknowledgments) and the common features. We have SYN-flood protection based on SYN cookies, and we'll look at other DDoS mitigation mechanisms later; if you don't need it, say because everything is running inside your data center behind an IDS anyway, you might want to turn it off. We support things like delayed acknowledgments, and we support congestion control. We also support the common hardware offloads: you might choose to use RSS, you might choose to use TSO, you might choose to use SYN filtering. We use all of these to make performance as fast as possible.
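That batch-oriented, never-blocking receive loop is the heart of the DPDK style being borrowed here. A hedged sketch follows, using DPDK's real rte_eth_rx_burst() but with the layer 3/TLDK hand-off reduced to a hypothetical placeholder; it assumes the usual rte_eal_init() and port/queue setup has already run:

```c
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Placeholder for the IPv4/IPv6 + TLDK hand-off. Hypothetical: here it
 * just frees the packets so the sketch is self-contained. */
static void process_l4_burst(struct rte_mbuf **pkts, uint16_t n)
{
    for (uint16_t i = 0; i < n; i++)
        rte_pktmbuf_free(pkts[i]);
}

/* The never-blocking, batch-oriented receive loop: pulling packets in
 * bursts keeps the instruction cache warm across the whole batch. */
static void lcore_rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *burst[BURST_SIZE];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port_id, queue_id,
                                      burst, BURST_SIZE);
        if (n == 0)
            continue;        /* poll again; never sleep, never block */

        /* Push the whole batch through the stack in one go. */
        process_l4_burst(burst, n);
    }
}
```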
We're also going to provide code examples. One of the nice things you get with DPDK is that when you download it, there's some nice sample code (I've written some of it myself) showing different use cases like L2 forwarding and L3 forwarding. We're going to do the same with TLDK: we'll have a samples directory showing off use cases like transparent proxy, and I'll talk about some of those use cases later.

TLDK provides a socket-like API. Some of the API calls will look familiar to you, or at least their naming will, but in the parameters the differences start to creep in.

There are three core design concepts in TLDK that I want to jump into. Oh, sorry, one thing before we go there. One of the ways people talk about this is that we're turning the network stack upside down: what we're really trying to do is have the application drive the protocol, rather than the protocol drive the application. What you typically see with TCP applications is: I sit and wait on a socket, and when something arrives on the socket I wake up and do something; when I haven't received enough information, I go back to sleep; next time more packets arrive, I wake up again. The design philosophy of TLDK is that we continue to do useful work while there's useful work to do. We do batch processing: we batch-read packets off the network device, then iterate through streams, then iterate through the packets received on each stream. We continue to poll and continue to work in the DPDK style, the batch-processing style, and we never block. In that way the application is driving the protocol, not the protocol driving the application.

So, the three core design concepts. The first is the context. A context is really an individual instance of the stack. You'll typically have one context per core; you can have more than one context per core, that's okay too, but there is an affinity between a context and a given core.

The second is the device. Remember, all we're doing is a layer 4 implementation; we're not doing layer 2, we're not doing layer 3, you'll get those from somewhere else, and in many cases people have their own on top of DPDK that they want to keep using. TLDK has a notion of there being a device underneath, and that device has capabilities: it may support IPv4, it may support IPv6, it may support certain hardware offloads. So TLDK understands that some devices support things like TSO and can take advantage of them, but it's not tightly coupled to any particular implementation; the loose coupling is achieved through the TLE device abstraction.

The last concept is the TLE stream, which is essentially your TCP or UDP stream (that's what the little asterisk is). It's really your L4 endpoint: it describes a source and destination IP address and a source and destination port. Ordinarily you would process a stream on a given core: for the best performance you want that end-to-end tight coupling between a device, a context, and a stream on a given core.
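As a rough illustration of how those three objects fit together, here is a sketch in the shape of TLDK's API. The type names, struct fields, enum values, and signatures below are written from memory of the early TLDK headers and should be treated as assumptions, not a verified interface:

```c
#include <string.h>
#include <tle_ctx.h>   /* TLDK layer 4 public headers; paths approximate */
#include <tle_tcp.h>

/* A hedged sketch of the three TLDK design concepts. Everything here
 * is an assumption about the API shape, not verified against the tree. */
static struct tle_stream *setup_l4(void)
{
    /* 1. Context: an instance of the stack, typically one per core. */
    struct tle_ctx_param ctx_prm;
    memset(&ctx_prm, 0, sizeof(ctx_prm));
    ctx_prm.socket_id = 0;              /* NUMA node for allocations */
    ctx_prm.proto = TLE_PROTO_TCP;      /* assumed enum value */
    ctx_prm.max_streams = 1024;
    struct tle_ctx *ctx = tle_ctx_create(&ctx_prm);

    /* 2. Device: an abstraction of whatever L2/L3 sits underneath,
     * advertising capabilities such as IPv4/IPv6 and offloads (TSO). */
    struct tle_dev_param dev_prm;
    memset(&dev_prm, 0, sizeof(dev_prm));
    /* ... local addresses and RX/TX offload capability flags ... */
    struct tle_dev *dev = tle_add_dev(ctx, &dev_prm);
    (void)dev;

    /* 3. Stream: the L4 endpoint, i.e. local/remote address and port. */
    struct tle_tcp_stream_param strm_prm;
    memset(&strm_prm, 0, sizeof(strm_prm));
    /* ... fill in local/remote sockaddrs and completion events ... */
    return tle_tcp_stream_open(ctx, &strm_prm);
}
```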
The stream API itself, though, is thread-safe. So if you did want to read and write a stream from another core, that's actually perfectly okay to do.

TLDK has this notion of a front-end and a back-end; you can almost think of TLDK as the middle of a sandwich. The front-end is the API you write your application against, and the back-end is where your layer 2 and layer 3 implementation exists. On the back-end side, at the very lowest level you'll have DPDK or some other fast packet I/O technology. It receives the packet, pushes it through your IPv4 or IPv6 implementation, and then you do a bulk RX into TLDK, a bulk push of packets into TLDK. TLDK then takes care of sorting those packets out onto the various streams.

Then in your front-end you go and ask: are there any SYN packets for me to process? Okay, there are; I accept the new connections I want to accept and reject the ones I want to reject. Are there any reset packets I need to process? Yes; I go and close those connections, close those streams. Again, the APIs have the kind of semantics you'd be familiar with, like send and receive, read and write. The read and write APIs support I/O vectors, which you'd also be familiar with, as do send and receive. And as I said, the stream API is thread-safe. The most optimal setup, on the left-hand side of the slide, has the back-end, the TLDK context, and your application all running on a given core; but we also support the right-hand side, where the back-end and your context run on one core while your front-end reads and writes the streams from other cores.

This was the bit where I was supposed to talk about both UDP and TCP performance on top of TLDK. As will typically happen with these things, there was a whole drag around the Ixia licensing costs for the TCP benchmarking, and then there was also an unhappy incident involving an IRF gun, which delayed the whole thing. So I only have UDP performance numbers today, but I'll say a word about TCP performance in a moment. UDP performance is about 7 million packets per second per core, and the thing to note (this is really what we're in this for) is that we're linearly scalable: as you add cores you get 14 million, then 21 million packets per second. As UDP applications typically do, the more cores you give it, the more linear the scaling in performance. Our initial target for TCP is about half a million to one million connections per second per core, and I have numbers to suggest we're trending positive on that. That means you'll be able to set up and tear down half a million to a million connections per second per core, which is pretty exceptional performance.
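Pulling the sandwich together, one iteration of a combined back-end/front-end loop might look like the sketch below. The tle_tcp_rx_bulk() and tle_tcp_stream_accept() calls follow the bulk-API naming pattern described in the talk but are assumptions about the exact signatures, and my_l2l3_rx() and serve_stream() are purely hypothetical application helpers:

```c
#include <stdint.h>
#include <rte_mbuf.h>
#include <tle_tcp.h>

#define BURST 32

/* Hypothetical helpers standing in for the application's own L2/L3
 * layer and its stream-servicing logic. */
static uint16_t my_l2l3_rx(struct tle_dev *dev,
                           struct rte_mbuf **pkts, uint16_t max);
static void serve_stream(struct tle_stream *s);

/* One iteration of the sandwich: bulk RX into TLDK below, stream-level
 * work above. All tle_* signatures are assumptions, not verified. */
static void lcore_iteration(struct tle_dev *dev, struct tle_stream *listener)
{
    struct rte_mbuf *pkts[BURST], *rejected[BURST];
    int32_t rc[BURST];

    /* Back-end: pull a burst via the L2/L3 layer, then bulk-push it
     * into TLDK, which sorts the packets onto their streams. */
    uint16_t n = my_l2l3_rx(dev, pkts, BURST);
    tle_tcp_rx_bulk(dev, pkts, rejected, rc, n);

    /* Front-end: harvest completed handshakes from the listener ... */
    struct tle_stream *new_streams[BURST];
    uint32_t accepted = tle_tcp_stream_accept(listener, new_streams, BURST);

    /* ... then iterate the streams, doing useful work while there is
     * useful work to do; never block. */
    for (uint32_t i = 0; i != accepted; i++)
        serve_stream(new_streams[i]);
}
```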
So, some of the use cases we're aiming for. DPDK workloads are typically proven out first in network nodes, and hopefully TLDK won't be any exception. We're going to start off at TCP aggregation points in the network, and there are two I find interesting. The first is transparent proxies, which are pretty widely deployed in the cellular network. Transparent proxies are what stop TCP going nuts when your cell phone disappears and reappears on the cellular network. What typically happens is that the transparent proxy watches the SYN go by between your cell phone and the server, and then creates a shadow connection. Once that shadow connection is set up, it ACKs on your behalf: as the cell phone communicates with the server, the proxy sends acknowledgments towards the server as the packets go by. That means that if your cell phone disappears for a few milliseconds and isn't there to send the ACKs, the server doesn't start resending those packets unnecessarily; and when your cell phone reappears on the network, the proxy sends on the packets it has been holding. So it stops the typical TCP loss-control mechanisms from kicking in.

The second use case we're going to go after is the reverse proxy load balancer. This is more of a data center use case. In the data center you typically have a reverse proxy load balancer that sits directly behind your front end. It terminates HTTPS, serves up static content itself, forwards requests for dynamic content to the servers behind it, and load-balances between them. So again, it's another example of a TCP aggregation point in the network that has stateful requirements. Both the TCP transparent proxy and the reverse proxy are on our short-term to-do list.

So, in summary: network operators and data centers are all optimizing TCP to improve the end-user experience and improve overall network utilization. User-space TCP stacks are growing in popularity; I think I counted 11 TCP stacks on top of DPDK and netmap while I was in the process of writing this presentation. You're probably familiar with some of them: ANS, lwIP, the TCP implementation in VPP, Seastar, mTCP; there are lots and lots of TCP implementations in user space. A lot of them reuse the FreeBSD or NetBSD or Linux kernel implementation, and many of them try to implement a BSD sockets API. That's not where we're going. We're going for the fastest possible TCP performance on top of a general-purpose processor, going after network node use cases. If those are things that are interesting to you, I'd love to hear from you; the details are on the slides, and I'll be available afterwards. We're a brand-new open source community and we're eager for contributions and for feedback telling us we're going the right way or the wrong way, and for contributed use cases or contributed code. We're very eager for fellow travelers to join us. I have a few minutes for questions if there are any.

I'm sorry? Okay, so the question is, do we plan any MPTCP support, multipath TCP support. That's an interesting one, because it's one of the ones I'm interested in. Not currently, but I'd love to hear more if that's something that's interesting to you, because I know it's heavily used, particularly on cell phones, for improved performance and mobility. So if that's something that's interesting, I'd love to talk about it afterwards. Any other questions? Hopefully next week; hopefully next week it should all be out. Any other questions?
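As a closing illustration of the split-connection pattern both of those proxy use cases share: the proxy terminates TCP independently on each side and shuttles bytes between the two connections. A minimal sketch in plain BSD sockets (my own illustration, not TLDK code; short writes and error handling are glossed over):

```c
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Relay bytes between the client-side and server-side connections of a
 * proxy. Because the proxy terminates TCP on both sides, each direction
 * has its own ACKs, windows, and retransmissions: the cell phone (or
 * browser) never talks TCP directly to the origin server. */
static void relay(int client_fd, int upstream_fd)
{
    char buf[4096];
    struct pollfd fds[2] = {
        { .fd = client_fd,   .events = POLLIN },
        { .fd = upstream_fd, .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0)
            break;

        for (int i = 0; i < 2; i++) {
            if (!(fds[i].revents & POLLIN))
                continue;
            ssize_t n = read(fds[i].fd, buf, sizeof(buf));
            if (n <= 0)
                return;                  /* one side closed: tear down */
            write(fds[i ^ 1].fd, buf, (size_t)n);  /* forward across */
        }
    }
}
```

Note this is the classic blocking design the talk argues against; the per-flow state it carries is exactly what makes these aggregation points a natural fit for a polling, batch-oriented stack like TLDK.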