All right, let's get going. I think the room is full. My name is Magnus Karlsson, I'm from Intel. And we just got a blank screen. Perfect. And you need to log in. I don't know your password, actually. Probably just as well. All right, thank you.

So what I want to talk about today is some fundamentally new technology that I think you really need to start working on to realize the vision of cloud native network functions. So what is cloud native? It's the buzzword of the year; everything is cloud native. I want to give you an overview of what I think it is. You might or might not agree, but this is my view.

When I think of cloud native, I think of many small network functions. Think of how Google does things, where you have thousands or millions of things running on big servers. They run in containers, which are basically just processes with namespaces. It is highly available: if something crashes, other processes take over, and a crash doesn't affect anything else in the system. It scales automatically: there's no person sitting there pinning this process here or instantiating another instance of the application on that core. It just scales, because if you have millions of nodes there's no way you can do this manually. It's secure, of course; that's very important today. And one of the most important things: it's deployable at scale. I mean a million nodes, a hundred million instances of applications running. And to get to that scale it has to be dead simple. There's no way you can do this if it's complicated.

And you have some kind of load balancing, in the picture on the right there. You have packets coming in and some kind of load balancer that balances between servers. It can be implemented with servers, it could be an appliance, it doesn't really matter here. And once you get to a server, you also have another layer of load balancing inside the server, between the apps, the processes, the containers running there. So there's a hierarchy of load balancing in the system. And best performance is not the main driver here; the main driver is all these other things. Of course you need good enough performance, because if you can only process one packet a second nobody is going to use this, but it's not the main driver.

Something to note about my talk today: cloud native systems that use the Linux stack are not the focus of this presentation. When I say cloud native network functions, I mean applications that want a raw Ethernet frame delivered into them and process it. This is the SDN room, so think of a firewall, a router, a switch, some network function implemented in your app. Cloud native systems using the Linux stack and L7 protocols and so on are really interesting, but they're not the focus here; those people are probably not even in this room.

Okay, so if we look at the requirements from the previous slide, what are the properties you need in your system to satisfy them? I've listed a number of properties here. Of course you need to be hardware agnostic, using Linux APIs only. It's really hard to be deployable at scale and to scale automatically if you depend on specific hardware, because at that scale your nodes are going to be very different. You can't have a million nodes that are all exactly the same. It just doesn't work.
So you have to abstract away the hardware. You need fault isolation and restartability, because we said there are many, many processes here, it should be highly available, and it should be secure. So you need fault isolation, and everything needs to be restartable. You can't just restart the whole system; that doesn't work. You restart a single process.

And because we're talking about millions of processes, you have to be able to run multiple software versions at the same time, and it should be upgradable during runtime, because you can't stop the system and upgrade all one million instances at once. You're always going to have many different versions of your software running at the same time, and they have to talk to each other and work. You're going to have multiple versions of the Linux operating system and multiple generations of hardware in this system, simply because it's so large.

You also need to be able to support many processes per core, because one way to achieve automatic scalability is to sprinkle processes everywhere and let the Linux scheduler take care of it. If some process is used more than another, it gets moved to its own core, and so on. And to realize this you also need power save, because if you're busy-polling everywhere it's really hard to yield to something that's more important or more heavily used.

It should be secure, so we want all the security features of Linux to keep working; none of them should have to be disabled. You can pick and choose depending on your system, of course, but all of them should be available if you want to use them. And it has to be debuggable and observable: with a million nodes it's really important that the running system can be debugged and observed. You can't just attach GDB, stop the system and single-step through your problem. It has to be debuggable and observable through runtime tracing. And you want the routing and switching in the kernel to do this load balancing. And of course it has to be binary compatible and work on any standard Linux, because there are going to be so many different versions of Linux in a system this big; if it doesn't run on any standard Linux, it's hard to deploy at scale.

So the desired system I'd like to point at looks like this: we have a number of cores, a standard Linux on top of them, and on every core one or more processes running. These are our apps, the cloud native network functions. Inside such an app there's the user application, and at the bottom of it a packet access library — think DPDK or VPP or something like that — that receives raw packets into the process and provides them to the user app. And if you go back and look at all these properties: if you just put all the drivers inside the Linux kernel, you get all of these properties by default. An ordinary process running on Linux already has them. So the key to realizing this is to make sure all the drivers are in the kernel, and you get all of these things, which is really nice. So that's what we're trying to do here: let's say all drivers are in the kernel — what do we need to do to realize this?
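To make that picture a bit more concrete, here is a minimal sketch of what such a process could look like today, using AF_XDP (one of the kernel interfaces covered below) through libxdp's xsk helpers. It is illustrative only: the interface name "eth0", queue 0 and the frame counts are assumptions, and error handling and frame recycling are omitted.

```c
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>
#include <xdp/xsk.h>          /* libxdp; older setups use <bpf/xsk.h> */

#define NUM_FRAMES 2048
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

int main(void)
{
    struct xsk_ring_prod fq, tx;
    struct xsk_ring_cons cq, rx;
    struct xsk_umem *umem;
    struct xsk_socket *xsk;
    unsigned int idx, i;
    void *bufs;

    /* Plain pageable memory is fine: no huge pages, no pinned cores. */
    bufs = aligned_alloc(getpagesize(), NUM_FRAMES * FRAME_SIZE);
    xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE, &fq, &cq, NULL);

    /* Bind to one RX/TX queue pair of the (assumed) interface "eth0". */
    struct xsk_socket_config cfg = {
        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
    };
    xsk_socket__create(&xsk, "eth0", 0 /* queue_id */, umem, &rx, &tx, &cfg);

    /* Hand all frames to the kernel so the driver can fill them with packets. */
    xsk_ring_prod__reserve(&fq, NUM_FRAMES, &idx);
    for (i = 0; i < NUM_FRAMES; i++)
        *xsk_ring_prod__fill_addr(&fq, idx + i) = (__u64)i * FRAME_SIZE;
    xsk_ring_prod__submit(&fq, NUM_FRAMES);

    struct pollfd pfd = { .fd = xsk_socket__fd(xsk), .events = POLLIN };
    for (;;) {
        poll(&pfd, 1, -1);      /* sleep instead of busy-polling: power save */

        unsigned int rcvd = xsk_ring_cons__peek(&rx, 64, &idx);
        for (i = 0; i < rcvd; i++) {
            const struct xdp_desc *d = xsk_ring_cons__rx_desc(&rx, idx + i);
            void *frame = xsk_umem__get_data(bufs, d->addr);

            /* d->len bytes of raw Ethernet frame: the network function runs here. */
            (void)frame;
        }
        xsk_ring_cons__release(&rx, rcvd);
        /* A real application would recycle the frames back to the fill ring here. */
    }
}
```

Note that the process above is just an ordinary Linux process: no SR-IOV, no user-space driver, no pinned resources, which is exactly the point of the properties list.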
So my goal for this cloud native network function, or this data plane, is dead simple, out-of-the-box cloud native networking. It has to just work. It has to be dead simple, it has to have all the properties I listed before, and it should be supported by all major distributions: it should go into kernel.org so that you get it in Red Hat, Ubuntu, SUSE and all the others. It has to be binary backwards and forwards compatible, so something you build today should also run on the hardware that comes out in two years, because you might still have that binary, that piece of software, in your system. And of course it has to have good enough performance, but that's not the main driver.

The way we're going to realize this: to get traffic into the application we have AF_XDP today for fast network traffic, for drivers for accelerators we can use io_uring, and of course there's also virtio-net as another option for getting network traffic into your app through a software abstraction. So we're just going to use those three interfaces and say that all drivers live in the Linux kernel.

Okay, so if we say that all drivers should reside inside the Linux kernel, it's important to be clear that besides the desired properties, there are also properties we cannot have if we're going to realize all of this. For example, we can't have SR-IOV. SR-IOV exposes hardware directly into a process, and then we lose a lot of these properties: it's certainly not hardware agnostic anymore, among other things. The same goes for user-space drivers, which go hand in hand with SR-IOV: if you have SR-IOV you need user-space drivers, unless we can agree on a single standard for all NICs, and that hasn't happened yet.

We also can't use pinned cores and memory. Well, you can, but you're going to end up with a system so expensive that you can't afford to run it, because every single process — and we have hundreds or thousands of these per core — would have its own pinned cores and pinned memory. That just doesn't work; it's too expensive. It's also hard to use busy polling and at the same time have many processes per core, or have power save. We can do a little bit of busy polling, but at some point we have to yield to some other process or to the system.

And huge pages. There's nothing inherently wrong with huge pages, but the problem is that they are a finite resource not managed by the operating system. If you have a thousand processes on your system, who gets the huge pages? Today it's just the first one that comes up and grabs them. Any finite hardware resource that isn't managed by the operating system is something we can't expose and can't use, because then it won't be simple and it won't scale.

Another thing we can't use is shared memory, because with shared memory we lose fault isolation and restartability. We can have shared memory in some cases: we can share packets, but we probably can't share the control structures that manage those packets, because it's going to be really, really hard to implement an algorithm that still gives you fault isolation, restartability and multiple software versions over a shared-memory control interface between two entities that can come up and go down whenever they want.
Maybe that's solvable, but it's really, really hard, so better not to rely on it in these circumstances. For plain packet buffer sharing it's very likely fine.

We also can't have a one-to-one virtual-to-physical memory mapping, because if you have a thousand processes they would all have to get it, and it's very unlikely they all can; it's a finite resource too. Also, we don't want more than one crossing of the user/kernel space boundary, and this is for performance reasons: every time you cross that boundary you pay a penalty, so you don't want to bounce back and forth like crazy. If you need to do it, do it once — kernel to user space on ingress, user space to kernel on egress — but once.

We also don't want monolithic software, because that means a big blob, and that big blob has to be loaded into a thousand processes. That gets very expensive too; you end up with a system with lots of memory, and DRAM is expensive. Custom kernel modules, same thing: we don't want those, because this has to work on any standard Linux. We also can't have a complete kernel bypass, because these network function applications that take raw frames are going to coexist with standard applications, and if we bypass the whole kernel a lot of these properties start to break down: maybe we don't get power save, we don't get all the security features working, and so on. And we don't want a hard-coded platform, because this should be flexible: we might have one process per core, we might have a hundred processes per core, they might be pinned for performance reasons or they might float around. I don't know; it's up to whoever builds the system.

The problem is that the list on the right — the things I'm saying don't use — is pretty much the list of what we're using today. This is what DPDK, VPP and everything else uses today to get high-performance networking into a process or a virtual machine, and that of course is the challenge of today's talk: I'm telling you not to use it. The key issue is that with SR-IOV and user-space drivers you get to use all the features the NIC has, all of them, to get really good performance. But the main purpose of an operating system is to abstract away the hardware, so that you can run anything on it. So by definition, an operating system cannot expose every feature the NIC has; there are always going to be far fewer features at the operating system API level than in the hardware.

So what I'm asking today is: what do we need to develop in Linux to make this cloud native network function, with all drivers inside Linux, feasible? What do we need to work on over the next few years to get to the point where you actually get good enough performance and you get the properties I told you about? I'll summarize them here and then go into more detail. The first one is that we need to support metadata and offloads in XDP and AF_XDP, and get them up to the application, because there are a lot of features in the NIC that we really need.
I mean, some of these accelerators and offloads you really need in order to get high performance — just think of checksumming or TSO and things like that. That does not exist today: you don't get those features for raw packets in Linux user space. Some of them can be used by the kernel stack, but they can't be used for raw packets delivered directly to user space. So that's something we need to do.

We also need to make it easy to orchestrate and control, because what you're going to end up with is a slow path — a path going through the Linux stack up to some applications — alongside the kind of applications I'm talking about here that just take raw packets, and they need to coexist. But they're probably only going to be controlled by Kubernetes and Cilium and those things, through the Linux stack. Today these are two completely different worlds that don't talk to each other, and that's a problem. I have a couple of slides on that.

Something else we need to work on is queue management. Because you want zero-copy semantics in user space, you have to dedicate hardware queues of the NIC to get packets into user space, and there is no such management at all from user space in Linux today. The queues are just hidden and abstracted away, and we can't have that anymore. We really need to expose these resources and be able to allocate and free them from user space. I'll talk about that in a couple of slides too.

And the last one is that we really need a packet access library designed for cloud native and Linux, because today's libraries are designed for the right-hand column, and I'm saying we need something designed for the left-hand column. Those are completely different things. So we'll talk about that too: do we evolve DPDK, do we evolve something else, do we build something new? Or is there a DPDK revolution? I don't know, but we'll have a chat about that at the end.

Okay, let's go into some details, starting with metadata and offloads. I think this is similar to what Thomas talked about at the DPDK event — you're over there; you have some similar ideas on this, I believe — so you can check that DPDK talk from last year. What we want here is this: we have a piece of hardware, and it supports some offloads, some kind of metadata. In this example it's an RX timestamp, where the IPv4 header starts, where the IPv6 header starts, and whether the UDP checksum is OK or not. These are just examples; say it supports these four. Then you have a piece of software that wants to use a subset of these — in this case the RX timestamp and the IPv4 header offset.

The idea we had in the XDP community is: let's describe this with BTF, the format that goes along with eBPF for describing the layout of structures and other things. We describe what the hardware provides — the timestamp is in this location, the IPv4 header offset is in that location, in the hardware's own structures — and we describe what the software needs. Then you combine the two and generate a structure: I know the hardware puts the IPv4 header offset here and it's a u16, so it goes here; then there's some other stuff I'm not going to use; and then the RX timestamp comes here. So it's a one-to-one mapping onto the structure inside the NIC, which is good.
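As a rough illustration of the idea — the layouts below are made up, not any real driver's, and the existing XDP metadata area (data_meta) stands in for the mechanism that would put the NIC's hints in front of the frame:

```c
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct hw_rx_meta {            /* everything this (made-up) NIC can provide */
    __u64 rx_timestamp;        /* RX timestamp in nanoseconds               */
    __u16 ipv4_hdr_off;        /* offset of the IPv4 header, 0 if none      */
    __u16 ipv6_hdr_off;        /* offset of the IPv6 header, 0 if none      */
    __u8  udp_csum_ok;         /* 1 if the UDP checksum verified            */
    __u8  pad[3];
};

struct app_rx_meta {           /* the subset this application asked for     */
    __u64 rx_timestamp;
    __u16 ipv4_hdr_off;
};

SEC("xdp")
int rx_with_hints(struct xdp_md *ctx)
{
    void *data      = (void *)(long)ctx->data;
    void *data_meta = (void *)(long)ctx->data_meta;
    struct app_rx_meta *meta = data_meta;

    /* Bounds check required by the verifier. */
    if ((void *)(meta + 1) > data)
        return XDP_PASS;

    /* With BTF describing both sides, the JIT could turn these two reads
     * into plain loads at the right offsets in the NIC's own metadata,
     * with no mbuf or skb in between. */
    __u64 ts  = meta->rx_timestamp;
    __u16 off = meta->ipv4_hdr_off;
    if (ts && off)
        return XDP_PASS;   /* e.g. steer, timestamp, or redirect to an AF_XDP socket here */

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```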
With that layout I don't have to move things around; I can just expose the metadata section the NIC writes and get this mapping directly, which is good. And because XDP has a JIT — once I load the program it's jitted into x86 or whatever instructions — I can recompile the program knowing exactly where the IPv4 header offset will be and exactly where the RX timestamp will be, so the generated code points straight at those locations. I don't need an mbuf or an skb. An mbuf or an skb is a big structure holding everything you could potentially use, and the driver has to fill it all in, copying all that data over. In this case we know exactly what we want to use and exactly where it is, so we can generate code that accesses it directly: if I want the IPv4 header offset, it's just a load from that location, and the program simply knows that. So there's basically no overhead beyond the access itself: no intermediate structure, no penalty for populating fields you never use.

The problem is that while this works great for XDP, because the program can be specialized at JIT time, for AF_XDP it's more problematic. With AF_XDP you create your socket, do some setup in your program, and then at some point you do a bind. Only at bind time do you decide which hardware device you're binding to — oh, it's an Intel NIC; no, it was a Mellanox NIC — and of course these metadata structures look completely different on different NICs, so you'd have to recompile again. So really you have to do a relocation at bind time with AF_XDP: once you bind, you have to relocate your data plane code so that it works against the NIC you actually bound to. With XDP you can do it at JIT time, but with AF_XDP it has to happen at bind, and that's going to be a bit more problematic. Another thing: accelerators are probably going to use io_uring, because io_uring is for devices like storage that aren't netdevs — something initiated from user space, you tell the accelerator to do something and it replies. How we carry metadata over io_uring, I don't know, but it also needs to be extended. So this is a big area, and it's going to take a lot of time to iron all of this out.

Okay, another thing: controlling the fast path from Linux. I said we have a slow path going through the Linux stack, and a fast path. Actually this picture is a little bit wrong — the XDP layer should sit underneath both of them, not only the fast path; it's underneath the slow path too. What we want to achieve is that the Linux control path sets up all the actions for both the slow path and the fast path. The fast path, because it's using AF_XDP or io_uring or something with dedicated hardware queues, needs its actions programmed into the NIC: if the NIC can steer the packets, great, set up the NIC; if it can't, you can do it in XDP. And a good thing here is that all packets pass the XDP layer, even the ones going to the slow path.
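Because every packet traverses XDP, the fast path can consult the very same state the kernel's slow path uses. One helper along these lines that already exists is bpf_fib_lookup(), which queries the kernel's own routing table from XDP. A minimal, hedged sketch of forwarding straight out of that table (IPv4 only, parsing and error handling trimmed):

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("xdp")
int route_via_kernel_fib(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr  *iph = data + sizeof(*eth);

    if ((void *)(iph + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;                    /* let the slow path deal with it */

    struct bpf_fib_lookup fib = {
        .family   = AF_INET,
        .ifindex  = ctx->ingress_ifindex,
        .ipv4_src = iph->saddr,
        .ipv4_dst = iph->daddr,
    };

    /* Same verdict the Linux stack itself would reach for this packet. */
    if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
        return XDP_PASS;

    /* Rewrite the MAC addresses from the FIB result and forward out the chosen port. */
    __builtin_memcpy(eth->h_dest,   fib.dmac, ETH_ALEN);
    __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
    return bpf_redirect(fib.ifindex, 0);
}

char LICENSE[] SEC("license") = "GPL";
```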
So if we want something like a common routing table, for example, we can use helpers in XDP, as in the sketch just above. Because all traffic passes XDP, we can look up the route table from inside XDP, and it will be exactly the same table that the Linux stack uses. This is what we call a helper in XDP: you call it from your XDP program, it looks up state inside the Linux routing stack and gives you a verdict — route it, pass it up to user space, drop it, whatever. We really need to start developing these helpers. There are very few of them today, one or two, and there need to be many, many more for this kind of coexistence to work. Another challenge, of course, is that if you really want high speed you can't do everything in XDP. XDP is a very generic and good fallback, but if you put everything there it might be too slow. So you want to be able to push the things the hardware can do into the hardware, and the rest into your XDP program. That's another challenge in this area.

Another thing we really have to make sure of is that orchestration is dead simple. Orchestration in general is complicated, but what I want to talk about here is orchestrating netdevs. What happens today is that you have a number of netdevs in Linux, usually representing ports, and the orchestrator takes a netdev and puts it into the pod's namespace. It disappears from the main namespace down in the Linux blob and appears in the pod's namespace, and then the pod can use it. The problem is that AF_XDP, and io_uring as well, need a netdev with real hardware queues — otherwise you can't get zero copy; you need real hardware queues to put the packets into user space directly. So how do you create one of those today? You can't. A netdev today represents a port; it contains a number of queues, all abstracted away — in fact it contains as many queues as you have cores in the system. If you move that into a pod's namespace, you just gave away the whole port, and it can only be used by that pod. That's not what you want: you plug a cable into the port and you want to fan the traffic out to multiple pods, but I just handed the entire port to one pod. There's actually no way to do this today. What you want is to be able to split that netdev up into multiple netdevs, each with maybe one RX/TX queue pair, and hand those out to individual pods.

So how do you create one of those? Maybe with macvlan: macvlan takes a netdev and says, I'm going to put a new MAC address on this netdev, and if there's support for it in the driver, this macvlan netdev gets a real hardware queue pair. If I can do that, maybe I can take that and move it into the pod's namespace, create a bunch of them and hand them out to different pods. Maybe that's doable, but it amounts to saying I'm only going to steer traffic based on MAC address. What if I want to steer based on VLAN or IPv4? That doesn't work; it's MAC address only. So we probably need something more flexible here. And the last thing there: pods also need to have all their memory pre-allocated.
That's something we have to work on, because pods are supposed to be simple — and they also have to run with least privileges — so maybe they should get all of these things pre-allocated. We can't leave it to the pod. That's something we need to look into.

The next thing is queue management. Let me explain it like this: every NIC contains a number of queues, a number of physical functions and a number of virtual functions, and there are a number of steps early on, when the NIC comes up, maybe even in firmware, where these queues get split up among the PFs and VFs — in this example 64 to the PFs and 16 to the VFs. At some point this gets represented by a netdev in Linux; here it's eth0, a netdev that in this case contains 48 queues. What we'd like to do is split those 48 queues between different users — the Linux stack, XDP, or AF_XDP applications. The problem I'm looking at is just these last two steps: given a netdev with queues, how can I create new netdevs with a subset of those queues? Basically something we could use for the problem we just saw: I have this netdev, this port, and I want to create sub-netdevs, child netdevs, each containing a number of queues that I decide.

The design I'm looking into implementing is a queue manager inside the kernel that takes care of all of this. Everyone who wants a queue asks the queue manager. For example, the Linux stack needs one queue per core, like I said before — one RX and one TX per core — so it asks the queue manager, the queue manager asks the right device driver to allocate them, and they're handed to the Linux stack. XDP needs TX queues too. And of course we also need an interface towards user space, so we can drive this from Kubernetes orchestration, or really from pod management. The interface we want there is something where you can say: given this netdev, give me two RX queues and five TX queues. Then, in the previous example, I could just create an empty netdev — if the device driver lets me — and onto that empty netdev allocate however many queues my app needs. Five TX and one RX? Allocate five TX and one RX, done. That would be very neat.

But this is a big project, because there are plenty of device drivers out there, and most of them were not written for dynamic allocation of queues. We have to plug this in in such a way that the basic support works without changing any drivers — that path should always work and shouldn't require any driver changes — because we can't go and change two, three, four hundred drivers. That just doesn't work.
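To give a feel for the shape of the user-space interface mentioned above — and to be clear, nothing like this exists in the kernel today; the structure, the names and submit_queue_request() are all invented for illustration — a request might look roughly like this:

```c
#include <stdio.h>
#include <net/if.h>

struct queue_request {
    char     parent[IFNAMSIZ];   /* the physical port, e.g. "eth0"        */
    char     child[IFNAMSIZ];    /* new child netdev to hand to one pod   */
    unsigned rx_queues;          /* real hardware RX queues to move over  */
    unsigned tx_queues;          /* real hardware TX queues to move over  */
};

static int submit_queue_request(const struct queue_request *req)
{
    /* Stand-in for the missing kernel interface: just show the intent.
     * A real design would likely carry this over netlink to the in-kernel
     * queue manager, which would ask the driver to carve out the queues. */
    printf("%s: create %s with %u RX + %u TX hardware queues\n",
           req->parent, req->child, req->rx_queues, req->tx_queues);
    return 0;
}

int main(void)
{
    /* "Given this netdev, give me one RX queue and five TX queues." */
    struct queue_request req = {
        .parent = "eth0", .child = "eth0pod3",
        .rx_queues = 1, .tx_queues = 5,
    };
    return submit_queue_request(&req);
}
```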
How many minutes do I have left? Minus six? Not minus six, plus six. Exactly. Okay, last slide. We also need a cloud native packet access library, and I'll leave it to you to think about whether that is something new, an evolution of DPDK, an evolution of something else, or a revolution of something we already have.

I'm just going to list a couple of properties I would like to see in such a library. I know there are some talks later this evening about DPDK moving in this direction, so that's definitely one option — stay around and listen to those. Properties I would like: all drivers in kernel space, or at least support for drivers in kernel space. It should be a set of small libraries: I just want to use the pieces I need, not the whole thing, because apps are different, and if this is going to scale it can't be one big thing I have to link against every single time. No hardware exposed to user space, in the data path or in the control path — I don't care which hardware it is, I don't want to know. And it must not force a platform on its users, because a lot of people in the cloud native area already have a platform and don't want to change it. It has to fit into whatever platform they use, so it should just be simple libraries — no config or launch framework or anything like that, because that ties you to a platform. It should work in both processes and threads, in any configuration; we shouldn't care, because that again is part of the platform somebody decided was right for their application, their system, their data center. Don't expose mbufs or skbs or structures like that to applications; try to hide them, because they tend to explode in size, since everyone wants everything in them. And we do not want applications to be able to crash each other: it should really be fault tolerant, so maybe don't rely on shared memory, at least not for control. For packets it's fine; we probably need that for performance. And it has to be debuggable, observable and testable from day one. It has to be there from the start; don't try to build it in later, because nobody wants to pay for it then. The rule of thumb is that something like 20% of your time in an app goes into statistics, and it's very hard to argue with your program or project managers that you suddenly want to spend 20% of the time on that, or to take it out of your app later. But you really need it, because otherwise it won't scale and you won't be able to find your bugs. And first optimize for ease of use and the right functionality, then for performance. Donald Knuth said that premature optimization is the root of all evil: you have to get people to use it first, then you can make it go fast.

All right, to conclude. Cloud native network functions are a completely different setting from an appliance or a virtual machine. Most of the challenges, I believe, you can solve by putting all the drivers in the kernel. Push them down there and you get the properties, because the operating system already has the properties you need — but you have to push all the drivers down into the kernel for that. The problem, of course, is that Linux is definitely not ready for this, so there are a number of things we have to work on, and it's going to take years: metadata and offloads; controlling the data plane from the Linux stack so it can coexist nicely with the Linux stack; orchestration support that is dead simple and works with Kubernetes; and queue management, because otherwise you're not going to get zero copy and you're not going to get the performance you want.
And we have a couple of new requirements on packet access libraries that we need to work on too. How do we get there — an evolution of DPDK? Well, we have a couple of talks on that. Or is it something completely new, or a revolution of DPDK? I don't know; that's all up for discussion. All right, thank you very much. Questions? Actually, I don't know if we have any time, I have no idea. All right. Someone there? Should we pass the mic? I'll just repeat the questions. All right. You want to shout? All right. All right, you can just come up afterwards. All right.