Thank you, Michael, for the introduction. I was expecting a roast, but no, a very generous introduction. Michael, by the way, is one of the few students who's ever complained when offered an extension, so I'm not sure what that means either. For those of you who've been sitting here for the last couple of minutes, you might have had a look at this pretty little graphic I pulled off Wikipedia. It shows how you tie a particular kind of splice. Depending on what mood I'm in, sometimes that seems more interesting than what I'm about to talk about. If we have extra time, which is possible, because until five minutes ago I thought this was a 25-minute slot but in fact we have 45 minutes, one of two things will happen: we'll finish early, find some rope, and see if we can do that splice, or what was potentially going to be a sprint might become a more sensibly paced presentation.

Anyway: splice and vmsplice for custom embedded Linux architectures. I'm amazed there's anyone here at all; if ever there was an esoteric topic, this is it. splice and vmsplice are a pair of system calls that made their way into the Linux kernel around the 2.6.17 era and have pretty much sat there ever since. They've been fixed up and there's been a little work on them, but not a lot of people seem to know they're there or do useful things with them. One way to think of them is as a generalization of the sendfile system call. splice is a way of moving data between two file descriptors without actually having to copy it. vmsplice, which I'll cover in more detail later, is analogous, but it's a way of getting data from user-space pages, mapped at some virtual address, into a file descriptor, again without copying and without having to loop through the data doing read/write copies. These calls caught my eye when they first popped up a few years ago, and I'd been looking for an excuse to do something with them. Then last year a commercial project for my company came along; I started thinking sendfile, and then I thought, hang on, no, splice is really the way to go. So I got to play with this stuff and see what I could do. Has anybody in here heard of splice before? Anybody actually used it for anything real and interesting? A couple, okay, we should get together later; I'd be interested to hear what you did with it. And you can heckle, because I'm probably going to get some details wrong. So that's splice and vmsplice in a nutshell. These first two slides are just so that if you're in the wrong talk you can get out early without wasting too much time.

The second part: custom embedded Linux architectures. Starting around 2002, I ported the kernel to the MicroBlaze, a CPU architecture that only exists inside a programmable FPGA. You can't go out and buy a MicroBlaze chipset, but you can buy an FPGA and configure it to be a MicroBlaze, which is pretty cool; you can do some interesting stuff with that. As of a couple of years ago, that's pretty much what I do full-time: we do Linux and FPGAs for customers who are building real products, some interesting stuff as well.
This is what I do on a daily basis, and I really enjoy it. So what I'm going to talk about is: what are some of the problems that arise when you're building these systems that splice can help you with? Here, broadly, is my agenda. I was going to cut a lot of this because I thought I had a really tight slot, but in fact I'm going to drag you through FPGAs and system-on-chip 101. Again, show of hands: you know what an FPGA is? Yeah, excellent. You've done something with them? Cool, very good. You're probably exactly the same people who were in my talk on Tuesday afternoon; it's about the same number of hands. But I'll go through some of that quickly, so people new to the topic can come along for the ride. Then I'll talk in a little more detail about splice and vmsplice, and then about this recurring architectural problem that keeps coming up for our customers.

I should have said already: I do embedded. Everything we deal with is embedded, not high-performance computing, not desktop. And when you're working in embedded Linux, particularly in the FPGA space, you're working with fairly constrained devices. These are not multi-gigahertz CPUs. Some things you might think are trivial problems from a performance perspective at the desktop level actually require fairly careful thought on these relatively underpowered architectures. This theme keeps coming up, and this is the architecture we're using to solve it, so I thought it might be a good idea to come here and talk about what we see as a fairly elegant solution, which, it won't surprise you, involves the use of splice and vmsplice.

So, FPGAs 101. This is a technical-marketing training picture from Xilinx, one of the two big FPGA vendors; Altera is the other. Between them they have about 90% of the global market, and the other 10% is down in the noise among a few other vendors. You can think of an FPGA as essentially a very large, regularly tiled array of programmable logic primitives, familiar to anyone who's done a digital design course. Those primitives are the CLB blocks in the diagram; I don't have a zoom, but inside each one there's essentially a four-input lookup table. It takes four binary inputs and generates a single-bit output, and the table is programmable, so it can implement any logic function of four inputs. There's usually also a register, a single-bit memory element that can optionally hold the value present on its input at the previous clock edge. The Wikipedia page on FPGAs is actually pretty good; if you check it out you'll learn a lot of the basics. That's a simplification, and each vendor has variants on the theme, but in essence that's what a CLB is. There are lots and lots of these CLBs, and they're embedded in an incredibly dense routing fabric: a massively connected network of wires with programmable silicon switches between them.
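[To make the lookup-table idea concrete: a 4-input LUT is conceptually just a 16-entry truth table, so any logic function of four inputs is one 16-bit constant. A toy C model of that idea, an illustration rather than vendor hardware:]

#include <stdint.h>

/* Conceptual model of a 4-input LUT: the configuration bitstream supplies
 * the 16-bit truth table; the four inputs select one bit of it. */
static int lut4(uint16_t truth_table, int a, int b, int c, int d)
{
    unsigned index = (a << 3) | (b << 2) | (c << 1) | d; /* inputs pick a row */
    return (truth_table >> index) & 1;                   /* that row is the output */
}

/* Example: 0x8000 has only bit 15 set, so lut4(0x8000, a, b, c, d)
 * behaves as a 4-input AND gate. */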
So, back to that routing fabric: you can control, at the bit level, exactly which input on one CLB connects to which output on another, and so on. And as I laboured the point a little in my talk on Tuesday afternoon, that is an appallingly low level of abstraction at which to think about solving real problems. Anyway, what's this got to do with CPUs? Well, CPUs are just digital logic. There's nothing magical in a CPU; it's just a collection of logic gates, registers, and storage elements that happens to form a digital circuit that behaves like a CPU. So you can use these programmable logic resources to create a CPU inside the FPGA.

The other thing that probably everyone knows: when you turn an FPGA on, it does nothing. It just sits there asking, what am I going to be today? Until you program it with what's called a bitstream, which defines the state of all those lookup tables, all that routing and so on, it's just latent logic, if you like. Once you configure it, it can actually be something. One of those things is a CPU, but also memory, either on-chip memory or, more likely in the kinds of systems we deal with, controllers talking to off-chip memory, so DDR memory controllers and so on, plus any custom I/O. Modern FPGAs are now large enough, fast enough, and some of them cheap enough that you would consider putting a complete system-on-chip inside one: CPU, BIOS, memory interfaces, and most of the digital logic layers, such as Ethernet MACs and serial UARTs. You still need the physical stuff at the edges, the magnetics and the PHY for Ethernet, RS-232 transceivers and so on, but you can put quite sophisticated systems-on-chip inside a single FPGA. And that's what we do.

Now, designing a CPU up from the bit level is a tremendously rewarding educational experience, but it's not a very productive way to get products out the door. You'll have seen, I guess, Jon Corbet mention in his Kernel Report the embedded developer's conundrum: these people, our customers, work under the most insane deadlines and really rapid product cycles, so productivity is what it's all about. A little terminology: an IP core, where IP does unfortunately stand for intellectual property, is really just a hardware module or component. A CPU is an IP core; so is an Ethernet MAC or a memory controller, perhaps a DSP, or some custom thing you might build, like a crypto co-processor. They live inside the FPGA and are created out of those programmable FPGA resources. We talk about soft IP cores, meaning the general-purpose programmable fabric is configured to create them, and you can have any number of instances, subject to device capacity. Give me a big enough FPGA and enough time to compile the design, and I'll give you a 100-core CPU system later this afternoon. You can spend the next 12 months figuring out how to program it, but I can build you the hardware in a very short period of time, because we just drop the cores down, cookie-cutter style, and the tools do the low-level hard work of turning that into an FPGA bitstream.
This is really nice, because I think of it as playing Lego with computer architecture: if you can think of it, you can probably build it. The other thing about soft IP cores is that they're parameterizable at compile time, and I'll give you an example of what that means for a CPU right now. Here is a design file from one of the Xilinx tools, their higher-level design tools, because obviously we don't design this stuff at the bit level. Here's a bunch of parameters: I've said I want an 8K instruction cache and an 8K data cache, and when I run this and the rest of the system through the tools, it'll give me a CPU with caches that size. I've said I want a hardware barrel-shift instruction and a hardware divide instruction; if I don't ask for them, the compiler knows that and will just emit multiple opcodes to emulate those instructions instead. I could optionally have an MMU or a floating-point unit, and so on. That's what I mean by parameterizable: we've got the concept of a CPU, plus any number of ways it can be tweaked and tuned. So that's soft IP cores.

The other interesting kind is hard IP cores. I lied before when I said FPGAs were just an array of programmable fabric, because increasingly the modern ones have little blobs of hard, fixed silicon on the die: memory, increasingly, but also CPUs. Some FPGA families have CPUs in the die, so you don't have to spend programmable logic resources to get a CPU in there, and you can clock them a lot faster. But that configuration is predetermined by the FPGA vendor; it's in the silicon when it comes from the fab. You can choose not to use it, but you can't get rid of it and use those transistors for something else. Some of the older Xilinx FPGAs had a PowerPC core in there. Zynq is Xilinx's next-generation architecture: a very, very sexy bit of hardware, a dual-core Cortex-A9 full system-on-chip sitting on the die with a bunch of programmable fabric next to it. Can't wait to get my hands on those, but not yet; they're coming.

Anyway, embedded processing in FPGAs is not entirely new. The first commercially viable options came out around 2000 or 2001, and I started playing with MicroBlaze in 2002; I found a lot of compiler bugs for Xilinx, there was a lot wrong with it back then. The thing you find is that all of these so-called soft CPUs look pretty much the same: classic 32-bit RISC architectures with configurable ALUs. The reason is that this style of architecture maps nicely onto the common underlying structure of FPGAs. You don't find things like microcoded architectures, for example; that just doesn't fly. And usually they have either some way of adding custom instructions to the CPU, so if you've got a code path you execute a lot you can turn it into a single instruction and inject it, or some kind of co-processor interconnect, so you can build a co-processor attached to the CPU and hand off your crypto, or whatever, to it.
The performance of these CPUs depends on a lot of variables, but typically we're talking clock frequencies of around 80 to 150 megahertz. These are not high-performance CPUs, so if you need high performance, you've got to find it elsewhere in the FPGA fabric. You're not doing serious number-crunching DSP data processing in a 100 megahertz CPU; that's a sure way to lose. And that's part of what motivated this work.

I work almost entirely with the Xilinx architecture, an accident of both history and, more recently, commerce, but there are other options, including open source ones. Some of you might have seen Julius Baxter's talk on OpenRISC the other day, which was an interesting talk on an interesting topic: an open source CPU architecture. So not only can you run open source software, you can do so on an open source CPU, which is very cool. All we need now is an open source FPGA tool flow and we'll be set, but that's a long way off. LEON and its relatives are essentially SPARC clones. One thing you tend to find is that these cores are a bit slower and bigger than their vendor-tuned cousins, because the FPGA vendors tune the implementation of their CPUs to their own very low-level FPGA internals. But they're getting better; some of the results I'm seeing out of OpenRISC now look promising enough for further investigation. Commercial soft CPUs tend to be black boxes: you don't get the source code, just a parameterizable netlist, in synthesis terminology. The Mico32, from a smaller FPGA vendor called Lattice Semiconductor, is an exception; I believe they do give you the full source code for their CPU, so in principle you could take it and synthesize it for another FPGA vendor's parts. They might have rules preventing you from doing that. It's an open source CPU, but not a capital-F Free one; I don't think it's GPL-licensed, for example.

Why put Linux on an FPGA? Well, I shouldn't have to answer that question here. Linux is just such a wonderful embedded platform, with the stuff everyone wants to have: a full network stack, web interfaces, security. All the reasons Linux is making massive progress in the embedded space. The initial work to put Linux on the Xilinx FPGA was part of a research project I was involved in; we wanted to look at some of Linux's networking and security capabilities and see if we could accelerate them in FPGAs. And now this is where I make my commercial living. We are, in fact, a commercial Linux distribution vendor: we have a distribution with tools and support and all that sort of stuff, and we sell it to people who'd rather spend their time making cool stuff up at the top of the stack than hacking down at the bottom, because we've done the hacking down there. Once upon a time, embedded developers had to answer the question from their management: why Linux? Why are you going to use this? That's actually flipped in the last few years, and now the question they often have to answer is: well, why aren't you using Linux here? Which I think is a wonderful thing for our community. Okay, that's my background.
This might be a good time for questions, if anyone wants clarification on my lightning introduction to soft-core architectures. Otherwise I'll dive on. All right, you're all either stunned, knowledgeable, or asleep, or some mixture. Just wait for a mic, please.

I was just wondering if you'd ever seen anybody implement a CPU that didn't follow the Harvard CPU principle?

Yeah, good question: has anyone ever implemented a CPU that didn't follow the Harvard, RISC-y kind of architecture? Sure, people have done it. People have built custom DSPs; people have implemented things like the Texas Instruments DSP architectures in FPGAs; people have built vector processors. You can build any crazy old thing you like. I haven't seen it on a commercial basis, though. Well, actually, there's one vendor: a company called Mitrionics, who have a high-level C-to-FPGA solution. What they've actually got is this massively parallel vector micro-architecture, and they cross-compile your C code to fit that architecture. So that's a non-Harvard, vector-style architecture. But the other thing is, of course, if you're an FPGA vendor creating a CPU, you've got a lot more to do than just the CPU: you've got to do a GCC port, and if you're doing Linux you've got to port the kernel and the C libraries. There's all this stuff over the top, and if you get too bizarre architecturally, some of that other stuff gets pretty difficult, or more difficult, as well.

Okay, so: man splice. Let's get on to the actual topic. "Splice data to/from a pipe." There's the function prototype: a file descriptor in, a pointer to an offset, a file descriptor out with another optional offset pointer, a size, and some flags. splice moves data between two file descriptors without copying between kernel address space and user address space. That's pretty cool. There are some limitations: one of the two file descriptors has to be a pipe. So whenever you use splice, you first have to create a pipe, the two ends of it, and one of the file descriptors you pass has to be one end of that pipe. The other can be any other kind of file descriptor: a regular descriptor referring to a file in the filesystem, but also, say, a network socket.

So that's splice. The idea, and I probably should have drawn some pictures, is that we've got some file descriptor over here with data we want: a device, a file, a network socket. And there's another file descriptor over there where we want to get that data to. This is what sendfile was so good at in web servers: you could sendfile between a file descriptor and a socket without copying the data up into user space and back down into kernel space. That, of course, is where the name comes from: it's about splicing two file descriptors together and avoiding that copy overhead. So splice is used when you've got data in one file descriptor and you can splice it into a pipe.
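[To make the pattern concrete, here is a minimal C sketch of splicing one file descriptor to a socket through a pipe, including the second splice out of the pipe, which is where this is heading. It's an illustration of what's being described, not code from the talk; error handling is cut to the bone.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move len bytes from in_fd (e.g. a file) to sock_fd (e.g. a socket),
 * staging them through a pipe as splice() requires. */
static int splice_to_socket(int in_fd, int sock_fd, size_t len)
{
    int pfd[2];
    if (pipe(pfd) < 0)
        return -1;

    while (len > 0) {
        /* Source fd -> write end of the pipe, no user-space copy. */
        ssize_t n = splice(in_fd, NULL, pfd[1], NULL, len, SPLICE_F_MOVE);
        if (n <= 0)
            break;
        len -= (size_t)n;

        /* Read end of the pipe -> socket; drain what we just put in. */
        while (n > 0) {
            ssize_t m = splice(pfd[0], NULL, sock_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m <= 0) {
                close(pfd[0]);
                close(pfd[1]);
                return -1;
            }
            n -= m;
        }
    }
    close(pfd[0]);
    close(pfd[1]);
    return len == 0 ? 0 : -1;
}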
And then, as that sketch shows, what you typically end up doing is splicing the data out of the other end of the pipe into another file descriptor. The whole point is that if you get a bunch of conditions right, like the alignment of the data and a few other things, and give a few hints to the kernel, it can do this without even copying that memory internally inside the kernel. It can just reuse the page in the page cache and link that page into the pipe buffer: a little bit of pointer manipulation in the kernel instead of actually copying data around. Copying data is something CPUs are appalling at; it's a terrible thing to do to a CPU. Making it memcpy all day is a horrible thing to do to a CPU. So that's splice.

Its cousin vmsplice is used to splice user pages into a pipe. Say we're writing an app and we've got some numbers crunched, sitting up in a malloc'd buffer in user space, mapped into our VM, and we just want to push that data out over a file descriptor. Same limitation: you can only vmsplice into a pipe. But once the data is in a kernel pipe buffer, and it potentially gets there not by being copied but by that page being slotted into the pipe's linked list, you can splice it out to something else, rather than doing a big write system call. There's the function prototype. Again we've got a file descriptor, which has to be the write end of a pipe. For the user pages we pass in, we actually have to fill out one of these little iovec structures, which is kind of cool: it's essentially a scatter-gather list, so you don't have to have the data in a single chunk of memory. You can build up almost a chain of pages that you want to pass in, and the kernel does all the looping in the kernel, so you don't pay a system call entry for every little chunk of memory you push down. A nice little use of iovecs. Then there's the number of segments, and some flags, and I'll talk about some of those flags in a minute. What vmsplice does is attempt to link those user pages straight into the pipe buffer without actually copying them. Sometimes it can't; there are things you've got to get right with alignment and so on. But if you can meet those conditions... yeah, Mikey. Oh, sorry, wait for the mic.

Does that block? If you're writing into a pipe, does there need to be a receiving side? And does the vmsplice block whilst...?

Right, does vmsplice block? I think it depends whether you pass the non-blocking flag; I think you can choose. Can someone do man vmsplice and put their hand up? It's in the flags description. I'm pretty sure you can tell vmsplice not to block and just return immediately if it hits a blocking condition, and hopefully it gives you some way of telling that occurred. Oh, we've got another question over here, Michael. Ten minutes? Whoa, I'd better get moving.

Sorry, just quickly then. Once you do the vmsplice, if you modify the memory in your application before it's got out of the pipe buffer, is it modified? Or is there copy-on-write stuff happening here?

Okay, good question. Give me one slide. No, no, it means you're paying attention. Where's my...?
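[For reference, a minimal vmsplice sketch of what's just been described. The flag values in the comment are the ones documented in man 2 vmsplice; SPLICE_F_NONBLOCK is the one that answers the blocking question. The helper name and buffer are illustrative.]

#define _GNU_SOURCE
#include <fcntl.h>   /* SPLICE_F_MOVE 0x01, SPLICE_F_NONBLOCK 0x02,
                        SPLICE_F_MORE 0x04, SPLICE_F_GIFT 0x08 */
#include <sys/uio.h>

/* Hand a user-space buffer to the write end of a pipe. With
 * SPLICE_F_NONBLOCK the call fails with EAGAIN instead of blocking
 * when the pipe is full. */
static ssize_t buffer_into_pipe(int pipe_wr_fd, void *buf, size_t len)
{
    struct iovec iov = {
        .iov_base = buf,  /* user pages the kernel will try to link, not copy */
        .iov_len  = len,
    };
    return vmsplice(pipe_wr_fd, &iov, 1, SPLICE_F_NONBLOCK);
}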
No, okay, I'll come to it later. Actually, I'll answer the question now. There are some flags you can pass in that last argument, and one of them is called SPLICE_F_GIFT. What that is is a promise to the kernel. It says: I promise never to touch this page again, and if I do, all bets are off. I don't actually know what happens; I don't know if it segfaults you or if it's just chaos. But by passing that flag, you're telling the kernel: look, I'm done with this, you can take it out of my VM and do whatever you need to achieve this zero-copy purpose. I do come back to that a little later on.

Google around for splice and vmsplice and you'll find a few examples. I think one of the guys from Free Electrons gave a presentation on this in the last couple of years, so you can find their slides. It actually takes a bit of work: these system calls are kind of hard to use. It took us a few hours to get a working program just moving data from one file to another using splice; there are a lot of ways you can get it wrong. Once you get the pattern right, you stick to it. They're not entry-level system calls, but if you're interested, I encourage you to look further into them. One note: sendfile has been re-implemented inside the kernel using splice, so sendfile is now essentially a wrapper around a bunch of internal splice calls.

Okay, so that's splice and vmsplice. Now to that recurring theme, the kinds of problems we see our customers dealing with in the embedded space. You don't just put a CPU in an FPGA and then run a standard set of peripherals, Ethernet, serial, maybe a display, because if your problem could be solved with such a standard, conventional set of interface blocks, you'd just buy a BeagleBoard or get a chipset off the shelf. There's always a reason you're using an FPGA: something about what you're doing profits from that FPGA fabric. A vanilla embedded Linux system in an FPGA is going to be slower, hotter, and more expensive than an off-the-shelf chipset solution. So everyone who talks to us has come down this path by one means or another. And one of the themes that keeps occurring, and we're busy doing the same design for a different customer right now, is that you've got some IP core in the FPGA sampling data from the real world. The first example, the customer project that drove this, was an X-ray dental camera capturing really large images, about 48 megabytes each, which is pretty large: has anyone ever tried allocating a 48-megabyte physically contiguous buffer in the Linux kernel? Let's talk later; we had to hack nastily to do it. And we're doing another one at the moment dealing with weather radar images. What these IP cores do is talk to the sensor and DMA into main memory: you allocate a buffer and tell the core, dump your image data here, with maybe some processing happening on the way.
Often, all this little embedded Linux app needs to do is push that data out over a UDP socket to a client PC that does the actual display or post-processing. In this case our Linux app has no real interest in the content of the data; we just want to get it from a DMA buffer and out over Ethernet. And of course this is why splice makes a lot of sense to me.

Solution one, your undergraduate solution, your first-or-second-year solution: we implement some kind of device driver around our custom IP, just a character device, and implement read and write file operations on it. We open a socket, we open our device, and we sit in a loop reading from the device into a buffer and writing the buffer to the socket. That works, it takes five minutes, and it gets the data from A to B. Monster fail, though. Monster, monster fail on a small embedded architecture. You might get away with it elsewhere; we made you guys implement exactly this in operating systems, write a file copy using system calls, and everyone does read/write, and on a lot of workloads and architectures you never notice what a bad solution it is. But there's a pair of functions in the kernel, copy_to_user and copy_from_user, that move data between user space and kernel space, and this approach touches every byte multiple times. Our app doesn't care about the data; we don't need to see it. And remember, I'm doing this on a sub-100-megahertz CPU, where 50 megabits per second out of that Ethernet port is a good result at the best of times. Let's not condemn the CPU to spending its entire life effectively doing memcpy; that's really not a good way to go.

Solution two: implement a character device driver. UIO, actually; we use UIO a lot. Anyone here used UIO? Yeah. It's the business for embedded, particularly when you're dealing with people who aren't kernel experts, because you can do a lot of device driver work from user space. UIO has a generic mmap implementation, so you can expose multiple memory regions: one for the device's control registers and another for the DMA buffers the data lands in. You can do the same with a custom driver by implementing the mmap file operation. What mmap gives you is a direct mapping of the device into the user application's address space, returning a pointer you can dereference. So here's solution two: we open our socket, we open our device, we mmap the device. Now we've got a user-space pointer, and the kernel has set up the MMU so that accesses through that pointer go straight to the DMA buffers. We can issue a write system call on the socket, passing that mmap'd pointer and the number of bytes. This is actually simpler than the previous one; we're not reading, so we've cut out one half of the problem. It's a reasonable way to go and it performs pretty well, a lot more intelligent than read/write. But there is still at least one copy on the outward path, inside the write system call.
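[A sketch of this mmap-and-write approach, under assumed names: the device path, image size, and already-connected socket are illustrative, and the UIO region layout is elided.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Solution two, roughly: mmap the driver-exposed DMA buffer and write()
 * it to the socket. One copy remains, inside write(). A real UDP sender
 * would loop over datagram-sized chunks; that's shortened here. */
static int send_image_mmap(int sock_fd, const char *dev_path, size_t image_size)
{
    int dev_fd = open(dev_path, O_RDWR);
    if (dev_fd < 0)
        return -1;

    /* User-space pointer straight onto the DMA buffer: no read() copy. */
    void *buf = mmap(NULL, image_size, PROT_READ, MAP_SHARED, dev_fd, 0);
    if (buf == MAP_FAILED) {
        close(dev_fd);
        return -1;
    }

    ssize_t n = write(sock_fd, buf, image_size);  /* still copies once */

    munmap(buf, image_size);
    close(dev_fd);
    return n == (ssize_t)image_size ? 0 : -1;
}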
A copy_from_user is going to happen on that pointer inside write, to get the data into the kernel network stack, into an skbuff and all that jazz, on its way out. So we've still got some pain there that we could potentially avoid. I'll stop doing the big setup and just show you the money. We open our socket; that's where we're sending the data. We open our device. We need a pipe, so we create a pipe pair: you can write on one end and read on the other. That creates a pipe buffer inside the kernel and gives us two file descriptors, one for each end. Then, in our loop, we memory-map our device file descriptor, we create an iovec, because that's how we've got to talk to vmsplice, and here we go: we vmsplice that area of memory into the pipe, and then we splice the other end of the pipe out onto a file descriptor, onto the socket. So if you can imagine, here's our pipe buffer: we're splicing user memory in, and out to the socket on the way out. And here are those flags I was talking about, SPLICE_F_GIFT, your promise to the kernel. I am rushing a bit because I'm running out of time, but we can come back to this; we should have ten minutes or so for questions.

There's no copy_to_user or copy_from_user at all in this flow. There is still a copy going on, which I'll talk about in a minute, but it's not the application's fault, it's the kernel's fault. SPLICE_F_GIFT is a promise to the kernel saying I will never again touch this memory mapping, and that is why the mmap is inside the loop and we never call munmap. Maybe we should; a splice expert might be able to put me straight there. But my understanding is that because we've gifted these pages to the kernel, we're done with them, we never touch them again, and that means we have to mmap afresh every time through the loop. SPLICE_F_MOVE is a hint to the kernel saying please, please, please don't copy this data, just link it into the pipe buffer and let us perform reasonably. Again, someone more expert on splice, and it's a shame Linus isn't here because he could probably set me straight very quickly: there were some mumblings about SPLICE_F_MOVE being completely ignored, and there are details there I'm not 100% across, but we don't have time to go into that.

So what happened? In this customer system we were on something like a 75 megahertz CPU clock, and they had this 43-megabyte, multi-megapixel X-ray camera image that they wanted to get from the device to a PC over UDP in under 10 seconds. That sounds trivial: ha, PC, server, no problem. We didn't even bother testing read and write; I don't have the numbers for that. Using mmap and write took 11.1 seconds, which was pretty good. Using the vmsplice/splice combination, essentially the code structure I just showed you, got down to about 9.6 seconds. That's a win: our performance goal was 10 seconds, so we're already there. And if we hack on the size of the internal pipe buffer in the kernel, we can get that down even less.
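[A hedged reconstruction of the loop just described: a sketch of the structure, not the customer's code. The chunk size, offsets, and device mmap semantics are assumptions, and real code needs more error handling.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)  /* illustrative: matches the default pipe size */

/* Stream `total` bytes of the device's DMA buffer out sock_fd by gifting
 * freshly mmap'd pages to the kernel and splicing them to the socket. */
static int stream_dma_buffer(int sock_fd, int dev_fd, size_t total)
{
    int pfd[2];
    if (pipe(pfd) < 0)
        return -1;

    for (size_t off = 0; off < total; off += CHUNK) {
        /* mmap inside the loop: SPLICE_F_GIFT means these pages belong
         * to the kernel afterwards, so we never touch or munmap them. */
        void *p = mmap(NULL, CHUNK, PROT_READ, MAP_SHARED, dev_fd, (off_t)off);
        if (p == MAP_FAILED)
            goto fail;

        struct iovec iov = { .iov_base = p, .iov_len = CHUNK };

        /* User pages -> pipe buffer, ideally linked rather than copied. */
        if (vmsplice(pfd[1], &iov, 1, SPLICE_F_GIFT) < 0)
            goto fail;

        /* Pipe buffer -> socket, hinting the kernel to move, not copy. */
        for (ssize_t left = CHUNK; left > 0; ) {
            ssize_t n = splice(pfd[0], NULL, sock_fd, NULL, (size_t)left,
                               SPLICE_F_MOVE);
            if (n <= 0)
                goto fail;
            left -= n;
        }
    }
    close(pfd[0]);
    close(pfd[1]);
    return 0;
fail:
    close(pfd[0]);
    close(pfd[1]);
    return -1;
}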
And the reason that helps is, of course, all this overhead: we're calling splice as a system call every time around the loop, with context-switch overhead and so on, so a bigger pipe buffer means fewer trips around it. I erroneously thought there was a sysctl that could set the pipe buffer size, but I believe I'm wrong, because the developer of mine who actually implemented all this said, nah, we just had to hack the kernel, change a #define somewhere in there. So we met the target, the customer's happy, and we're happy because we got to solve an interesting problem in what I thought was a fairly interesting way.

Just lastly, and then we'll open the floor for questions and dissing. We did an experiment last week in the lead-up to this talk; this is a nice thing about FPGAs. We put a digital trace on the CPU itself, found the physical address of this DMA buffer, and then watched the data bus in real time with a logic analyzer to see if that physical address ever appeared on the bus. Is this physical address ever getting touched again? We were hoping the answer would be no, because the Ethernet controller we're using has a full scatter-gather DMA engine, so in theory it should be able to pick this packet up from memory and push it straight out over the wire. But the trigger fired. Somewhere in the outbound network path, the CPU is touching this buffer again. We've got to find out why, and whether it's a badly written Ethernet driver or something intrinsic to the networking stack. I reckon we could get those numbers down again, actually. Then we'd have this beautiful, elegant solution where one device DMAs the data into memory, Linux and the app just stand there directing traffic without touching it at all, and it gets DMA'd straight back out the other end. Maybe next year I'll come back and tell you if we found it. My sprint is over. We've got about five minutes for questions, so let's open it up.

A couple of questions. One, just off the top of your head, what was the read/write time for that kind of...?

Oh, Julius, we didn't even do it; it's just orders of magnitude or something. Look, it was probably up around the 20-to-30-second mark.

Oh, okay. So, yeah. Yeah, but that's pretty good. The other one, and maybe we can talk about this after, I'm not sure how appropriate this is, but I'm just curious to know as another embedded developer: you have a pretty intimate knowledge of what's going on under the hood there in the kernel...

I'll just stop you: this is Julius Baxter, Mr. OpenRISC, so if you're interested in that, go and hassle him. Yes, sorry, continue.

Yeah, I'm just wondering how you traced everything to figure out where the bottlenecks were and what was going on. That was an interesting anecdote about how you spied on the data bus to check if an address was occurring again. What kind of tools did you use to diagnose the issue and really find out what was going on underneath the hood?

I can answer that relatively succinctly, I hope. It was really just forethought and intuition that told us read/write was going to fail, so it was pretty clear we needed something like this. Initially I thought sendfile, but then I realized we couldn't use sendfile.
There was some reason we couldn't use sendfile; I think it can only move data between regular files and sockets, or something like that. Anyway, we came pretty quickly to this being the right way to solve it architecturally. The tool we used to do the bus tracing is a Xilinx tool called ChipScope. It essentially injects a logic analyzer inside the FPGA, and it can watch any signal in real time and trigger on any condition. You can use it to watch the world go by a cycle at a time, on the data bus, in the CPU, in a device. Once you master this tool, you have this exquisite visibility into the nuts and bolts of what's going on in the system. So, it was ChipScope. Some other questions? Yes?

So, you're using MicroBlaze on the FPGA?

Pardon? You're using MicroBlaze? Yes, that's right.

And so, why did you use a Linux kernel? Why use a kernel, an operating system, at all?

Right, because of all the stuff it gives you that you don't have to re-implement: the network stack, web servers; they've got basic authentication requirements and login and so on on the device. What we find is that people sometimes start out with these bare-metal firmware architectures. They go: oh, it's fine, I'll just write some firmware, sit in a tight loop, and use something like lwIP, the lightweight IP stack. They implement it, they get a UDP stream going out, and then the project manager, or in fact marketing, comes to them and says: the competitor's product has a web interface for configuration and management, can you do that? So they try to bolt a TCP listener on port 80 onto this thing that was optimized around one task. Then marketing comes back and says: oh, we also want to be able to control this with SNMP. We see this happen a lot. With Linux, of course, that's trivial: oh, you want SNMP? Okay, fine, build it and install it. What it does mean is that you have to do stuff like this to get around some of the performance impacts of all that abstraction. So it's usually about reusing that massive ecosystem of management middleware. That's middleware, I suppose.

Sure, thank you. David, you're up.

Why did you get userland involved?

Why did we get userland involved? Oh, good question. The non-technical answer is that we did this as a prototype, which is something we do for a lot of our customers: often it's their first play in the FPGA space, often their first play with embedded Linux. We do these prototype implementations, hand them off, and the customer goes and builds a real product on that framework. And actually, we really don't want newbie customers in the kernel, because we can't support them as soon as they start hacking on it. When they run into trouble, we can't debug it without using their hacked-up kernel, and that costs them money. So part of it's about simplicity. And how would you solve it in the kernel anyway? I don't really like doing I/O in the kernel. Ah, there we go: make your own system call. That's why we did it in user space, because making your own system call is such an innocent and tempting thing to do.
But the kernel rolls on without you, so you've got to maintain that forever. I'm sure there would be lower-overhead ways of solving this in the kernel, though. We have time for one more question. Okay.

Hi. I know you've already achieved your goal of under 10 seconds. I was at OSDI last year, and there was a paper about batching system calls. I was just wondering: how often does that loop go around? Is it page-sized, or...?

It's not quite page-sized; it's actually limited by the iovec, which has a maximum number of entries. Batching system calls would be a potentially quite interesting thing to do here, because, yeah, we are suffering a bit of system call and context-switch overhead. Well, there are three system calls in a row, right? Right, exactly. And the mmap inside the loop as well. Yeah, it's a good observation; that would be quite an interesting thing to look at.

I'm afraid we'll have to end there. Okay. I hardly expected him to manage to fit the whole talk in that quickly. Aren't you glad I didn't have to do this in 25 minutes? The last talk of his I went to was supposed to go for an hour and went for two and a half. But anyway, as a show of appreciation from LCA, we have this little present. Thanks very much, Michael. Thanks, John. Thank you. Thanks very much.