Welcome everybody. I'd like to introduce Alexander Nasonov, and he's going to talk about the just-in-time code generator inside the NetBSD kernel — something we don't even have for all architectures in Firefox. Hi. So yeah, I'm Alexander. I'm going to sit during my talk because I have a slight problem with my leg. This is mostly a talk about the development of this feature, so it's mostly for developers. Even though the feature lives inside the kernel, I will show you not only C code but also Lua code, which can run in the kernel and opens up new possibilities. I only started using Lua while preparing for this talk, because I needed some graphs, and the easiest way is to generate graphs rather than draw them manually — and I've got quite a few of them. So let's start. This project started around Christmas time: before Christmas I did some research, I decided to start working on the project on the 26th of December, and I created a repository on GitHub. It's still there, but at the moment it's on hold. I switched to working in the NetBSD tree, and one year later I added the code to NetBSD, but that was just the beginning. Since then I've done a couple more iterations, including some extensions to the BPF language for our new packet filter, requested by Mindaugas. It's still work in progress, and while preparing this presentation I discovered how cool it is to use Lua, so at least for prototyping I'm going to switch to Lua — you'll see. So I'll start with the basics. I assume everyone understands what BPF is. To enable the JIT, if you run a modular kernel — which you most likely do on Intel, unless you compiled a monolithic kernel yourself — you just load the BPF JIT module, which automatically loads the sljit module.
sljit is the library which I use for code generation — I'll talk about it in a moment. Then you just enable the JIT through sysctl. For a monolithic kernel you need to compile with the SLJIT and BPFJIT options, and after booting that kernel you still need to turn it on. To use it, you just run tcpdump with your filter expression, and then you can check whether it's compiled or not: use fstat and grep for JIT, and if you see a line, your program running in the kernel is JITed. One thing to note here: you can turn the BPF JIT on and off at any time, but it doesn't affect already running programs. If some programs were not compiled, they will not be compiled when you turn the JIT on; they will continue running interpreted until you restart them — if you stop tcpdump and start it again, the filter will be compiled. So why am I doing this? First of all, it was for fun, and there are implementations in other operating systems and I thought it would be nice to catch up and have our own JIT BPF interpreter — sorry, not interpreter, a compiler. This slide was supposed to appear one line at a time, but I didn't manage to set that up. So yeah, good news: BPF is fast, because the BPF interpreter is a small, compact interpreter — but the JIT-compiled code is several times faster. I didn't do a lot of performance measurements; it's clear that compiled code is faster, and for one particular short program it was roughly four times faster, both on my AMD box and on ARM. The AMD box was running NetBSD; the ARM machine was on Linux, because when I started I used NetBSD at home and my ARM Chromebook on the go — I mostly did the coding on the train using that Chromebook, and it runs Linux. Which is not actually unfortunate, because it kept the code platform-independent — well, not fully independent, but it works on Linux, and I'm pretty sure it should be straightforward to port it to FreeBSD.
Yeah, it was actually mostly user-space code at first; I later adapted it to kernel space. And because I'm using the sljit library — I'll describe what sljit is — there is a small overhead, especially for short programs. I think most of the overhead is in the prologue and epilogue of the generated function: it needs to save some registers when you call a generated function, and there's a special ABI of sorts for passing in arguments and all that. So I'll briefly discuss what BPF is — I'm pretty sure everyone is familiar with it. Basically it's a raw interface to the kernel, to do packet filtering in the kernel, and it comes with a compact machine language — not for real hardware, but for a simple virtual machine. It's usually wrapped in the pcap library, and that's what people normally use, but you don't have to: you can talk directly to the BPF device. When you run tcpdump, it compiles the program from the high-level filter language to the low-level machine language and sends it to the kernel through the raw interface, and the kernel usually interprets it — but it can be compiled. As I said, it's very simple. From the outside it looks like a function written in a special assembly which you call: it has a single entry, it doesn't do nested calls, and it has no side effects — all you get back is the return value, a 32-bit unsigned integer. It has two registers, A and X: A is the main register, X is the auxiliary, index register, and it comes with a small scratch memory of sixteen 32-bit words. And it has simple instructions: add, multiply, bit operations. There is one exception to this simplicity, because the original BPF was interpreted, and the authors figured that if they combined two operations into one instruction, they would get a speedup for one very common operation.
So there are no backward jumps — it always jumps forward — and therefore there are no loops. This was done for security, because you don't want to be able to hang the system from user space. No matter what, a program must finish in a finite number of steps, and there is also a limit on how long a program can be, so it indeed finishes in quite a small number of instructions. And because it's filtering, everything revolves around loading from the packet, and you can load a byte, a half word, or a 32-bit word. There are some examples here. For indexed loads — sometimes you need to load at an offset that's not known in advance — you use the X register: "X plus nine" means take the content of X, add nine, and load at that offset. A while ago I also asked to disallow wraparound. I think everyone knows what wraparound is: unsigned arithmetic has no overflow, it's modular arithmetic, that's why I call it wraparound. So if you want to load at offset X minus one, you cannot express it as X plus 0xFFFFFFFF, because that would rely on wraparound, and it's not what BPF programs do. To achieve the same effect, they would subtract one from X first and then load at offset X. Also, by request of Mindaugas I added two extensions, but these two extensions are available only inside the kernel, and they introduce a very significant change, at least for the compiler, because coprocessor functions and external memory both have side effects. It's no longer "the return value is all we get": you can modify external memory, or call an external function which can modify something outside of BPF.
But this is strictly limited to the kernel, because with coprocessor functions especially you can do a lot of stuff — if you could do it from user space, you could break things easily. Okay, let's take an example. It's not very visible — this is the smallest filter program. First of all, it's a very simple rule for TCP/IP: you want to see IP packets. It loads a half word at offset 12 and does a comparison: it checks the protocol field and compares it with the EtherType value for IP. If it's IP, it returns USHRT_MAX; otherwise it returns zero. And it's very common for the last instruction to be "return zero" and the one before it "return USHRT_MAX": a filter program normally ends by either rejecting or accepting, and this is how you say accept or reject. One thing I forgot to mention: in the rounded rectangle on the right side you see the minimum packet length. To be able to load a half word at offset 12, you need a packet of length 14 or more, and if it's shorter, there is an implicit "return zero". There is always an implicit "return zero" if your packet is too short, and I put 14 in the box — that means you need a packet at least 14 bytes long. The next one is a slightly more complicated filter, which corresponds to ICMP. The thing to note here is that there are two loads: the first at offset 12, and the second a byte at offset 23. And this is actually a pattern: you see accesses to packet bytes at increasing offsets.
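The minimal "accept IP" filter described above can be sketched as plain C — this is a hand translation, not the actual JIT output, with the implicit length check made explicit:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hand-translated sketch of the smallest filter program:
 *   ldh [12]              ; load EtherType (big-endian half word)
 *   jeq #0x0800, L1, L2   ; is it IPv4?
 *   L1: ret #USHRT_MAX    ; accept
 *   L2: ret #0            ; reject
 * The length guard reproduces BPF's implicit "return zero" for
 * packets shorter than 14 bytes. */
static uint32_t accept_ip(const uint8_t *pkt, uint32_t len)
{
    if (len < 14)                                      /* implicit ret #0 */
        return 0;
    uint32_t a = ((uint32_t)pkt[12] << 8) | pkt[13];   /* ldh [12] */
    return (a == 0x0800) ? USHRT_MAX : 0;
}
```

The accept value USHRT_MAX (65535) is the conventional "snap the whole packet" return used by tcpdump-style filters.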
And it's typical, because as you go through the protocol layers, the protocol headers come at increasing offsets — most of the time, not always, but most of the time — and that can actually help with some optimizations. Again, you see "return zero" and "return USHRT_MAX" on this slide, and this one has two implicit return-zero checks. The next one is for a particular ICMP type, ICMP echo request. Again it's load, branch, load, branch, at mostly increasing offsets. And this one introduces a new instruction: a load at an offset indexed by X. Because there is no wraparound, I know the packet should be at least 15 bytes here, and somewhere at the top you see it actually needs more than 15 — it needs 24. But the main thing is that there are many fallbacks to "return zero", both implicit and explicit. The explicit ones are: check a field, and if it's not what you want, fall back to "return zero"; otherwise continue down, until you reach either accept or reject. And this one has one, two, three, four, five implicit returns. This one actually needs two checks, because X plus 14 can itself wrap around, so there's an additional check: one that X plus 14 doesn't wrap, and a second on the result of X plus 14 plus one byte — that must not wrap either, because you don't know the value of X in advance.
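The pair of guards for an indexed load can be sketched like this — an illustrative C rendering of what the generated code has to check, not the actual bpfjit output:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the guards for the BPF load "A = pkt[X + 14]" when X is not
 * known in advance: X + 14 must not wrap around, the end of the loaded
 * byte must not wrap either, and the byte must lie inside the packet.
 * Returning 0 mirrors BPF's implicit "return zero" for bad loads. */
static int load_byte_x14(const uint8_t *pkt, uint32_t len,
                         uint32_t x, uint32_t *out)
{
    uint32_t off = x + 14;
    if (off < x)            /* X + 14 wrapped around */
        return 0;
    if (off + 1 < off)      /* end of the byte wrapped around */
        return 0;
    if (off + 1 > len)      /* byte is past the end of the packet */
        return 0;
    *out = pkt[off];
    return 1;
}
```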
And this is the exception to BPF's simple RISC-like rule: an instruction to quickly load the IP header length. It loads the byte at offset 14, ANDs it with 0xf, and multiplies by four — a couple of simple operations folded into a single instruction. So, enough about filter programs; I'm going to talk about sljit, what it is and how it works. Basically, sljit is a stackless JIT compiler. It's BSD-licensed and it supports multiple architectures: Intel x86 in both 32- and 64-bit flavors, several versions of 32-bit ARM plus 64-bit ARM, PowerPC, MIPS, MIPS64, and 32-bit SPARC. It doesn't have SPARC64 support yet. Someone contributed the TILE-Gx port; everything except that port was written by Zoltan, and I work with him on some new features and improvements. I'm going to talk about the new, work-in-progress version of sljit. Zoltan made a lot of changes: he renamed registers, he renamed instructions — he keeps thinking about the project and making changes. The new version will not be compatible, but once he finishes it, at some point I will import it into NetBSD and switch to it. "Stackless" means it doesn't use the stack for temporaries when it emulates instructions. It's like an assembler for some strange architecture, where each assembly instruction is actually an API function call: if you want to append an instruction to your stream, you make a call. It has registers which are somehow mapped to native registers, but the mappings differ between architectures. You have up to ten scratch and up to ten saved registers; they share the same pool of real hardware registers, approached from opposite ends, and in total there are at least eight registers available.
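That IP-header-length instruction (written `ldxb 4*([14]&0xf)` in bpf assembly) is simple enough to show directly; a one-line C rendering of its semantics:

```c
#include <assert.h>
#include <stdint.h>

/* The combined BPF instruction for loading the IP header length:
 * load the byte at offset k, mask the low nibble (the IHL field),
 * and multiply by four — three operations folded into one. */
static uint32_t ldx_msh(const uint8_t *pkt, uint32_t k)
{
    return (uint32_t)(pkt[k] & 0x0f) << 2;
}
```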
Some of those registers are emulated, which means sljit uses the stack to emulate registers — I think that's only on x86, which doesn't have enough registers; all the other architectures have enough registers available from the pool. You also have access to the stack via the SP register. The SLJIT_SP constant is really just the name of a register — all registers are named SLJIT_R0, SLJIT_S0 and so on — and you can also emit an instruction to load the stack pointer into a register of your choice: you can say "I want the stack pointer in R2", something like that. And when you generate a function, you can say "I want this many bytes available on my stack", and they will be available to you. So "stackless" refers to emulating instructions, not registers — when sljit emulates registers it has no choice but to use the stack. And it quite often emulates instructions: for example, sljit has three-operand instructions, where the destination register is different from the two source registers, and not all architectures and not all instructions support three operands, so those will be emulated using additional instructions. sljit also has labels and jumps, and they are objects, which live as long as the parent compiler object is alive. And "JIT", just in time, I think means not only that you can generate code on the fly, but also that you can patch your code on the fly: if you use special flags to mark rewritable jumps and rewritable constants, you can update them after you generate the code. The main instruction is sljit MOV: it moves data between registers, and between registers and memory.
Obviously there are loads and stores of different widths: byte, and "half" — it's not called half word, because a "word" in sljit can be 32-bit or 64-bit, so it's just half, meaning 16 bits — and also a 32-bit integer, which is called "int". The word width is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms. And there are different addressing modes. You can load a constant — and this constant is not a real compile-time constant; it can come from a variable in your host C code, passed in while you generate the code. It doesn't change while your generated code is executing, so it is constant — it's just bound at run time, not at link time. You can also do register plus offset, and register plus another register scaled by two, four, or eight on 64-bit platforms. And you can request 32-bit mode on 64-bit platforms, which is particularly important for bpfjit, because all BPF arithmetic is 32-bit. On 64-bit Intel there is RAX but also EAX, so you can work with 32-bit numbers — you need to mark all operations with this flag to make sure they will be 32-bit. Three-operand instructions I already explained, and they are often emulated. sljit also has double- and single-precision floating point; I disabled it in the kernel, because we don't normally use floating point in the kernel. And there are some limitations, the biggest being that you can only call external functions with up to three arguments. This limitation comes from the fact that sljit works on many platforms, and a limitation of a single platform limits all the other architectures, because you basically write one piece of code and it runs everywhere.
That's not quite correct — it's more like: write the code once, test it twice, then run it everywhere, because you need to test on at least one 32-bit platform and at least one 64-bit platform. And this is how it actually worked for me. When I did the first version, I was surprised there was a bug and it didn't run on the first try, but when I fixed the bug, it worked; then I got home, switched from 32-bit to 64-bit, and again it was mostly working — I fixed one or two more bugs, and it was done. There is a special mechanism for doing fast calls, but it's very specific to sljit and I don't use it, and there are some other features I don't use in the bpfjit code generator. So I'm going to give you an example — is it still readable? This is fast 32-bit division, and the algorithm comes from NetBSD: NetBSD has fast_divide32_prepare and fast_divide32, though fast_divide32 has a different prototype — I'll explain shortly. Basically, it replaces a 32-bit division with a 64-bit multiplication by a magic number — m, a variable here — and two shifts. The real fast_divide32 in NetBSD is similar to what you see here, but it passes all four arguments by value, while I pass only a single value, because I'm going to show you some assembler — you probably won't be able to read it, the screen is too small — and if I pass the other values as globals, it's easier to compare the sljit-generated code with what GCC generates. This part is important: it's not C code, it's Lua code, and I wrote it while preparing for the talk — I should zoom in a little. This Lua code gets called from the host C program.
Okay, it gets called from the host C program — the Lua is embedded, you call it from C, and from C you pass these three integers. They're accessible via the special Lua syntax for assigning varargs to multiple variables: `...` represents all values passed to your function, and the whole chunk of code is represented as a function, which I think is very common in scripting languages. Then I create a compiler object, and I use chaining: this syntax means calling a method of the object, and each method returns the object itself, so it's more convenient to chain the calls. So, first of all, I enter a function which accepts one argument and uses one saved register. "Saved" is the number of registers that are preserved across calls: if the generated function makes a function call, the saved registers will have the same value after the call returns. You also need saved registers for arguments, so if there is one argument, there is always at least one saved register — and you need one scratch register. So "one saved" is S0, this one, and "one scratch" is R0. It's a bit messy here; maybe I'll come back to it later. So this generates the function prologue, and then — I forgot to say this on the previous slide — the pattern is a binary operation followed by a shift, done three times. The first pair is in 64-bit mode: you generate a multiplication, the destination is R0, the scratch register, and you multiply S0 by the immediate value m, which comes from your host C program.
And this one is often emulated, because it's a three-operand instruction. Then you shift: shift by the immediate value 32, with the result going into the same register, R0. The following instructions are similar, but the main difference is the flag meaning 32-bit mode — the rest works in 32-bit mode, because you want 32-bit logic. The last pair is very similar: binary operation, shift, binary operation, shift. The final call means "I'm returning from this function": it generates an instruction to return and moves the content of register R0 into the return register — R0 is the return register on all sljit-supported architectures. And the move here is a 32-bit unsigned integer move. It probably doesn't matter in this case, because the initial 64-bit shift by 32 already zeroed out the high word — it's just to demonstrate that, if you want, you can mask out the high bits and return a 32-bit value on 64-bit platforms. So that's how much code you need to generate fast division, and this corresponds to the real BPF DIV instruction. It's not in my code yet, because in C it's quite a bit bigger than this — I'll show you in a second — and division is not a very common instruction in BPF programs, because it's slow and I think everyone avoids it. Sorry, I need to undo first. And this is how the corresponding C code looks: it's not all the instructions from the left side, just about three of them, I think. The first one creates the compiler object, and in C you have error checking — Lua just throws, and assuming you run your Lua code inside a protected call, the C side catches the error and reports it somehow; in this case it shouldn't report any error.
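Why the 32-bit-mode flag matters can be seen with plain C arithmetic — an illustrative comparison (these helper names are made up, not sljit API): BPF semantics are the truncated, modulo-2^32 result, not the full-width one.

```c
#include <assert.h>
#include <stdint.h>

/* BPF registers are 32-bit, but on a 64-bit host they live in 64-bit
 * native registers. An unflagged full-width operation and the required
 * 32-bit (modular) operation disagree as soon as the result overflows
 * 32 bits — which is why every emitted BPF operation carries the
 * 32-bit-mode flag. */
static uint64_t add_fullwidth(uint64_t a, uint64_t b)
{
    return a + b;                 /* what a plain 64-bit ADD computes */
}

static uint32_t add_bpf(uint64_t a, uint64_t b)
{
    return (uint32_t)(a + b);     /* what BPF requires: modulo 2^32 */
}
```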
You have full control of this code from C. The next call is the same story: all my Lua functions correspond one-to-one to the sljit functions, just without the sljit underscore prefix. And this is essentially the same, except that here you pass the arguments positionally — in Lua they're like named arguments. The first zero is the options argument, which is always zero and not present in the Lua version; this three corresponds to that three; and these three are the same but for floating-point registers — I don't use any floating-point registers. I also check the error, although in sljit you don't have to check after each function call — you can do it once in a while, because the compiler remembers the last error. It's good to detect errors early, but it adds quite a bit to the code size, and at some point the code becomes unreadable — there's just this much of it. Also, this version is for 64-bit platforms, because it uses the word-width multiplication; on 32-bit platforms you'd need a double-width multiplication, because you need a 64-bit multiply here, and the 32-bit case needs different code — so you need an if/else here, and similarly there. There are also special cases, like one of the shifts being zero, or dividing by a power of two, and at some point it becomes quite difficult to track where you are in the code and what's going on. It's a bit easier when you do it in Lua. So yeah, next slide, I think.
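For reference, the algorithm being generated in this example is essentially NetBSD's fast_divide32 (an inline in <sys/bitops.h>); this is a user-space re-creation of it, so treat the details as a sketch rather than the canonical source:

```c
#include <assert.h>
#include <stdint.h>

/* Re-creation of NetBSD's fast_divide32_prepare()/fast_divide32():
 * division by a fixed divisor becomes one 64-bit multiply by a magic
 * number m plus shifts with a correction step — exactly the
 * mul/shift, sub/shift, add/shift pattern the JIT emits. */
static int fls32_(uint32_t v)            /* highest set bit, 1-based */
{
    int b = 0;
    while (v) { b++; v >>= 1; }
    return b;
}

static void fast_divide32_prepare(uint32_t div, uint32_t *m,
                                  uint8_t *s1, uint8_t *s2)
{
    int l = fls32_(div - 1);
    uint64_t mt = 0x100000000ULL * ((1ULL << l) - div);
    *m  = (uint32_t)(mt / div + 1);
    *s1 = (l > 1) ? 1 : (uint8_t)l;
    *s2 = (l == 0) ? 0 : (uint8_t)(l - 1);
}

static uint32_t fast_divide32(uint32_t v, uint32_t m,
                              uint8_t s1, uint8_t s2)
{
    uint32_t t = (uint32_t)(((uint64_t)v * m) >> 32);  /* mul, shift 32 */
    return (t + ((v - t) >> s1)) >> s2;                /* sub/shift, add/shift */
}
```

The prepare step runs once per divisor; the divide step is the three binary-op/shift pairs that the Lua code on the slide emits.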
So I'll zoom in on each piece of code, and I'm going to compare the sljit output with GCC's. This is the Lua-side code that finishes the Lua implementation. First you push the Lua program onto the Lua stack — it's a single line of code plus two lines of error checking — it just loads that Lua chunk onto the stack, where it's represented as a function. Then you push three integers, and then you call the chunk you pushed as a function with three arguments, and it returns one result, which is the sljit compiler object. Then, using my Lua sljit library, you get the pointer to that object from the top of the stack — index minus one means the top of the stack. All objects created on the Lua side are managed by the garbage collector, so if you create multiple objects — jumps, labels — they are all managed, and as long as the compiler object is somewhere on the Lua stack, so it's reachable from Lua and not dead, you can use it. Then you generate code, and once you generate the code, it is detached from the compiler object, so you can destroy the compiler object; and because everything is managed by Lua, you can just close the Lua state and it will free the memory of all the dependent objects. So it's a bit easier to manage: once you have the bindings — the library binding your objects to Lua — everything is managed automatically by Lua. You do still need to free the generated code at some point later. On the assembler slide I'll show you the content of this function, which is generated on the fly, of course, by sljit.
The next one you've already seen: the pattern is a binary operation followed by a shift, two more times. It's actually multiplication by a scaled reciprocal with a correction step — that's what it is. So this is one particular implementation, and on the right side you see very simple code for the compiler; the compiler is not limited to this particular implementation in the middle, it can try other alternatives. And its output is indeed much, much shorter. This is the compiler-generated code: it doesn't use this algorithm, it uses a single shift and a different magic number. Let me zoom in. The real difference is that the sljit code is a bit bigger, because of the push and the work with the stack pointer — the real machine stack pointer — and I think this instruction is moving our saved register somewhere to save it, which also has some overhead. And this move comes from returning a 32-bit value — remember, we did that special MOV_UI when returning 32 bits; I think you can get rid of it by just not emitting that instruction. Overall it's similar, just a bit shorter on the GCC side, because you don't have to load the constant from global memory: here you're shifting by one and there by four, using immediate constants, and for the multiplication it's three instructions — load your constant, move the saved register somewhere (because it's the three-operand form), and do the unsigned multiply. So that's how it looks, and it's write once — well, in some cases you need two different implementations — but okay, it usually works everywhere. I'm running a bit over time. By the way, all the graphs were
drawn using Lua and graphviz — that's why I have so many graphs, because it's so easy to draw them, and I did all of this in just two evenings, I think. All right, I can switch back to this one. So, BPF optimizations. This is running in the kernel, and you need to be more paranoid about doing optimizations in the kernel. So what I do is assume that a BPF program coming from user space is already optimized — for example, if it comes from the libpcap library, it is already optimized. So if I look at some particular program and see it moving a value between A and X ten times when in the end it's just a single move, I don't optimize that, because it can be done in user space: just don't move stuff unnecessarily in the first place. I have some exceptions, but they come when I need something and the optimization is free, or mostly free — then I add it. Removing unreachable instructions is a natural step when you're doing the initial flow analysis — it's natural to see whether an instruction is reachable or not — but in a normal program you don't see unreachable instructions. Also, A and X might be used uninitialized, and in those cases I set them to zero. Actually, if you don't initialize them explicitly in your filter program, the kernel will not accept it, because it will fail the validation step — so I don't have to do this, but I still do it as a safety net, because the code is not limited to NetBSD, and I have to be careful that every platform does this validation. And I do a fixed number of passes over the filter program, to prevent any infinite iteration while trying to find an optimal solution — a fixed number of passes, and that's it.
So yeah, I implement the trivial hints — I think everyone else implements them too — like: if X is not used anywhere in your program, then don't use it, and you save some instructions in the prologue and epilogue of the generated function. Similarly, finding uses before initialization: again, such programs will not pass validation, but I still check. The main optimization is array bounds check elimination, and I do it in two passes. I don't have time to explain it in detail, but I'll show you the results. It applies to packet reads, and because filter programs often read bytes at increasing offsets, you can optimize away some checks: if the program is going to read the byte at offset 23 anyway, then you can just check once at the beginning. Again, this one is not very visible; I can zoom in. So basically this load needs 14 bytes and this one needs 24, and I can just move the check up here and not do this check at all. This path needs 14 bytes, then 24, 22, 15, and some unknown offset, but basically it becomes a single check here: if the packet is 24 bytes or more, go on; otherwise it's too short, go straight to "return zero". You just have one additional check here, so that's the two checks you have — one check here, and only one instead of two checks there, and no checks for these. So it actually eliminated four checks: one here and three of those. That's what my optimization does, and I implemented it in Lua. I don't have time to explain the execution trees and how my algorithm works, but I need to tell you one thing. When you look at this — okay, I'm running out of time, all right.
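A toy illustration of the idea — not the real two-pass algorithm from the talk — for the simplest case, a straight-line run of packet loads: the largest offset-plus-width read in the run gives a single up-front length check that covers every load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy sketch of bounds-check hoisting for a straight-line run of
 * packet loads. Each load reads `width` bytes at `offset`; one
 * up-front check against the largest offset+width makes all the
 * per-load checks redundant. */
struct load { uint32_t offset; uint32_t width; };

static uint32_t min_packet_length(const struct load *loads, size_t n)
{
    uint32_t need = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t end = loads[i].offset + loads[i].width;
        if (end > need)
            need = end;
    }
    return need;
}
```

For the ICMP example above — a half word at 12 and a byte at 23 — the single check is "length at least 24", exactly the number in the slide.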
These paths always fall back to "return zero", and there are no reads at all in that branch. Does that mean you need to disable the optimization, because in one part the requirement is 24 bytes but in the other part it's just zero? No — because loads at too-high offsets return zero anyway, and there are no side effects in BPF programs, you can replace "return zero" with a load of a packet byte — or in this case a word — at an effectively infinite offset. In reality it's UINT32_MAX plus four, but it's higher than any real offset, and this load is indistinguishable: it will return zero anyway. What was previously "return zero" is now a load at this ridiculous offset, because both return zero and there are no side effects. And that's what you have in the end: one check here, and one check here instead of two, and all those are optimized away. If you changed this to "return zero" instead of "return USHRT_MAX", the whole program would collapse — but I don't do that optimization — and if this were part of a bigger program, I could apply some optimizations to the other part: if this fragment uses something but that one doesn't, I can take advantage of it. But you don't see that in practice — there is no unreachable code in real BPF programs that you receive from user space. I think I don't have time for future optimizations, or for the mbuf handling. For testing I use ATF; it's very modular, and every time I added a feature, I just ran it through the tests. The tests cover both user space and the kernel, and there are some differences in the kernel because of mbufs and mbuf chains — they're a bit different to work with. So yeah, that's it. Thank you. Any questions? [Question] How easy is it to add support for a new architecture?
Well, the TILE-Gx port was done by one person, so I think it should be quite straightforward. It will take some time, and you need someone who knows enough about the architecture — although I think you can just read the instruction format and the ABI specs and do it, even without much prior experience. [From the audience] May I try to answer? I actually looked at adding SPARC64 support, and it's like 500 lines of code. [Answer] Yeah — I asked Zoltan, and he doesn't have time, and he doesn't have the hardware, but I told him hardware shouldn't be a problem: if you want to work on SPARC64, we can donate hardware. I definitely have it on my to-do list, but it's not near the top. Any other questions? Thank you. That's it.