All right, so I'm here to talk about the Antikernel. As was mentioned, this was originally my PhD thesis work, and I've continued it in my spare time since then. I'm going to begin by making a somewhat unusual and radical claim, which is why my talk is in the New Directions section: having any software whatsoever running in ring zero violates least required privilege. The reason is that if you think about all the different subsystems in a modern kernel, which one of those subsystems needs the ability to talk to private registers of hardware, to read stack data from a user space application, to modify the page tables of user space applications, and to write arbitrary data on disk? Collectively, the OS needs all of those capabilities. No one subsystem needs all of them. So the conclusion I draw is that having any software whatsoever running in ring zero violates least required privilege. Microkernels are a good idea, and they certainly improve on the state of the art beyond monolithic kernels, but they still have software running in ring zero. So the question is, can we get below that? Is it possible to have an operating system that does not have any software running in ring zero and still provides all the services we expect from a modern operating system? The one architecture I drew inspiration from was the exokernel. This was from MIT back in the mid 90s, and the exokernel was a performance-optimized system; they were not targeting the security world at all. Their idea was that we have a lot of abstractions in modern operating systems, and these abstractions are often bad for performance. Suppose we have some word processor. We probably want a file system abstraction, and we want some caching on top of it, so that if we read data from a document we've been working with recently, it stays resident and loads faster.
The problem is, what if we have something like a DHCP server? This typically doesn't do a lot of disk I/O; it's just persisting leases to disk so that if the server shuts down and we reboot the machine, we have some ability to remember who had what address. In this case, we don't need a full file system, just some place we can store data. Not only that, we don't necessarily want caching, because we're not going to be reading from it all that often, so why waste cache bandwidth somebody else could use? The exokernel approach is to split resource protection and segmentation from abstraction, such that the driver simply divides a resource into blocks and controls access to those blocks, and everything beyond that is implemented by user space libraries, microkernel-style servers, et cetera. What the original paper missed, because they were trying to solve a performance problem rather than a security problem, was that shrinking the drivers minimizes the attack surface of the driver stack as well. So I started thinking: what does the kernel actually have to do? What are the things we cannot avoid doing outside of a user space process? As far as I can tell, these are: we have to be able to time-share a CPU between processes, which obviously we can't do in the context of one process; we have to provide the processes with some interface to talk to each other and to hardware and drivers; and we need to be able to manage memory at the page level. Everything below page level can be done in libc, just like it's done on Linux or wherever. Then I started looking closely at whether these things, which clearly cannot be done in user space, actually have to be done in ring zero. The first one is time-sharing the CPU. There are existing tools for doing this: we have barrel processors, we have hyperthreading.
So how much more work is it to add a run queue to the CPU, such that we can context switch and manage threading completely without software interaction? In our example here, we've got four hardware threads, one through four. We've got an array of program counters instead of just one, and we've got a banked register file, so we can time-share the CPU between multiple threads with no software involvement. Again, this already exists; this is nothing new. The nice thing is that it's fairly simple, a relatively minimal gate count. If you've already got a superscalar hyperthreaded processor, adding a full run queue to it is basically one FIFO and a couple of gates around it, so very minimal resource overhead. If you go with a straight FIFO instead of something more complicated, it also gives you nice deterministic performance, good for hard real-time systems. The key benefit, though, is that there is no possibility of state corruption going into or out of this system. It is isolated, it does its own thing, and no matter what's going on in the rest of the system on chip, you're still going to have your threads executing in sequence. So now let's start thinking about how applications running on this CPU can talk to hardware. Instead of going with the classic shared bus architecture, I switched to a packet-switched network on chip. Again, these already exist; there's nothing that unusual here. But the nice thing is that if we've got a multi-threaded CPU with hardware threading, the CPU knows which thread is currently executing. So what if, instead of having one network address for the CPU, we give it a whole subnet, with one address for each thread running on that CPU? In our example, we've got two addresses, one each for application one and application two, and we also have one address on the CPU for management itself.
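The run queue described above can be modeled in a few lines. This is a minimal C sketch of the behavior, not the actual Verilog; the thread count and the bubble-on-empty convention are my illustrative assumptions.

```c
/* Software model of the barrel processor's hardware run queue:
 * a FIFO of thread IDs, where every clock cycle the front thread
 * gets one issue slot and then rotates to the back. */

#define NTHREADS 4

typedef struct {
    int q[NTHREADS];      /* circular FIFO of runnable thread IDs */
    int head, count;
} run_queue;

void rq_init(run_queue *rq) {
    rq->head = 0;
    rq->count = NTHREADS;
    for (int i = 0; i < NTHREADS; i++)
        rq->q[i] = i;     /* threads 0..3 start out runnable */
}

/* One "clock cycle": pop the next runnable thread, push it back
 * on the tail. Returns the thread ID granted this issue slot, or
 * -1 (a pipeline bubble) if nothing is runnable. */
int rq_next_cycle(run_queue *rq) {
    if (rq->count == 0)
        return -1;
    int tid = rq->q[rq->head];
    rq->head = (rq->head + 1) % NTHREADS;
    rq->count--;
    rq->q[(rq->head + rq->count) % NTHREADS] = tid;
    rq->count++;
    return tid;
}
```

A straight FIFO like this is what gives the deterministic schedule: with all four threads runnable, each gets exactly every fourth issue slot.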
So now we have an out-of-band interface we can use to talk to the CPU, rather than to software running on the CPU. Just to be clear, I'm using IPv6-style notation only to denote the addressing and subnets; I'm not actually running IP on the interconnect, obviously, it's something optimized for low-level on-chip traffic. The addresses I'm using in the prototype are 16 bits long, so a /16 is one address, a /14 is four addresses, and so on. You can see here we've also got some peripherals hanging off the network: RAM and flash. The CPU is based on the MIPS instruction set, but it is not actually MIPS compatible; I reused enough of MIPS that MIPS GCC can target it with a few flags. This has the nice feature that we have a syscall instruction, and since we're trying to avoid running stuff in ring zero, we should never have a syscall vector, so we can repurpose the syscall instruction to send messages. I went with two parallel networks instead of one, for the sake of simplicity and making the system easier to analyze. There's what I call the RPC network, which is used for remote procedure calls and also for things like interrupt requests. These are fixed-size datagrams, four 32-bit words per packet, and serve roughly the same function as ioctl. So if you want to send a message to flash, you issue the syscall instruction with a0 set to the address of flash, the high bits of a0 set to blocking send, non-blocking send, et cetera, and the a1, a2, a3 registers as the payload of your message. Then we also have the DMA network, which is used for bulk data plane traffic: cache lines, DMA of data from the network, DMA of data to disk, et cetera. While these certainly could be merged into one network, a lot of the correctness analysis is a lot easier if you have the data plane and the control plane separated.
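To make the RPC message format concrete, here's a sketch of packing one of those four-word datagrams. The exact bit layout (which bits of a0 carry the destination address versus the send-mode flags) is my assumption for illustration; the talk only says that both live in a0.

```c
#include <stdint.h>

/* Sketch of the fixed-size RPC datagram: four 32-bit words per
 * packet, carried on the RPC network. */

enum send_mode { SEND_BLOCKING = 0, SEND_NONBLOCKING = 1 };

typedef struct {
    uint16_t dst_addr;   /* 16-bit on-chip routing address (low half of a0) */
    uint16_t flags;      /* blocking/non-blocking, etc. (high half of a0) */
    uint32_t data[3];    /* payload: contents of a1, a2, a3 */
} rpc_message;

/* Pack the message into the four words that would go on the wire. */
void rpc_pack(const rpc_message *m, uint32_t out[4]) {
    out[0] = ((uint32_t)m->flags << 16) | m->dst_addr;
    out[1] = m->data[0];
    out[2] = m->data[1];
    out[3] = m->data[2];
}
```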
Fundamentally, both of these networks provide reliable datagrams. We have in-order delivery between any pair of endpoints, but there's no way to ensure ordering across endpoints: if I send two messages to two different endpoints, congestion further down the network can mean either one arrives first; they don't necessarily arrive in the order they were sent. However, if I send messages from one node to another node, they'll always arrive in FIFO order. So if you actually need to guarantee that one endpoint gets a packet before another, you send the first message, wait for an acknowledgment, and then send the message to the second endpoint. I also have guaranteed minimum QoS, so you can do hard real time on here. Each router guarantees you one fifth of the available bandwidth per hop; the network in the prototype is a quadtree, with four downstream ports and one upstream port, so you're guaranteed a minimum of one fifth of the available bandwidth, and if other nodes are not using their share, you can burst and use it. The nice thing about this is that it allows us to do network-based access control. Remember, this is all on chip; it's not exposed to the adversary unless someone is probing your die and FIBing it and so on, and that is currently outside my threat model. There's plenty of existing, perfectly good research on physical attack countermeasures; I am specifically targeting software-based attackers. The network is accessed via a fully verified transceiver core, and the nice thing is that once I've proven this transceiver core will always send correct, well-formed messages, then as the system integrator I can take arbitrary black box third party IP cores, connect them to this transceiver, and make certain strong guarantees about system behavior while knowing nothing about the netlist. I'm able to guarantee that this block, whatever it is, at a minimum cannot prevent any other node from sending traffic.
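The one-fifth minimum bandwidth guarantee can be pictured as a work-conserving round-robin arbiter over a router's five ports. This is a guess at one way to realize the guarantee, not the actual router logic: each port with traffic pending is granted at least every fifth slot, and slots that idle ports don't claim go to whoever is waiting, which is the bursting behavior mentioned above.

```c
/* Round-robin arbiter over the five ports of a quadtree router
 * (four downstream, one upstream). Port numbering is illustrative. */

#define NPORTS 5

typedef struct { int last; } rr_arbiter;

/* pending[i] is nonzero if port i has a packet waiting.
 * Returns the port granted this slot, or -1 if all ports idle. */
int rr_grant(rr_arbiter *a, const int pending[NPORTS]) {
    for (int i = 1; i <= NPORTS; i++) {
        int p = (a->last + i) % NPORTS;
        if (pending[p]) {
            a->last = p;
            return p;
        }
    }
    return -1;
}
```

With all five ports saturated, each gets exactly every fifth grant; with a single busy port, it is granted every slot.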
It cannot prevent any packets from being delivered. It cannot observe or modify the contents of anybody else's traffic. And any time it sends a message, it cannot misrepresent who it's from. This allows us to use the packet headers for access control: we can determine who sent a message and grant or deny the request based on the source address. At this point, inter-process communication is trivial, because we have a unique address for each application on the network, so we can just send a message from one application, route it back out onto the network, and deliver it to the other process. So the only thing left from that list of four items that we can't do in user space is memory management. It turns out we can get rid of this too, because since the individual nodes in the network can authenticate messages based on their origins, we can now do the access control at the physical address level in the individual devices. Remember that in the exokernel, the goal of each driver is to encapsulate a resource, divide it up into chunks, and let everything beyond that be handled by user space. If all we want to do is take memory, divide it up into pages, and keep track of who owns which page, the data structures required are unbelievably simple: it comes out to a FIFO of free pages and an array storing the owner of each page. Very low gate count. At this point, the processor's MMU is no longer security critical. The MMU still provides translation, so we can map from a 16-bit routing address plus a 32-bit address within each block down to a 32 or 64 bit address in application software. But this is simply a convenience, so I can take pages of different stuff and map them at, say, wherever my linker script says my stack should be, or wherever my .data segment should be, and so on. It does not provide any security, because the security is provided at the physical address level.
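The data structures really are that small. Here is a C model of the FIFO-of-free-pages plus owner array described above, with ownership checked against the packet's source address; the page count and the choice of 0 as the "free" marker are my assumptions for the sketch.

```c
#include <stdint.h>

/* Model of the RAM controller's allocator state: a circular FIFO
 * of free page numbers plus an array recording which node (16-bit
 * routing address) owns each page. */

#define NPAGES 8
#define NOBODY 0   /* routing address 0 reserved to mean "free" here */

typedef struct {
    uint16_t owner[NPAGES];   /* owning node's routing address per page */
    int free_fifo[NPAGES];    /* circular FIFO of free page numbers */
    int head, nfree;
} page_allocator;

void pa_init(page_allocator *pa) {
    pa->head = 0;
    pa->nfree = NPAGES;
    for (int i = 0; i < NPAGES; i++) {
        pa->owner[i] = NOBODY;
        pa->free_fifo[i] = i;
    }
}

/* Allocate a page to `requester` (the request packet's source
 * address). Returns the page number, or -1 if out of memory. */
int pa_alloc(page_allocator *pa, uint16_t requester) {
    if (pa->nfree == 0)
        return -1;
    int page = pa->free_fifo[pa->head];
    pa->head = (pa->head + 1) % NPAGES;
    pa->nfree--;
    pa->owner[page] = requester;
    return page;
}

/* Per-request access check: the source address in the packet
 * header must match the recorded owner. */
int pa_check(const page_allocator *pa, int page, uint16_t src) {
    return pa->owner[page] == src;
}

/* Free a page; only the owner may do so. Returns 1 on success. */
int pa_free(page_allocator *pa, int page, uint16_t src) {
    if (!pa_check(pa, page, src))
        return 0;
    pa->owner[page] = NOBODY;
    pa->free_fifo[(pa->head + pa->nfree) % NPAGES] = page;
    pa->nfree++;
    return 1;
}
```

Because every decision is a FIFO pop or a single array lookup keyed by the unforgeable source address, this maps directly to a low gate count in hardware.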
So I can map anything I want, and if I try to dereference a pointer that points at something I don't have permissions to, I'm just going to fault the first time I dereference that pointer. And now the CPU can, through this management address, provide direct access to my page table from user space. All I have to do is send it a message saying: create a new mapping for the following physical address at the following virtual address. This provides me with some abstraction, since I don't have to worry about what the page table looks like under the hood, but I don't need any other permissions to do it. And the RAM controller is now able to be smart and create physical address allocations for me in hardware. So we send an allocate request to RAM, we get back physical address 0x8000; then we send a map request to the CPU and say we want to map it at 0x4141000, and it says, all right, sure, you're good. The nice thing is that since the data structures are so simple, it is relatively easy to test and verify this. I don't have a full formal correctness proof of the RAM controller at this point; I do have a fairly extensive conventional verification suite, and I have plans in the longer term to do formal correctness proofs on the RAM controller. So far, the main components that do have correctness proofs are in the interconnect fabric itself. The transceivers, the routers, and so on are proven correct: at the link layer, messages are shown to get to the other end of each hop correctly, and routing is shown correct as well, so when I send a message through a router, it'll always go out the right port. At this point, the question you're probably wondering is: what's left in ring zero? And it turns out the answer is nothing. We can now take all the privileged instructions in the ISA and delete them entirely.
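A sketch of what that non-security-critical translation step might look like: a lookup from virtual page to a (routing address, physical page) pair on the network. The entry layout, table size, and 4 kB page size are illustrative assumptions; the point is that a miss, or a mapping the target device rejects, simply faults, because enforcement lives in the target device rather than in the MMU.

```c
#include <stdint.h>

/* Each entry maps one virtual page to a page within some node on
 * the on-chip network. */

#define PAGE_SHIFT 12
#define NENTRIES   16

typedef struct {
    uint32_t vpage;   /* virtual page number */
    uint16_t node;    /* 16-bit on-chip routing address (e.g. RAM) */
    uint32_t ppage;   /* page number within that node */
    int      valid;
} tlb_entry;

/* Translate a virtual address into (node, physical address).
 * Returns 0 on miss, which the CPU treats as a fault. */
int mmu_translate(const tlb_entry tbl[NENTRIES], uint32_t vaddr,
                  uint16_t *node, uint32_t *paddr) {
    uint32_t vpage = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < NENTRIES; i++) {
        if (tbl[i].valid && tbl[i].vpage == vpage) {
            *node  = tbl[i].node;
            *paddr = (tbl[i].ppage << PAGE_SHIFT)
                   | (vaddr & ((1u << PAGE_SHIFT) - 1));
            return 1;
        }
    }
    return 0;
}
```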
We don't need privileged instructions anymore, because we now have all the services a modern multitasking operating system provides either in hardware or in unprivileged user space. We're running user space on bare metal, and now we have what I term an antikernel: an operating system that does not have a kernel. A lot of you are probably thinking this is just a hardware microkernel. The thing is, in a microkernel you still have some portion of software, smaller than a monolithic kernel, but some section of software somewhere that still has permission to access all system state. Some of you may remember the Intel SYSRET bug from a couple of years ago, where you could put a syscall instruction at the last virtual address in userland memory, issue a syscall, and SYSRET would restore all of your registers, leave you with ring zero permissions, return to an unmapped address, and then jump to the exception handler with ring zero privileges and registers that user space controls. At that point, it doesn't matter how big your kernel is: you've still got some software that has access to all system state, and guess what, that's you now. The beauty of this system is that since everything is decentralized, there is no longer any single authority that has access to everything. The memory manager just manages page metadata. The thread scheduler just manages thread IDs and pushes the next one out to the register file. Worst case, if you manage to corrupt that thread scheduling logic, okay, you've caused a denial of service: maybe one thread's not going to execute and another one's going to execute for more time, but you still have no ability to corrupt the thread contexts themselves. Everything is decentralized, and each driver or subsystem manages its own state. If you find an exploit in the API for one particular module, you can corrupt the state of that module, but that still gives you no access to anything else. There's no higher privilege to escalate to.
And since these components communicate in a limited and formally defined manner, you can actually do correctness verification on a per-module basis and know that there are no other interactions between them. That's one of the things I really like about hardware for formal correctness analysis. If you're trying to prove correctness of one thread in an RTOS, it becomes a lot harder, and especially so if you're trying to prove correctness of one subsystem in a monolithic kernel: you're sharing one address space, so one bad pointer anywhere outside the code you've proven correct has just thrown all of your proofs out the window. Whereas in hardware, it's a lot easier to say that, barring analog effects like Rowhammer, if there are no wires coming into your module from some other module, directly or indirectly, it cannot affect your state. So it becomes much easier to say: I've got this third party IP, I don't really know what it is, I don't fully trust it, but I can at least say there are certain things it cannot do. This also means that your TCB is what you make it. If I have some large complex system, say a security core somewhere that does crypto on data coming in from the network, I can take my key and store it in an on-chip block RAM. I don't have to persist it to disk. If I do persist it to disk, maybe I'll persist it encrypted with some hardwired key, and so on. But I can also persist it to disk by talking directly to the disk controller, and since I'm not going through the file system, a bug in the file system doesn't give anybody the ability to even erase my key, much less modify or read it, because the file system has no permissions to the page that the disk controller gave directly to me.
So you can actually mix and match individual subsystems within the OS, and if you choose not to use a given API or abstraction or peripheral, it has no effect on your security posture. The CPU I went with in the prototype is named Saratoga, after a town near where I was doing my PhD. As I mentioned earlier, it's compatible with MIPS GCC but is not full MIPS. It's an eight-stage barrel processor: we force a context switch every clock cycle. This does mean that your single-threaded performance suffers; you only get one instruction every eight clocks at best, or two, because it's two-way superscalar. The advantage is that by getting rid of all the pipeline hazard checking, you don't have to worry about things like forwarding from one pipeline stage to another, so your pipeline becomes a lot simpler and it becomes a lot easier to do correctness proofs. I'm working toward, though I have not yet reached, a complete proof of data path isolation between the threads. In other words, when I run a thread in one context of the CPU, I can show that it won't affect the state of any other thread except through message passing. And the message passing interface allows us to make strong guarantees about things like verifying that a message came from who I think it came from. One of the other things I added to the CPU was a hardware ELF loader. Remember that the CPU has an out-of-band management address. The way you start a new process in the system is to send a message to that management address, either from another process on the CPU or from external hardware, that says: here's the physical address of an ELF executable, go run it.
It turns out that if you're not doing dynamic linking and all of the fancy stuff that ELF allows, and you just want to take a simple embedded executable and run it, all you have to do is parse the program header table, create a simple memory image, and jump to the entry point. This isn't really that much work; it's a couple hundred lines of Verilog at this point, I think. And since I'm parsing the ELF headers in hardware, I can set the page table to initially have nothing executable. As I load ELF segments I mark them executable, check a signature over the entire executable and all the program headers, and if that signature is good, allow it to execute; otherwise I just free all the memory I allocated. Then I deny the MMU the capability to set the execute bit on additional pages. This means you can't ever execute unsigned code. It doesn't make exploitation impossible; it means you have secure boot. It also means that if you do manage to find a vulnerability in one component of the system, not only your initial dropper but your entire payload has to be ROP. So yes, it can be exploited if you can find enough ROP gadgets, but it's going to be a pain, and that's a desirable side effect; the real intention of the signature checking is to have secure boot. The one last subsystem I'm going to mention is the name server. This takes eight-character host names and maps them to 16-bit routing addresses. The purpose is just board portability: I can have one board that has DDR2 and one board that has on-chip block RAM, and abstract both away to the host name "ram". As long as they expose the same API to the rest of the system, I don't care what address they're at on the network; if they expose the same API, they're considered that driver. This allows us to do full operating system abstractions and not worry about the details of what the system-on-chip architecture looks like.
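The "parse the program headers, build a memory image, jump to the entry point" loop is small even in C. This sketch follows the standard 32-bit ELF structures; the signature check, page permissions, and validation are omitted, and the flat `mem` array standing in for the target address space is a simplification of mine.

```c
#include <stdint.h>
#include <string.h>

/* Walk the ELF program header table, copy each PT_LOAD segment
 * into the memory image, zero the BSS tail, return the entry point. */

#define PT_LOAD 1

typedef struct {            /* leading Elf32_Ehdr fields we need */
    uint8_t  e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version, e_entry, e_phoff, e_shoff, e_flags;
    uint16_t e_ehsize, e_phentsize, e_phnum;
    /* ...remaining section header fields omitted... */
} elf32_ehdr;

typedef struct {            /* Elf32_Phdr */
    uint32_t p_type, p_offset, p_vaddr, p_paddr;
    uint32_t p_filesz, p_memsz, p_flags, p_align;
} elf32_phdr;

/* `image` is the raw executable; `mem` models the target address
 * space, indexed directly by vaddr for simplicity. */
uint32_t elf_load(const uint8_t *image, uint8_t *mem) {
    const elf32_ehdr *eh = (const elf32_ehdr *)image;
    const elf32_phdr *ph = (const elf32_phdr *)(image + eh->e_phoff);
    for (int i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_LOAD)
            continue;
        /* copy the file-backed part of the segment */
        memcpy(mem + ph[i].p_vaddr, image + ph[i].p_offset,
               ph[i].p_filesz);
        /* zero the rest (BSS) */
        memset(mem + ph[i].p_vaddr + ph[i].p_filesz, 0,
               ph[i].p_memsz - ph[i].p_filesz);
    }
    return eh->e_entry;
}
```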
So we can move an application from one of these to another without even having to recompile it. All of the on-chip physical addresses for peripherals are baked in at compile time, or rather synthesis time, in mask ROM; anything beyond that is written by signed updates from applications. And at this point I'm running low on time, which is perfect. The prototype is about 200,000 lines of code; it was 187,000 when I wrote it up, and I think I've done a little more since then. That includes things like the custom JTAG library and the unit test infrastructure. The actual core critical stuff is very small: the networks are under 5,000 lines combined, the name server is about 1,000, and the CPU is about 9,000, and that includes whitespace and comments. So there's not really that much code here. All the code is three-clause BSD, open source, so I would love it if somebody would continue this work and write a paper that tells me why it's bad, or why it's great. So go for it. Questions?