Good afternoon. I hope everyone had a nice lunch. My name is Andrew Boie. I'm one of the maintainers on the Zephyr project, and today I'm going to give an overview of the memory protection features we've implemented in the Zephyr kernel over the past few years: what problems we're trying to solve, and then maybe a word later on some future work we're hoping to do in this area.

How many of you are familiar with the Zephyr project? Show of hands. Oh good, that's just about everyone, so I'm not going to dwell too long on this slide. Sometimes when I give this talk it's to an audience of people who've only done development on Linux. The key points for Zephyr are that everything is in one big physical address space; there's no virtualization of the address space, and when you're calling into drivers or kernel APIs you're doing it directly with pointers to the relevant data structures. If you're making an API call on a device driver, the reference to that device driver is the actual pointer to its data structures. Zephyr doesn't really have a file or file descriptor abstraction like Linux does, and the way the kernel exists right now influenced the design of how we could implement memory protection on a kernel like this.

Specifically, we're trying to solve two categories of problems here. The first is just to catch programming mistakes, and this is the use case we considered when we were first trying to imagine how we were going to do this and what it would look like. Right now in Zephyr, if you make API calls to a driver or a kernel API and pass in some garbage, you can completely hose the entire system. Even something as simple as a stack overflow can result in some incredibly mysterious behavior, where impossible things that should never happen start happening, because you have no idea you've just exceeded the bounds of your stack buffer and are overwriting whatever memory happened to be adjacent to it.

Later on we realized this really should be a full-blown security feature, so in addition to catching programming mistakes we want to be able to sandbox the handling of untrusted code and data. This lets us do things like application-level network protocols: you can use the MQTT subsystem from a reduced privilege level, and that will protect the rest of the system in case there was something compromised about the data coming in. And in general we want to support the idea of multiple logical applications running on the same microcontroller. These days it's become more and more common to take several applications, either completely orthogonal or only minimally interacting with each other, and to save cost put them all on the same microcontroller managed by an OS, rather than having discrete microcontrollers for every single one of these functions. If one of them starts misbehaving, we don't want it to be able to stomp on the other applications running on the same microcontroller.

We designed this for systems that have MPU hardware. We've also ported it to architectures like x86 which have an MMU, but in cases like that we're just using an identity page table, and the only thing we're using the MMU for is to control the access permissions for memory ranges.
Later on we're interested in scoping what a virtual-memory-enabled Zephyr would look like, but just like we did when we brought in user mode for MPU-based systems, we don't want to radically change what this kernel is, because we want this to be a scalable OS. We have lots of users, and will continue to have lots of users, working on very, very small systems with no memory protection hardware at all, devices that may only have eight kilobytes of system RAM and so forth, and we want to be able to support them. From there we scale up to systems with MPUs, and finally up to systems with memory management units, where we'd introduce the notion of Zephyr processes running in their own virtual memory spaces. But virtual memory spaces for Zephyr will be a talk for another day, because that's currently in the early scoping stages, so today I'm just going to talk about what we have now. Like I said, we don't want to radically change the kernel to accommodate this, and that influenced some of the design decisions we made when we brought this in; it'll become clear later what I'm talking about here.

Let me get through some terminology, which hopefully should be familiar to most people. When I say a supervisor thread, that's a thread running with the maximum CPU privilege level, which, if you don't turn on memory protection in Zephyr, is every thread on the system. What our memory protection feature brings in is the ability to create threads that run at a reduced privilege level, and we call them user threads. The supervisor/user terminology is consistent through most architectures, although it does sometimes vary, but for the purposes of this talk and the documentation we have, that's what we're referring to. Later on I'm going to be talking about memory domains, which are how we partition up the physical memory map to grant threads permission to access various ranges in it; I'll have more details on those in some later slides. And I'm also going to talk about kernel objects. If a user thread needs to talk to a device driver, or perform OS-level things like creating threads or setting up timers, or interact with IPC objects, all of these objects are considered kernel objects. We communicate with them through system calls, and we have a permission system for controlling which threads have access to which kernel objects. That also applies to device drivers.

So at a high level, we want to control access to memory: since all the threads are sharing the same address space, we need to be able to control, on a per-thread basis, what memory a thread can access, and the way we do that is through memory domains. We want to control access to kernel objects and device drivers. And at a very basic level, if a thread's running in user mode, it should not be able to crash the kernel.
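To make the supervisor/user distinction concrete, here is a rough sketch of creating a user thread and granting it access to one kernel object. This is illustrative rather than code from the slides; the API names (k_thread_create with K_USER, k_object_access_grant, and the zephyr.h include path) follow the Zephyr releases of roughly this era.

```c
#include <zephyr.h>

K_SEM_DEFINE(my_sem, 0, 1);                /* a kernel object */
K_THREAD_STACK_DEFINE(user_stack, 1024);   /* stack for the user thread */
static struct k_thread user_thread;

static void user_entry(void *p1, void *p2, void *p3)
{
	/* Runs at reduced privilege: only its own stack, program text,
	 * read-only data, and explicitly granted kernel objects are
	 * reachable.  This call goes through a system call, not a direct
	 * function call into kernel data structures.
	 */
	k_sem_take(&my_sem, K_FOREVER);
}

void spawn_user_thread(void)
{
	/* K_USER makes this a user thread; a K_FOREVER delay holds it
	 * back so we can grant permissions before it starts running.
	 */
	k_tid_t tid = k_thread_create(&user_thread, user_stack,
				      K_THREAD_STACK_SIZEOF(user_stack),
				      user_entry, NULL, NULL, NULL,
				      7, K_USER, K_FOREVER);

	/* Without this grant, the k_sem_take() above would be a fatal
	 * error for the user thread.
	 */
	k_object_access_grant(&my_sem, tid);
	k_thread_start(tid);
}
```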
That means that right now a lot of our kernel APIs either do minimal checking, or you can pass them things like memory buffers to work with, so if a user thread is calling these kernel APIs we need some way of validating their inputs. One of the things that made this a little difficult in Zephyr is that all of our kernel APIs deal with the actual memory addresses of the kernel objects involved, so we need to be able to validate these pointers, and I'll go into a little bit of how we do that. For some of these kernel APIs we also need additional policy constraints. For example, user threads can create new threads themselves, but any thread created by a user thread also has to run in user mode, and it has to have the same or lower CPU scheduling priority. I'll show how we implement all that.

When you create a thread in user mode, by default the only thing it can do is read and write its own stack buffer memory, plus read-only and execute access, as appropriate, to program text and read-only data. Any other memory range in the system it needs access to has to be granted through a memory domain. Similarly, when it comes to kernel objects, a newly created user thread by default only has access to its own thread object, so if it tries to make a call to a device driver, it'll get a fatal error. It never has direct memory access to the data structures associated with kernel objects; all interaction with device drivers and kernel objects is done through system calls. It can get permission on other kernel objects by being granted that by other threads. At the very basic level, the initial grant of permission has to originate from a thread running in supervisor mode, but if a thread has permission on another thread and permission on a particular kernel object, it can transfer that object permission to the other thread.

When making system calls we need to do some really rigorous checking. Historically in Zephyr a lot of our kernel APIs do not do much checking of their parameters at all, and even if they did rigorously validate all their inputs, there are certain classes of parameters where you still need to do something extra, on top of the implementation of that system call, in order to validate them. That would be things like passing in a buffer for a device driver to copy data into: on the other side of that system call we need to know that the calling thread actually has access to this memory and would be able to write to it, before we actually do anything with it. And the pointers to the kernel objects themselves have to be validated.

This diagram here just shows what a stack object looks like, in this particular case on x86. The lower part, the stack buffer and the thread-local storage area, a user thread can directly read and write. But on top of this we have some other kernel-level data structures the thread never has access to: on x86 we have a set of page tables for that thread, and also a kernel elevation stack for when we do a system call. On each architecture this looks different.
On ARM we just have the stack buffer, but at the very top we can optionally set a very small MPU region to prevent writes to it, so that if the thread's running in supervisor mode and its stack crashes into that MPU region we'll get an exception. It really depends on the architecture. Also on ARM, because of the alignment constraints of the ARM MPU, in order to efficiently pack everything we actually have the kernel stacks allocated in a different part of memory.

I've kind of covered what kernel objects are, but they basically cover our IPC objects: semaphores in the form of k_sem, pipes (k_pipe), mutexes (k_mutex), which are priority-inheriting, message queues (k_msgq), and we also have futex-like objects. I have another slide which goes into more detail on what those are, but with them we can handle basic wait-and-wake semantics as well as priority-inheriting mutexes. Then there's any device driver instance. The system calls aren't implemented on a per-device-driver basis; that would be incredibly cumbersome and tedious for developers. Instead, Zephyr has a notion of a driver subsystem, which is a common set of APIs for operations on devices of a particular type, and that's where we put the system calls. And then any OS-level things like system timers, thread objects, and thread stacks are also counted as kernel objects.

What we need is to be able, at runtime, when a system call hands us some pointer value, to say: yes, this is a valid pointer to the type of object I think it is. The current method is that in the build system we have a script which parses the DWARF debug information in the generated ELF file to look for all the instances of the particular kernel objects we want, and then we feed that to a tool called gperf, which builds a so-called perfect hash table containing all the addresses of these objects and the metadata associated with them. So far this has been working out. There are two potential pain points with using DWARF debug information. One is that somebody may want to build Zephyr with a compiler that does not emit this kind of debug information, which we haven't run into yet, but it is a possibility, so we are thinking about other ways of doing this. The immediate source of annoyance is that this does add a little bit of time to our builds, because the particular ELF-parsing library we use, pyelftools, works, but is not particularly fast. At any rate, we have built infrastructure to do this, and it generates a table of all the kernel object pointers which can be looked up really quickly.

There are also some additional data structures it can generate as well. On ARM we have to generate the stack buffers for when we elevate into privileged mode, and for futex-type objects we need to generate either the wait queues or the kernel-level priority-inheriting mutexes which back them. Also, for defining system calls, we've taken some effort to make defining new system calls fairly simple for the end developer; all the generation of the rather tedious amounts of boilerplate code that go along with creating system calls is handled by the build system.
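Each entry in that generated table is keyed by the object's address and carries the metadata I'll walk through in a moment. As a sketch only, with field names that approximate the kernel's internal object struct of this era rather than the exact definition, an entry looks something like this:

```c
#include <stdint.h>

/* Normally derived from Kconfig: enough bytes for one permission bit
 * per thread (ceil(max threads / 8)).  Defined here only so the sketch
 * is self-contained.
 */
#ifndef CONFIG_MAX_THREAD_BYTES
#define CONFIG_MAX_THREAD_BYTES 2
#endif

/* Rough shape of one generated kernel object metadata entry; the gperf
 * perfect hash maps an object's address to one of these.
 */
struct kobj_metadata {
	void *addr;          /* the object's address: the hash table key */
	uint8_t perms[CONFIG_MAX_THREAD_BYTES]; /* per-thread permission bitmap */
	uint8_t type;        /* K_OBJ_SEM, K_OBJ_MUTEX, a driver subsystem, ... */
	uint8_t flags;       /* e.g. an "initialized" bit */
	uint32_t data;       /* type-specific: stack size, thread ID, ... */
};
```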
And then finally, when I talk about memory domains, which are the mechanism for defining regions of memory that user threads can have access to, we have a system in place for coalescing and routing the global data structures for your application into their own memory partitions, which are then made contiguous and appropriately sized and aligned to correspond with the requirements of the memory protection hardware. More on this later.

In the metadata table that we create for kernel objects, the key is the address of the object, and what we generate is a data structure holding the metadata for that object. So we have a permission bitmap: all the threads in the system have a numeric ID assigned to them, and if the nth bit in that permission bitfield is set, that particular thread has access to that object and can make system calls on it from user mode. We also know the type, so you can't pass a k_mutex to a semaphore API and expect it to work, and we take that down to the driver level too. In Zephyr all drivers are instances of struct device, but we have infrastructure for knowing at build time what subsystem a driver belongs to, so if you tried to make a UART API call on something that was a counter driver, it would appropriately return an error instead of just doing something horrible. We also track whether the particular kernel object has been initialized or not; we have a flags field where we track things like that. And then there's an additional data field, the semantics of which depend on what type of object it is: for something like a thread stack it's the actual size of the stack array that got defined, for threads it's the numerically assigned thread ID, things like that.

One of the things we added this summer is futex-like objects. I'm not going to go into a deep dive on what futexes are. Basically, for locking primitives, if a lock is not contended you do not want to have to make a system call in order to acquire that lock. Futexes allow you to implement locking primitives where, if there's no contention, it's just an atomic operation and then you have the lock; it's only if there's some kind of contention that you have to make a system call and pay all the attendant overhead of the privilege elevation and everything else that goes into a system call. That's what these are for, and you can define futexes in your code. For example, we have a data structure called sys_sem, which is a semaphore type implemented on top of futexes: it has a futex in it and also a limit count. We also have the k_sem type, which is the older kernel-level semaphore; all the data for a k_sem has to live in kernel memory, and you have to make system calls to interact with it each time. sys_sem is a much simpler way of defining these objects, because all the memory for the semaphore can just go into your user code, and the only time you actually need to call into the kernel is when you definitely know you need to wait on it.
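Here is roughly what using sys_sem looks like; a sketch assuming the sys/sem.h header path and the sys_sem_* API names from the Zephyr releases of about this era.

```c
#include <zephyr.h>
#include <sys/sem.h>   /* header path as of roughly this Zephyr era */

/* The semaphore lives in user-accessible memory, unlike a k_sem, whose
 * data must live in kernel memory behind system calls.
 */
static struct sys_sem my_sem;

void sys_sem_example(void)
{
	sys_sem_init(&my_sem, 0, 1);   /* initial count 0, limit 1 */

	/* An uncontended give or take is just an atomic operation on the
	 * count in user memory; the kernel is only entered (via the futex
	 * system calls) when a thread actually has to sleep or a sleeping
	 * waiter has to be woken.
	 */
	sys_sem_give(&my_sem);
	sys_sem_take(&my_sem, K_FOREVER);
}
```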
Like I said before, we have a k_futex type which just does wait and wake, and we have a semaphore implementation on top of that. But we have a lot of other kernel object types, like pipes and message queues and so forth, and we're hoping this winter to be able to spend some time and actually make those futex-backed as well, because the more code we can actually run in user context, the better. We have all this code for IPC that doesn't really need to be up there in kernel space, and we'd like to move more of it down and only involve the kernel when we actually need to, say, put ourselves on a wait queue, that sort of thing. We also have the priority-inheriting sys_mutex, implemented similarly; on the kernel side, if you actually have to wait on one of these mutexes, it uses our k_mutex data type to implement the priority inheritance. This is very similar to how Linux does it with the futex PI calls, where when you call into the kernel it instantiates an rt_mutex behind the scenes to actually perform the locking. Unlike Linux, we don't currently support letting you wait or wake on arbitrary pointers; we just look for instances of k_futex that were found in the build.

In general in Zephyr we are trying to avoid bringing in dependencies on dynamic memory allocation as much as possible. Zephyr is an OS that's going for various functional safety certifications, and although we do support some kinds of dynamic allocation use cases, we're trying to provide a toolkit that lets you use all these features without having to bring in heaps, either on the user side or on the kernel side. That's why, when we were doing this stuff, we did it at build time and with static allocation as much as possible.

Having just said that, we do support the instantiation of dynamic kernel objects. We don't support it for device drivers, and we don't support it for thread stacks, mostly because Zephyr doesn't have a heap allocator that lets you request aligned memory. We have a notion of a resource pool as a place to draw such memory from: instead of having just one global kernel heap, you can assign groups of threads to different resource pools so that they can't starve each other. And we also use the object permission bitfield as a kind of reference count for these dynamic kernel objects, so when all threads drop their references to one of these objects, the item is automatically freed.
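As a sketch of how those pieces fit together, assuming the k_object_alloc, k_object_release, k_thread_resource_pool_assign, and K_MEM_POOL_DEFINE names from the Zephyr releases of this era (later releases reworked the pool APIs around k_heap):

```c
#include <zephyr.h>

/* A resource pool for a group of threads to draw dynamic kernel
 * objects from, instead of one global kernel heap.
 * (Block sizes 64..1024 bytes, up to 4 max-size blocks, 4-byte align.)
 */
K_MEM_POOL_DEFINE(app_pool, 64, 1024, 4, 4);

void dynamic_object_example(void)
{
	/* Allocations on behalf of the current thread now come from
	 * app_pool, so this group of threads can't starve others.
	 */
	k_thread_resource_pool_assign(k_current_get(), &app_pool);

	/* Instantiate a semaphore at runtime.  The permission bitmap
	 * doubles as a reference count: once every thread with access
	 * has dropped its reference, the object is freed automatically.
	 */
	struct k_sem *sem = k_object_alloc(K_OBJ_SEM);
	if (sem != NULL) {
		k_sem_init(sem, 0, 1);
		/* ... use it, grant it to other threads ... */
		k_object_release(sem);   /* drop this thread's reference */
	}
}
```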
For system calls, like I said earlier, we tried to make these as painless as possible to define. You basically have to do three things. In the header file where you have the prototype, you prefix the prototype for the system call with __syscall, and that's a marker for our build-time logic to say: this is a system call definition, I need to generate a whole bunch of boilerplate for it. Then there's the implementation function, which is the logic for the system call itself; it assumes that all the arguments are valid and at least point to valid areas of memory. And on top of this there's also a verification function.

Before the implementation function gets called, if a system call is invoked from user mode it has to get through the verification function first, and this is where we put the extra checking. For example, for the implementation of something like k_sem_take, where you have to pass a pointer to a semaphore object, it's in the verification function that we do the lookup in the table of kernel objects we created at build time and find out: okay, this is actually a valid k_sem pointer, the calling thread does have permission on this kernel object in order to make an API call on it, and the semaphore is in an initialized state. Only then does it call the implementation function. Similarly, for any system call where an argument is a memory buffer, "here's a buffer where I want you to copy some data in" or "here's some data I want to pass to the kernel", there are APIs in the verification function to validate that this is valid memory the caller is actually able to read or write. And if we're passing data into the system call by pointer, we make a copy there, because we want to prevent the so-called TOCTOU attacks, time-of-check versus time-of-use: you could validate the struct, but if you don't make a copy of it, and you just hand the user pointer to your implementation and you get preempted, someone else could change it, and then horrible things could happen because your assertions about the validity of the memory are no longer valid. So the verification function is where you do the copying to and from user memory.

On top of all this there's all the boilerplate that gets generated. I'm not going to go into the details of everything, but there's lots of header magic and that sort of thing. One thing to note is that if you call these system calls from supervisor mode, we bypass a lot of this and go directly to the implementation function. It's only when you make these API calls from user mode that we marshal the arguments into the registers and issue the software interrupt. Other than that, the way we do system calls is just like any other OS; there's nothing particularly special or clever about it, other than, I think, the boilerplate we generate for you to make them easy to define.
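Put together, the three pieces look roughly like this for a semaphore-take call. The z_impl_/z_vrfy_ prefixes and the Z_SYSCALL_OBJ/Z_OOPS validation macros follow one Zephyr release's spelling; the exact names have shifted between releases, so treat this as the shape, not the letter:

```c
/* 1) In the public header: the __syscall marker tells the build system
 * to generate the user-mode stub and marshalling boilerplate.
 */
__syscall int k_sem_take(struct k_sem *sem, int32_t timeout);

/* 2) The implementation function: the actual logic.  It may assume all
 * arguments have already been validated.
 */
int z_impl_k_sem_take(struct k_sem *sem, int32_t timeout)
{
	/* ... the real semaphore logic ... */
	return 0;
}

/* 3) The verification function, run only for invocations from user
 * mode: confirm sem is a valid, initialized k_sem the caller has
 * permission on, then forward to the implementation.  Buffer arguments
 * would be checked, and copied in or out, here as well.
 */
static int z_vrfy_k_sem_take(struct k_sem *sem, int32_t timeout)
{
	Z_OOPS(Z_SYSCALL_OBJ(sem, K_OBJ_SEM));
	return z_impl_k_sem_take(sem, timeout);
}
```

Supervisor-mode callers skip the stub and the verification entirely and jump straight to the implementation function.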
So like I said, we are targeting devices with MPU hardware. This is an example on the SAM E70 board, and that's an ARM Cortex-M; I think it's an M4, either an M3 or an M4. Oh, it's an M7? Okay, in that case we actually have 16 available regions instead of eight. In general you'll have some regions which are set up at boot. For one thing, you have a background map which is baked into the hardware, but on top of the background map you define a few regions at boot, say one for the program text, to say that user mode has read and execute access to it, and another for your read-only data, set up as an MPU region with read-only properties for user mode.

Then on context switch you update an MPU region which corresponds to the stack buffer of the currently running thread, and the rest of the available regions are reserved for memory domains. Just as a side note, we also have this working on x86, where we use an MMU, but in this case the MMU has an identity page table and otherwise the semantics are the same: you'll have some regions set up at boot, and then you can use memory domains to grant additional regions, and that gets updated at every context switch.

So here's an example of memory domains at work. A memory domain is just a collection of partitions, and each partition is just a memory range with access characteristics. In the example here we have three logical applications running on this system, and four different memory partitions defined. Let's say application one has three threads. Two of these threads just need to access the application one partition, which is just all the globals for application one. One of the threads also needs to talk to application two, so it has access to the application one partition and also a shared region. Similarly, in application two there's a domain which lets threads that are members of it access the shared region and the application two partition. And then all by itself is application three, which has its own globals and doesn't need to see any memory related to the other applications. When you context switch into a thread, whatever memory domain it belongs to gets programmed into the MPU so the thread has access to those regions. Similarly, in this example, one of the threads in application one needs to have access to a particular device driver, so it has permission on that device driver and the other threads don't. And the threads in applications one and two that talk to each other synchronize on an IPC object they both have permission on, as well as being able to access that shared region, where you'd have something like a memory pool to exchange data.
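In code, setting up a domain like application one's looks roughly like this; a sketch using the k_mem_domain_* APIs and K_MEM_PARTITION_DEFINE. The buffer names here are stand-ins I made up, and the attribute macro is architecture-defined:

```c
#include <zephyr.h>

/* Stand-ins for the app's globals and the shared region; on ARM each
 * partition must be a power-of-two size aligned to its own size.
 */
static char __aligned(256) app1_mem[256];
static char __aligned(256) shared_mem[256];

K_MEM_PARTITION_DEFINE(app1_part, app1_mem, sizeof(app1_mem),
		       K_MEM_PARTITION_P_RW_U_RW);
K_MEM_PARTITION_DEFINE(shared_part, shared_mem, sizeof(shared_mem),
		       K_MEM_PARTITION_P_RW_U_RW);

static struct k_mem_domain app1_domain;

void setup_app1_domain(k_tid_t thread)
{
	struct k_mem_partition *parts[] = { &app1_part, &shared_part };

	k_mem_domain_init(&app1_domain, ARRAY_SIZE(parts), parts);

	/* From now on, context switching into this thread programs MPU
	 * regions granting access to both partitions.
	 */
	k_mem_domain_add_thread(&app1_domain, thread);
}
```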
And then finally I want to get into our automatic memory domain setup. At the very simplest level, a memory partition is just a struct with an address, a size, and access permissions. But what you want in your code is an application, probably spanning several C files, with all the globals in those C files routed into these memory partitions. And it's kind of hard to do that unless the linker is part of this process and can say: okay, we have all these globals for application A, let's coalesce them into a memory region and then appropriately align and size that region, because a lot of MPUs have constraints about their region definitions that can be a little painful to work with. For example, on the ARM ones, any region has to be a power-of-two size and aligned to its own size. The automatic memory partitioning system we have basically handles that for you.

This is something of a dense slide, but here we have a partition foo and a partition bar that are defined in code, and we don't say anywhere what the base address or the size is. We just say we have a partition foo and a partition bar, and then the various globals, x, y, z, corge, and grault, get routed to these partitions using the K_APP_DMEM and K_APP_BMEM macros, which just indicate whether it's data or BSS. And if you need to pull in a third-party library, you can do the same thing. The script that generates all the linker magic for this will scan everything, set up the actual partition ranges in the linker script, appropriately pad everything the way it needs to go, and coalesce it all in memory, and then at boot it'll zero all the BSS parts. So when you boot up, you don't have to worry about merging all these things together; it's been done for you, and you just install these partitions in a memory domain and it all works.

This is not to say you're off the hook for everything, because it's still a little bit of an art form setting these things up. For example, partition foo has a character array of 128 bytes plus two integers, so our raw size for it is 136 bytes. But this gets rounded up to 256, because the memory regions have to be a power of two. So there's still a little bit of an art to making sure you maximize your memory usage, because otherwise the padding this thing has to add will just waste space in your binary. But this still makes it a lot easier than trying to coalesce it all yourself, and we have some scripts that'll dump a report showing how everything got laid out.
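Here is roughly what the slide's definitions look like in source, assuming the K_APPMEM_PARTITION_DEFINE, K_APP_DMEM, and K_APP_BMEM macro names and the header path from the Zephyr releases of about this era:

```c
#include <app_memory/app_memdomain.h>  /* header path of roughly this era */

/* Declare partitions by name only; the linker script generator finds
 * every variable tagged for them, coalesces them, and pads each
 * partition out to the MPU's size and alignment requirements.
 */
K_APPMEM_PARTITION_DEFINE(foo_partition);
K_APPMEM_PARTITION_DEFINE(bar_partition);

/* Initialized globals are routed with K_APP_DMEM... */
K_APP_DMEM(foo_partition) int x = 5;
K_APP_DMEM(bar_partition) int y = 10;

/* ...and zero-initialized ones with K_APP_BMEM (zeroed at boot). */
K_APP_BMEM(foo_partition) char buf[128];
K_APP_BMEM(foo_partition) int z;

/* foo's raw size: 128 + 2 * sizeof(int) = 136 bytes, padded up to the
 * next power of two (256) to satisfy the ARM MPU.
 */
```

The resulting foo_partition and bar_partition are ordinary struct k_mem_partition objects, so they can be installed into a memory domain with k_mem_domain_add_partition() like any other partition.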
And then I just have about five minutes left, so I'm going to skip over the resource pools. But I wanted to say: currently we are working on improving test and sample coverage, more IPC is coming based on futexes, and then pretty soon, hopefully maybe at the next one of these conferences, I'll be able to talk a little bit about virtual memory in Zephyr. I'd like to open the floor to questions if anyone has them. Go ahead.

The question was: can I elaborate a little on the threat model, and specifically, are user space threads considered untrusted? And the answer is yes. We consider them to be possibly hijacked by someone who is potentially malicious, so for any interaction where the user threads are making system calls or communicating with the kernel, we have to assume it could be completely malicious: either trying to crash the system through denial of service, or trying to gain access to data it's not supposed to see, that sort of thing. Yes, the build is considered trusted, so it's mostly a matter of bad data coming in that's being handled by a user space thread. Later on we will probably start supporting the idea of loading actual code at runtime, either through an ELF loader or something else. We don't support that in Zephyr right now, but it would fit into this model, in that all of that code would also be considered untrusted. You there in the back, sir?

The question was: what is the access policy for flash memory? Because there are use cases where some parts of flash memory may contain very sensitive data, such as a private key for cryptography purposes. The way the boot memory protection regions get set up is through SoC-specific C and header files, and that's where you set your MPU regions at boot. So for the use case you're describing, I would imagine you would not have the entire set of read-only memory available to user mode; you would have a finer granularity on which parts of the ROM you let user threads look at and which parts you don't. Does that answer your question? Basically, what I'm trying to say is that although I've described the default permissions for the text and read-only data, you have complete control over how you want to arrange that if you need to change it. Yes, in that case, this would be memory they could not access, and then for that thread's memory domain you would add that address range so that it could read it.

I guess we've got time for one more question. Go ahead, sir. The question was about architecture support. For all the architectures that support memory protection, automatic memory partitioning is supported; we currently support memory protection on 32-bit x86, ARC, and ARM Cortex-M. And if anyone is interested in porting memory protection to your architecture, I am absolutely happy to help, so please hit me up on Slack or email. I think that's all the time I have for today before the next speaker goes on. Thank you so much for your kind attention. If you'd like to talk to me some more, I'll probably be hanging around the Intel booth a bit, so feel free to grab me during the conference. Thank you.