So, hello everyone, I'm Jaxson Han and I'm very honored to be here. That is me on the slide, and this is Huifeng. Unfortunately he could not come to Prague this time, so I will present all the slides. We both work for Arm, and our work is quite related to Zephyr, so we also help maintain the Zephyr project. Today my topic is introducing hardware-level device isolation to Zephyr.

Here is the outline of this presentation. First I will talk about the background: why we need hardware-level isolation. Next I will give a quick introduction to the technology we use. Then I will talk about the Zephyr device model. Following that, I will explain how we introduce hardware-level isolation to Zephyr, and lastly I will provide a short summary.

First, why do we need hardware-level isolation? It is based on a simple observation: more and more DMA device drivers are being added to the RTOS. I think there are two points behind this trend. The first is that the number of DMA devices on low-power platforms is increasing. In the IoT industry, for example, the variety of DMA devices keeps growing, so the corresponding drivers are added to the RTOS, since the RTOS is still the mainstream way to manage low-power platforms. The second point is that there is an increasing demand for running an RTOS on high-performance platforms. Self-driving, for example, requires a platform with high performance and safety at the same time, and a good solution is running Zephyr on a high-performance platform to manage the whole system. And as we all know, high-performance platforms always come with multiple DMA devices, for example PCI devices.

Back to Zephyr: more and more DMA device drivers will be added to Zephyr. This makes Zephyr more popular, which is great, but it also brings new challenges. Naturally we will ask: is it possible for DMA devices to bypass the system access control, and if yes, what should we do to restrict them?

For the first question, after some investigation I found that many DMA devices can, intentionally or unintentionally, break the system. For example, some bugs in Wi-Fi chips can lead to permission leaks and thus enable remote control. We also found many DMA attacks that exploit uncontrolled DMA devices to steal data, install back doors, and even modify the system. What's more, Zephyr supports many platforms, and I think that makes this potential risk even more critical.

That brings us to the second question: since DMA can bypass the system access control, how can we restrict the DMA devices? Before discussing how to address the issue, let me briefly explain what the problem is. Currently Zephyr uses the MMU or MPU to restrict thread memory regions and protect the system. As you can see in the left diagram, during a context switch Zephyr also switches the memory regions for the upcoming thread. As a result, the current thread can only access its own memory regions and is prevented from accessing other regions. If a thread encounters a bug or misbehaves, for example if thread 2 tries to overwrite thread 1's memory, the MPU or MMU generates a data abort, so the system can effectively block this dangerous behavior. But the MMU or MPU can only restrict memory accesses from CPUs; memory accesses from DMA devices are not protected by the MMU or MPU. As shown in the right diagram, if a DMA device encounters a bug, or is a malicious one that wants to overwrite a thread's memory, which is not allowed in this case, the MMU or MPU does not help at all, because the protection provided by the MMU or MPU does not apply to DMA devices. This limitation tells us that we need another approach to restrict DMA devices.
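To make the gap concrete, here is a minimal sketch of the two situations, assuming a purely hypothetical DMA controller register layout (none of the names below correspond to real Zephyr or device code): a CPU store outside the thread's regions faults, while the same write issued by a bus master does not pass through the CPU's MMU/MPU at all.

```c
/* Hedged illustration only; the register layout below is invented. */
#include <stdint.h>

/* Hypothetical DMA controller registers, accessed through MMIO. */
struct fake_dma_regs {
	volatile uint64_t src_addr; /* bus/physical source address        */
	volatile uint64_t dst_addr; /* bus/physical destination address   */
	volatile uint32_t length;   /* transfer length in bytes           */
	volatile uint32_t control;  /* bit 0: start transfer              */
};

void cpu_side_violation(uint8_t *other_threads_buf)
{
	/* A CPU store to memory outside this thread's MPU/MMU regions
	 * traps immediately: the core raises a data abort and the OS
	 * can block or kill the offending thread. */
	other_threads_buf[0] = 0; /* -> data abort / MemManage fault */
}

void dma_side_violation(struct fake_dma_regs *dma, uint64_t victim_pa)
{
	/* The same overwrite issued by a bus master never passes through
	 * the CPU's MMU/MPU. Without an SMMU (or similar IOMMU) in front
	 * of the device, nothing stops this transfer. */
	dma->src_addr = 0x80000000ULL; /* arbitrary source          */
	dma->dst_addr = victim_pa;     /* another thread's memory   */
	dma->length   = 4096;
	dma->control  = 0x1;           /* start the transfer        */
}
```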
To mitigate the issue, one solution is to leverage hardware-level protection such as the SMMU from Arm, the IOMMU from Intel, and other similar technologies. All of these can restrict DMA devices. Taking the SMMU as an example, the SMMU allows the system to share page tables with DMA devices. It is powerful and widely used in Linux and in hypervisors. It can not only isolate DMA accesses but also eliminate the requirement for physically contiguous pages when allocating DMA buffers, and it can enhance the virtualization capabilities of a hypervisor. But the SMMU is too powerful for Zephyr. Fully supporting the SMMU in Zephyr would add too much overhead, because our goal is just to restrict DMA devices, not to provide virtualization features. It is also inappropriate to bring the SMMU to low-power platforms, due to power consumption and cost.

So what do we do to address these challenges? For the first concern, we only partially enable the SMMU, for example by setting up a linear address mapping, which allows the SMMU to avoid unnecessary overhead. For the second concern, we add a subsystem to manage DMA device isolation. This interface allows drivers to access the SMMU driver, or other similar technologies, in a uniform way. It also makes it easier to extend in the future to support more similar technologies, because I think we will eventually have an isolation technology designed specifically for low-power platforms. And apart from the isolation itself, since we have introduced an SMMU driver, we have also enhanced Zephyr's capabilities: for example, Zephyr could act as a hypervisor, and it becomes possible for Zephyr to support more platforms.

Next I will briefly introduce the SMMU. Similar to the MMU, the SMMU also performs address translation and access control. The difference is that the MMU translates addresses from the CPU, while the SMMU translates addresses for DMA devices. A platform typically has multiple DMA devices, so the SMMU also needs to support multiple translation tables: each DMA device needs its own translation table, as you can see in the diagram, and the SMMU performs the translation for each device separately. It is also possible for all devices to share one translation table, but that is not recommended from a security perspective.

This diagram illustrates how one access is translated by the SMMU. The SMMU has a stream table, multiple CD tables, and multiple page tables. An access from a DMA device carries a stream ID, a substream ID, and a virtual address, and the translation process involves several table lookups. When an access arrives, the SMMU first uses the stream ID to find the stream table entry in the stream table. The stream table entry points to the next level, which is a CD table base address. The SMMU then uses the substream ID to find the right CD table entry, and the CD table entry finally points to the page table, which is used to translate the virtual address into the physical address. This final translation from VA to PA is exactly the same as the translation performed by the MMU.
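The lookup chain just described can be summarized in a small conceptual model. The structures below are deliberately simplified for illustration; the real SMMUv3 stream table entry and context descriptor formats are defined by the Arm SMMUv3 architecture specification, and the page table walk is stubbed out as an identity mapping.

```c
#include <stdint.h>
#include <stddef.h>

struct cd  { uint64_t *page_table_base; };           /* context descriptor, per substream */
struct ste { struct cd *cd_table; size_t num_cds; }; /* stream table entry, per stream    */

struct smmu_model {
	struct ste *stream_table;
	size_t num_streams;
};

/* In a linear (identity) mapping the walk degenerates to PA == VA.
 * A real implementation would descend the translation tables exactly
 * as the MMU does; that walk is not modelled here. */
static uint64_t walk_page_table(const uint64_t *pt_base, uint64_t va)
{
	(void)pt_base;
	return va;
}

/* Translate one incoming DMA access: (stream ID, substream ID, VA) -> PA. */
uint64_t smmu_translate(const struct smmu_model *smmu,
			uint32_t sid, uint32_t ssid, uint64_t va)
{
	const struct ste *ste = &smmu->stream_table[sid]; /* 1. stream table lookup */
	const struct cd  *cd  = &ste->cd_table[ssid];     /* 2. CD table lookup     */
	return walk_page_table(cd->page_table_base, va);  /* 3. page table walk     */
}
```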
The SMMU also provides some flexibility. It can bypass the entire translation, in which case the VA equals the PA, and it can bypass the substream ID if needed. It can also reduce the number of translation table levels by using block attributes in the translation table, which accelerates the translation process, just like the MMU. Additionally, the SMMU supports stage 2 translation, but that feature is really for hypervisors and we do not use it in our case.

When a platform has an SMMU, the software can use it to restrict the DMA devices. As the left diagram shows, without an SMMU the DMA devices can access memory arbitrarily. Things are different with an SMMU: as shown in the right diagram, the DMA devices can only access the regions allowed by the SMMU and are prevented from accessing other regions. These capabilities help protect the system from uncontrolled DMA accesses.

Next, let me provide a brief introduction to the Zephyr device model. The official documentation already provides a very clear and detailed introduction, so I will go through it quickly, and I also borrowed the diagram from the official website since it is so clear. Zephyr defines subsystems, which provide device-independent APIs for applications to use. As shown in the diagram, on one hand the application is simply programmed against those generic APIs, and on the other hand the driver implements a subsystem by implementing and populating an instance of the API. The device driver data also allows Zephyr to support multiple driver instances with a single driver. This slide is an example of the Zephyr device model: the subsystem defines the API, the application uses it to call the drivers, the driver populates an instance of the API, and the device driver data allows the driver to serve multiple device instances. And this is a very concrete example with two UARTs on one platform, which is very common, so I will not go into the details.

Next I will explain how we introduce hardware-level device isolation to Zephyr. This is the sub-outline: I will start with the overall design, then talk about the DTS design and the subsystem design, then the implementation we have done, and finally discuss the latency issue.

In general, the left diagram shows the current Zephyr device model. After we introduce the device isolation subsystem, it becomes the framework shown on the right-hand side, and the blocks in orange indicate the modifications we made. The device isolation subsystem mainly consists of two parts: the first allows a driver to register devices into the subsystem, and the second allows the driver to restrict the memory accessible to those devices. We also use a domain to define an address space, but for now we only have one default domain. With the device isolation subsystem, a DMA driver can restrict the access regions of its DMA devices without having to care which implementation is underneath. The framework also allows the system to support multiple isolation technologies, which makes it easier to extend in the future.
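As a purely illustrative sketch of how such a subsystem could follow the Zephyr device model, the structure below shows one way the two operations just described (register a device, restrict its memory) might be expressed as an API table of function pointers. Every name here is hypothetical; the actual upstream proposal may look different.

```c
#include <stdint.h>
#include <stddef.h>

struct device;                /* Zephyr device handle (opaque here)      */
struct dev_isolation_domain;  /* an isolated I/O address space (domain)  */

/* Hypothetical driver-facing API that an SMMU/IOMMU backend would populate. */
struct dev_isolation_api {
	/* Attach a DMA-capable device (identified e.g. by its stream ID)
	 * to a domain so that its accesses are translated and checked. */
	int (*attach)(const struct device *iso_dev,
		      struct dev_isolation_domain *dom,
		      uint32_t stream_id);

	/* Allow devices in the domain to access [base, base + size) with
	 * the given permissions; everything else stays blocked. */
	int (*map)(const struct device *iso_dev,
		   struct dev_isolation_domain *dom,
		   uintptr_t base, size_t size, uint32_t perms);

	/* Revoke a previously granted region. */
	int (*unmap)(const struct device *iso_dev,
		     struct dev_isolation_domain *dom,
		     uintptr_t base, size_t size);
};
```

The point of the table-of-function-pointers shape is that it matches how existing Zephyr subsystems let several backends (here, SMMUv3 or another IOMMU) sit behind one generic interface.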
Zephyr uses DTS to describe the devices on a platform, so we need to define how the DTS is used. Taking PCI devices as an example, Zephyr needs to know the SMMU node information, the PCI device node information, and the relationship between them. The DTS should therefore provide an SMMU node with the base address and size of its MMIO registers, and DMA device nodes with essential information such as PCI addresses, IRQ information, and device types. Just as importantly, as shown in the diagram, the PCI node should provide an SMMU map to tell the system which SMMU is used to restrict its DMA devices, and the SMMU map contains a mapping between RIDs and stream IDs. Note that every DMA device needs a stream ID for the SMMU to perform the translation; in the PCI case, the PCI BDF is used as the stream ID.

More specifically, in the middle of the diagram, the device isolation subsystem we introduce provides several generic APIs, which the DMA driver on the left can use directly. On the far right, the device isolation subsystem is designed to support multiple implementations such as the SMMU or an IOMMU: an implementation only needs to implement the corresponding APIs to be integrated into the framework.

We built a proof of concept on the FVP, which is a simulated platform. The DMA device we use is a SATA controller on the PCI bus, and we use SMMUv3 to restrict that controller. The SATA controller is not fully supported, because there is no such driver in Zephyr for now, so we temporarily created a driver with very limited features just for the test. Without the SMMU restriction, the simplified process to enable the controller is to allocate the buffer and signal the controller to start transferring. In theory, at that point the controller can access memory arbitrarily, which is very dangerous. After introducing the device isolation subsystem, just before kicking off the controller, the DMA driver uses the device isolation subsystem to invoke the SMMU and restrict the memory regions for the controller, so that it can only access the memory regions allowed by the SMMU and is prevented from accessing other regions. This mitigates the issue we talked about and protects the system. This diagram shows that the SMMUv3 driver controls the SMMUv3 by sending commands to its command queue. The SMMU offers a lot of functionality; I only list some common operations in the diagram, and I am not going to elaborate on the implementation details, since every isolation technology differs a lot.

The SMMU typically introduces some latency to the system, but there are still methods we can use to help reduce it. First, we can use the linear address mapping I mentioned. Second, we can use block memory attributes to reduce the number of page table lookup levels, similar to what can be done with the MMU. We also skip the substream table to accelerate the lookup process, since the substream ID is used to distinguish the virtual address spaces of different processes. If possible, we can use Address Translation Services to reduce the pressure on the SMMU TLB, and we can statically define some memory resources, which may help reduce the number of translation table switches. All these methods can reduce the overhead, but the optimization really depends on the use case.
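Tying the pieces together, here is a hedged sketch of how a DMA driver might use a subsystem like the one sketched earlier just before starting a transfer, mirroring the proof-of-concept flow described above. All of these function names are hypothetical wrappers around that illustrative API and none of them exist in upstream Zephyr.

```c
#include <stdint.h>
#include <stddef.h>

struct device;
struct dev_isolation_domain;

/* Hypothetical convenience wrappers around the API sketched earlier. */
extern struct dev_isolation_domain *dev_isolation_default_domain(void);
extern int dev_isolation_attach(const struct device *iso_dev,
				struct dev_isolation_domain *dom,
				uint32_t stream_id);
extern int dev_isolation_map(const struct device *iso_dev,
			     struct dev_isolation_domain *dom,
			     uintptr_t base, size_t size, uint32_t perms);

#define PERM_READ  0x1
#define PERM_WRITE 0x2

int start_dma_transfer(const struct device *smmu, uint32_t stream_id,
		       void *dma_buf, size_t buf_size)
{
	struct dev_isolation_domain *dom = dev_isolation_default_domain();
	int ret;

	/* 1. Attach the device (its stream ID) to the default domain. */
	ret = dev_isolation_attach(smmu, dom, stream_id);
	if (ret != 0) {
		return ret;
	}

	/* 2. Allow the device to touch only its own DMA buffer. */
	ret = dev_isolation_map(smmu, dom, (uintptr_t)dma_buf, buf_size,
				PERM_READ | PERM_WRITE);
	if (ret != 0) {
		return ret;
	}

	/* 3. Only now kick off the transfer; any access outside dma_buf
	 * would be blocked by the SMMU. Programming of the actual DMA
	 * device is omitted here. */
	return 0;
}
```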
So next, I will give a quick summary. Our work starts with the issue that DMA can bypass the system access control. Since more and more DMA device drivers are being added to Zephyr, the risk is actually increasing. So, to protect the system, we introduce hardware-level device isolation to Zephyr. First, we enable the SMMU on the FVP board as an implementation example, and we also add a subsystem to make it easy to support different isolation technologies in the future. There is also some future work, such as trying to send this upstream as soon as possible so that we can have a discussion on it; this is definitely not a final decision, we are just raising the topic so that we can discuss it. And also, if possible, we would like to measure the SMMU latency. Yes, that's all of my slides. Thank you.

Hi, thanks. Do you know if any of this is possible on a Cortex-M device, or is this something that you can only do with an A-series device?

Sorry, Cortex-M devices?

Yeah, like an M4. Or do you need an A-class core?

No, the SMMU is designed for Cortex-A platforms. For Cortex-M, I cannot say.

First of all, super interesting. Using the API, can you change the translation table of the SMMU according to the thread that is running on the system? Like, for one thread you want to allow some regions, so you change the translation table for the SMMU, then on the next context switch another thread comes in and you change the translation table again to allow access to a different memory region. Does the API allow that or not?

No. We are just trying to bring up the topic of whether we can introduce such things, and we don't have concrete requirements, so this really needs a discussion.

Yeah, nice job. Are there mitigation techniques to lower the overhead implied by the SMMU?

Sorry?

Are there some mitigation techniques to lower the overhead implied by the SMMU?

From what I know, there are only some methods to lower the overhead; there is no universal, general method to lower the latency. The optimization should depend on the use case, because the SMMU is actually designed for rich operating systems, not for an RTOS, and that is why we use the linear address mapping, which is different from Linux.

Thank you. Thank you.