Okay, hello everybody. I believe it's about time to start, right? Thanks for coming to this talk. If I'm not speaking loudly enough, or if you don't understand something, or if you disagree, please feel free to interrupt me at any time. This will be something like a crossover talk between the microkernel dev room, which has been organized by us or by somebody else from the microkernel community for several years here, and the RISC-V hardware stuff. Let me just briefly introduce myself. I'm an operating system guy; I would say a microkernel guy. I have been working on the development of HelenOS for many years, and I have been working on formal verification of HelenOS as part of my research employment at Charles University in Prague. Quite recently I switched to industry: I'm working on microkernels at Huawei. So what is HelenOS? If you don't know, HelenOS is an open-source, general-purpose, multi-platform, microkernel, multi-server operating system designed and implemented from scratch. These are, you know, the buzzwords. Very quickly: it's not a Linux distribution or a BSD clone or anything like that. It's our own microkernel design, with our own user space built on top of it. Have a look at the website if you are interested. It's an open-source project, obviously, otherwise I wouldn't even be speaking here. We are not targeting embedded, we are not targeting real time, we are not targeting servers, we are not targeting desktops. We are targeting everything; it's a general-purpose operating system. And somehow we have this tradition of implementing things in a breadth-first rather than a depth-first manner. We support these hardware architectures, and I would say we support them properly, so we really try to keep the code base very portable. Only about 5% of the source code, by size, is platform-dependent; everything else is platform-independent.
And this is the reason why I'm talking about RISC-V and porting HelenOS to RISC-V. The multi-server keyword means that it's a microkernel system: the kernel component is small, and it does only what the kernel should do and the user space cannot. But the software architecture of the user space is also very modular, very fine-grained. So we don't have huge monolithic components even in user space. We have very small components, and each of the components does just one thing. And we have some guiding principles that we base our software architecture on. Again, if you are interested, please ask me offline and I will be more than happy to explain it to you in detail, so reserve five or six hours for that. This is a screenshot. Originally I feared that I would not be able to connect my laptop to this magic casting box, but I can even show you an actual demo. This is not on RISC-V; this is HelenOS running on x86, or AMD64. So you get a normal bootloader, now the microkernel is booting, and this is the user space environment. You can see that it's relatively feature-rich. We have our own graphical user interface. You can move windows, you can rotate them if I find the right keys, you can resize them. I don't remember. Oh yeah, you can rotate them and stuff like that. There are already something like 45 tasks, or processes, running, and each individual device driver runs as a separate process in user space; that's the microkernel design principle. We have a networking stack. Well, obviously we are not there yet, because as you can see, this is our web browser. Beautiful. But at least the networking is working. So this was a short demo; let's get to the interesting stuff. What is our goal? Generally speaking, it's about dependability. It's about creating a software architecture on the system level that would provide safety, security, and other guarantees for building dependable software.
So the architecture, as I have already described, is made of fine-grained components that are isolated from each other. This is basically limiting the blast radius, as it is called. If something goes wrong in a monolithic system like the Linux kernel, say there is a null pointer dereference in a device driver, the whole system goes out the window. In our case, only the single task, just the driver, goes out the window. And there are potential mechanisms for handling this at runtime. It can also be addressed at design time: you can apply formal verification techniques to make sure that there actually is no null pointer dereference in any of these components. Formal verification techniques are generic; they could be applied to any code base. But the monolithic nature, the monolithic architecture, of Linux or most of the BSD systems makes it hard to practically apply formal verification methods, because the code base is simply too large. In our case, you have these individual separate components and you can verify them piece by piece. So this is the way. We also try to write clean, understandable source code; our comment ratio is 38%, which is probably nice. Again, these are not very surprising ideas. It's about putting all the software engineering bits and pieces together: having a good software architecture, having a good implementation, doing the verification, and having a good development process. Ocean liners have been built like this for hundreds of years. You don't want your ship carrying, I don't know, 10,000 people to have a single hull, because a single iceberg can just destroy it. You want to have watertight bulkheads, so that it really requires a huge error, a huge iceberg, to sink your ship, and not a tiny one. Obviously there is no silver bullet, but you can do things better than just having a monolithic design. This picture summarizes it nicely, I would say.
These slides are more or less for your reference if you would like to go into it further; I understand it's not really readable at this size. These are the functional blocks in our microkernel. The only thing I would like to stress here are these hardware abstraction layers. Even inside the microkernel, which is the smallest indivisible component, we still have some internal structure. There are some parts of the code that are platform-dependent, that need to be adapted when we port HelenOS to a new platform, but most of the code base, even of the microkernel, is independent of the target platform. And this is just a very simplistic view of the user space. There is the microkernel at the bottom, and then we have the naming service, loader, task monitor, and init. These are still more or less very critical services; they are in the trusted computing base. Then we build gradually less and less trusted services on top of them, such as the file system stack, the device driver stack, the networking stack, and so on. By the way, our networking stack is also decomposed into components. It's not one huge TCP/IP everything library; there are individual tasks, individual applications, that take care of the transport layer, the link layer, the physical layer. Okay, so that was HelenOS generally speaking, and you are probably interested in our RISC-V port, because this is a RISC-V dev room. I did some initial experimentation in 2016, and I gave a talk about it in the microkernel dev room at that time. Initially, if you remember 2016, the privileged ISA specification of RISC-V was at version 1.7. There was no upstream toolchain support, so no GCC, no binutils upstream support, and the only usable emulator or platform was Spike. So it took me something like 18 hours to get some basic functionality.
That meant setting up the infrastructure, creating the directory structure for the platform-dependent parts, implementing our own bootloader (I just did not like the BBL, so I implemented my own), and then some initial virtual memory management setup and kernel handoff. Some observations from this: many things were very badly documented at that time. Many things, like the ABI or the I/O interface in Spike, basically needed to be reverse engineered from the source code of the tools and of the emulator. Some other details were still sketchy, like the memory consistency model at that time. But generally speaking, from my experience with other architectures, the architecture looked nice, maybe except for the strange compressed page protection fields, if you still remember that version of the privileged ISA. There weren't individual fields in the page table entries for user and supervisor read, write, and execute; instead there was this strange combined, compressed field. I mean, why not? It just slightly complicated our macros, and our abstraction could be fit to it, but it was just strange. Then I found some time to get back to it in 2017, when I implemented the basic kernel functionality, as we call it: basically everything that the kernel needs to actually work and hand over control to user space, meaning exception handling, context switching, atomic operations, and some basic I/O. The privileged ISA specification had moved to version 1.10, which I believe is still the most current one. There were some small improvements; for example, they removed those strange compressed page protection fields. The only usable emulator at that time was still Spike, and you know, the HTIF input device is a horrible design. I really cannot imagine how somebody could have come up with such a strange input device. It has no interrupts, which is perhaps understandable.
There is no platform interrupt controller defined in the specification. But if you do what you are supposed to do, you poll the device: you basically send it a command asking, is there a character available? And if not, you don't get a zero reply or something like "no character available". The request gets buffered. So just think about it: how would you normally poll this device other than periodically? But how do you get rid of these buffered requests when there is no character on the input? This is just a memory leak on the emulator side. I don't know. So the point, or maybe the moral of the story, is that there was still no reference platform, no decent specification of the platform beyond the CPU itself, that would provide some reasonable basic debugging I/O, a platform interrupt controller, and stuff like that. And yeah, if you are porting to something where the GCC toolchain has only just been upstreamed, you might encounter internal compiler errors. From time to time this just happens; I'm not blaming anybody. It was fixed in the next release of GCC, and honestly, I did not even spend much time debugging the compiler bug. I just removed the piece of code that was triggering it, waited for an update of the compiler, and then it worked. I'm not a compiler expert, so it would probably have taken me a lot of time. And now I have finally found some time to get back to it. This is what I've been doing for about eight hours this January. I decided to switch to the QEMU virt platform, which in my opinion is more reasonable than what Spike provides, because you have the platform interrupt controller there, you have a normal UART for serial I/O, and you can use VirtIO for networking and stuff like that. So this finally looks like a decent platform to support. And the toolchain, everything, is already upstream, so this looks pretty usable right now.
Okay, so what are the lessons learned from this admittedly brief experience? First, there was surprisingly little interest in porting HelenOS to RISC-V. Of course, you might say this is because nobody cares about HelenOS, but if you compare it to our previous porting efforts, to ARM, SPARC V9, SPARC V8, and other platforms, there was always a lot of interest, I would say. Either in the framework of Google Summer of Code or in the framework of master theses, there were students who would eagerly take HelenOS and port it to a new platform. And there was, I mean, no such interest with respect to RISC-V. The only work that was done was done by me in my very, very precious free time, and I'm a RISC-V enthusiast, but I just don't have the time to do it. So what are the reasons? I believe there are two. First, like I've already said, there is still no nice reference platform that would actually provide interesting features, more interesting than a serial console. Obviously, people nowadays are not interested in seeing Hello World on a serial console. They want HDMI, they want USB, they want networking, stuff like that. And that would be easier to achieve if there were an easily accessible, low-cost development board providing those features, something like a Raspberry Pi for RISC-V, with a powerful RISC-V CPU supporting supervisor mode, for a reasonable price. Yes, of course, you can have a SiFive board for, I don't know, $1,000, but that's too expensive. Once this is solved, and I would say this can be generalized, RISC-V would get much more attention and adoption by hobbyists, by students, by researchers. You know, it's really hard to explain, even to your boss, that this Hello World printout took you two weeks of porting or coding. And there's one other thing.
This is something I would like to spend the rest of my talk speaking about: there has been very little input, in my opinion, from the operating system folks into the RISC-V specification. And therefore, from the point of view of a microkernel operating system, RISC-V does not bring anything new to the table, which is a pity, because how many opportunities in your lifetime do you get to come up with a new instruction set architecture that might actually get some industrial traction, that might actually be adopted by the big players? How many times do you get the chance? So why do I think this is something that RISC-V might focus more on? The microkernel idea I have spoken about, you know, the fine-grained components, isolation, blast radius limitation, is definitely not a new idea; it has been around since at least the end of the 1960s. And it has actual benefits for safety, security, and dependability. Let me just skip to this slide. Most of the benefits were so far somewhat questionable, because they were more or less just qualitative. But now there is actual quantitative evidence that the benefits of the microkernel design are there. There has been a study published at a peer-reviewed scientific conference which analyzed critical vulnerabilities in Linux and examined how they would have been mitigated or prevented by a microkernel-based design. And you can read here: 40% of those vulnerabilities would be completely eliminated by an operating system design based on a verified microkernel, such as seL4 in this case. So there are huge benefits to having a proper software architecture at the system level. And obviously, this comes at a price. The price that we pay for these benefits is the performance overhead. There has been a huge effort over the last 25 years to make this performance overhead as small as possible.
I mean, the whole effort of the L4 people in various projects was to make this overhead as small as possible. But the overhead is still there, because, for example, if you would like to perform a single file-system-related operation, such as opening a file or reading a block of a file, in a microkernel multi-server design you need to talk to some location or naming service, then you need to talk to a virtual file system service, then you need to talk to the file system driver, and then potentially this file system driver will forward you to a block device driver, and so on. So you have the overhead of the IPC between those isolated components; the isolation has its cost. And the question is whether it has to be like that. I believe the cost is caused by the fact that CPUs have so far been designed with just the monolithic operating systems in mind. And again, I'm not blaming anybody. It is more natural, because designing a new CPU or instruction set architecture used to be a very complex task. It used to be very expensive, and people naturally designed new CPUs according to the requirements of the old CPUs. The operating system guys were then simply left to work with what they got, and the monolithic design performed much better on it. So let's switch gears, or let's reverse the process. Let's try to design CPUs with the requirements of the microkernels in mind, so that the CPU will be able to provide abstractions or instructions or mechanisms that will help the microkernels perform as nicely as the monolithic kernels, while keeping all the nice safety and security features. And I will present a few ideas. I won't go into too much detail, because first, we don't have the time, and second, I would rather spark a discussion than present something I'm sure is going to work. But these are a few ideas. So, for example, we would like to optimize the IPC itself. Where's the problem?
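To make that cost concrete, here is a tiny, purely illustrative C sketch, the service names and the ipc_call() helper are hypothetical, not the actual HelenOS API, that just counts the IPC round trips a single file open can take in a multi-server design, compared with the single syscall of a monolithic kernel:

```c
/* Illustrative sketch only: hypothetical names, not the HelenOS API.
 * Counts privilege/address-space crossings for one open() operation. */

static int ipc_hops;

/* Models one synchronous IPC round trip: kernel entry,
 * address space switch to the server, and back again. */
static int ipc_call(const char *server)
{
    (void) server;
    ipc_hops++;
    return 0;
}

static int monolithic_open(const char *path)
{
    (void) path;
    /* One syscall; VFS, fs driver and block driver are function calls. */
    return 1;
}

static int multiserver_open(const char *path)
{
    (void) path;
    ipc_hops = 0;
    ipc_call("naming");     /* resolve the VFS service */
    ipc_call("vfs");        /* route the request */
    ipc_call("fs_driver");  /* concrete file system server */
    ipc_call("block_dev");  /* underlying block device driver */
    return ipc_hops;
}
```

So the same logical operation pays for four kernel entries and address space switches instead of one, which is exactly the overhead the hardware ideas below try to attack.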
In a monolithic system, if one subsystem calls into a different subsystem, all it does is a normal function call. There is some passing of arguments, in registers or on the stack, and the code is free to pass direct pointers to data structures. So this is efficient, and also insecure, for all the reasons we have already discussed. In a microkernel multi-server design, you need to do the IPC, which means that you have to invoke some kernel syscall which will again pass some arguments in registers, but the set of registers you are allowed to use is naturally somewhat limited. There is the privilege level switch and the address space switch between those two components. If the IPC is asynchronous, there is some scheduling involved. And if bulk data needs to be transferred, the data needs to be copied between the address spaces, or some memory sharing needs to be established. So where could the CPU actually help? Why not design extended jump or call instructions, and of course also return instructions, that would also perform the address space switch? Imagine something like a call gate, which would basically be a calling capability identifying the target address space and the target program counter of the server's IPC handler. And this could be implemented at the hardware level, for example, as a cache, something like a TLB-like structure, populated by the microkernel. So the microkernel would still be fully in charge of deciding which client can call which server, but for the most frequent calls the mechanism would be streamlined by the hardware: the context switching and address space switching would be done in hardware. For the asynchronous IPC, where there is a need to somehow buffer the payload of the message, this could also be streamlined by the hardware by using cache lines as the buffers, so the data would not even need to go to main memory. Yeah, sorry.
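The call-gate cache idea can be sketched in software like this. Everything here is hypothetical, a model of a mechanism that does not exist yet, not real hardware and not HelenOS code: the microkernel installs gate entries, and an extended call instruction would perform the lookup and the address space switch on a hit, trapping to the kernel only on a miss:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of a hardware call-gate cache; all names are
 * illustrative. The microkernel stays in charge of the policy
 * (which client may call which server); the "hardware" only
 * accelerates the lookup and the switch. */

typedef struct {
    uint32_t client_asid;   /* who is allowed to call */
    uint32_t gate_id;       /* the calling capability */
    uint32_t server_asid;   /* target address space */
    uintptr_t entry_pc;     /* server IPC handler entry point */
    int valid;
} gate_entry_t;

#define GATE_CACHE_SIZE 8
static gate_entry_t gate_cache[GATE_CACHE_SIZE];

/* Kernel side: only the microkernel may install a gate. */
static void kernel_install_gate(size_t slot, uint32_t client,
    uint32_t gate, uint32_t server, uintptr_t pc)
{
    gate_cache[slot] = (gate_entry_t) { client, gate, server, pc, 1 };
}

/* "Hardware" side: what an extended call instruction would do.
 * A hit yields the target address space and PC with no kernel entry;
 * a miss (NULL) falls back to the normal kernel IPC path. */
static const gate_entry_t *hw_call_gate(uint32_t client, uint32_t gate)
{
    for (size_t i = 0; i < GATE_CACHE_SIZE; i++) {
        if (gate_cache[i].valid &&
            gate_cache[i].client_asid == client &&
            gate_cache[i].gate_id == gate)
            return &gate_cache[i];
    }
    return NULL;
}
```

The design point is the same as with a TLB: the common case is handled without software, while misses and revocation remain fully under microkernel control.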
I mean, I'm not saying it's different. I'm saying it goes beyond what has been done. Again, these ideas are not just floating in the air; they are more or less based on the optimization techniques applied by microkernel developers over the past 25 years, where they actually tried to squeeze as many CPU cycles out of the cores as possible. But my point is that if the CPU were more helpful, or provided more room for optimization, the overhead might be even lower, or maybe, ideally, zero. So I'm not saying it's different; we are talking about the same direction, but let's think about actual hardware mechanisms to make this more optimal. And generally speaking, pinning workloads to cores is possible, but we would also like to have efficient cross-core IPC. Okay, what about the bulk data? Again, this is not such a great issue nowadays, because many microkernels, such as HelenOS, implement memory sharing to efficiently transfer large amounts of data between address spaces. The problem is that this memory sharing needs to be established, and if this establishment, and possibly the teardown, happens too often, it causes overhead. Also, the data needs to be page-aligned, which is a minor trouble. So again, how about having an additional memory mapping mechanism in the CPU that would allow, for example, fine-grained mapping from virtual addresses directly to cache lines? Cache lines are pretty small, something like 64 or 128 bytes, so this would be ideal for ad hoc sharing of smaller data structures between address spaces. What about the problem of context switching? Most of the CPU optimizations in recent years have been targeting the problem of hardware latencies: we have caches to hide the nanosecond latencies of DRAM, and we have software caches like I/O buffers to hide the millisecond latencies of disk drives and SSDs.
But what about the microsecond latencies of context switching? There are mechanisms to address this, like hardware multi-threading, which is very effective at it. But usually, in current CPUs, you have just a fixed number of hardware threads, and you have to use software-based context switching to get more. So how about finding or designing a mechanism that would combine the benefits of both? Something like a hardware cache for thread contexts that would scale to a reasonable number of contexts, with dedicated instructions to store, restore, and switch those contexts. This could possibly be optimized for the ABI, because you don't necessarily need to save all the registers in all cases. And it would also help if combined with some autonomous mechanisms, for example, triggering the context switch by some event, like an external interrupt; this could again eliminate the round trip to the kernel and the context switch done in software. Obviously, we need to be careful, because we don't want to implement some kind of hardware scheduler or hardware scheduling policy; that would probably be disastrous. So again, we would just like to have the mechanism in the hardware, but let it be controlled by the software. And if combined with some kind of simultaneous multi-threading, this could very efficiently even mask nanosecond latencies, like those of the caches, in parts of the cores. User-space interrupt processing: if we have user-space device drivers in a microkernel-based operating system, we always have this unpleasant round trip through the kernel, where each interrupt needs to be first handled by the kernel, which then generates some IPC message, which is then forwarded to the user-space driver. Why? There could be a mechanism to deliver the event directly to the user-space device driver.
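The thread-context cache idea can also be modeled in a few lines of C. Again, this is a hypothetical sketch of a mechanism that does not exist, with made-up names: a small set of on-chip slots holds thread register state, dedicated "save" and "restore" operations hit the cache in the common case, and only an overflow would spill state to memory:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a hardware thread-context cache.
 * NREGS and CTX_SLOTS are toy sizes for illustration only. */

#define NREGS 8
#define CTX_SLOTS 4

typedef struct {
    uint64_t regs[NREGS];
    int tid;                /* owning thread id, -1 if the slot is free */
} ctx_slot_t;

static ctx_slot_t ctx_cache[CTX_SLOTS];

static void ctx_init(void)
{
    for (int i = 0; i < CTX_SLOTS; i++)
        ctx_cache[i].tid = -1;
}

static int ctx_find(int tid)
{
    for (int i = 0; i < CTX_SLOTS; i++)
        if (ctx_cache[i].tid == tid)
            return i;
    return -1;
}

/* Models a dedicated "save context" instruction. */
static void ctx_save(int tid, const uint64_t *regs)
{
    int slot = ctx_find(tid);
    if (slot < 0)
        slot = ctx_find(-1);    /* take a free slot */
    if (slot < 0)
        slot = 0;               /* simplistic eviction on overflow */
    ctx_cache[slot].tid = tid;
    memcpy(ctx_cache[slot].regs, regs, sizeof(ctx_cache[slot].regs));
}

/* Models a dedicated "restore context" instruction.
 * A hit means no memory traffic; a miss (-1) would fall back
 * to the software-managed, memory-based context. */
static int ctx_restore(int tid, uint64_t *regs)
{
    int slot = ctx_find(tid);
    if (slot < 0)
        return -1;
    memcpy(regs, ctx_cache[slot].regs, sizeof(ctx_cache[slot].regs));
    return 0;
}
```

Note that the model deliberately contains no scheduling policy: which thread to switch to, and when, stays entirely in software, exactly as argued above.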
Of course, there is the usual pain point of level-triggered interrupts, but again, I believe there are ways this could be handled, for example, by automatically masking the interrupt source in the platform interrupt controller in those cases. And this would not only lower the overhead of having device drivers in user space, but it might also finally fix the single remaining architectural flaw of current microkernels, namely that there still need to be some device drivers in kernel space, for example, the driver for the timer. With this, the timer driver could be pushed out of the microkernel, and even the scheduler could be completely pushed out of the microkernel, with just a little help from the CPU. Okay, final topic. I won't spend too much time on it, because honestly, I don't have very clear ideas about how this could be done, but think about a RISC-V 128-bit architecture. I don't believe that having 128-bit flat pointers is very useful, maybe in some situations, but it's such a huge address space that there is little practical use for it as a flat space. But what about logically dividing the pointers into 64-bit object identifiers, or object capabilities in the parlance of microkernels, and 64-bit offsets? That would allow the hardware to do very efficient bounds checking, to make sure that only the owners of the objects have access to them, and stuff like that. Maybe in this particular case this won't help the microkernel so much, because microkernel capabilities are more about resource management than physical bounds checking, but this could also be useful, or maybe even more useful, for some managed language VMs, because they wouldn't have to implement the bounds checking in software; there would be a hardware mechanism for that.
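The 128-bit pointer split can be illustrated with a small C model. To be clear, this is a sketch of the idea only, the types and the check are invented for illustration, and the bounds check that is simulated here in software is exactly what the proposal would push into the hardware on every dereference:

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit pointer split into a 64-bit
 * object identifier (an object capability) and a 64-bit offset. */

typedef struct {
    uint64_t object_id;     /* which object this pointer names */
    uint64_t offset;        /* offset within that object */
} fat_ptr_t;

typedef struct {
    uint64_t id;            /* object identity */
    uint64_t size;          /* object length in bytes */
    uint8_t *base;          /* backing storage in this simulation */
} object_t;

/* Models a hardware-checked load: the access succeeds only if the
 * pointer names this object and the offset is within bounds;
 * otherwise the hardware would raise a bounds fault (-1 here). */
static int checked_load(const object_t *obj, fat_ptr_t p, uint8_t *out)
{
    if (p.object_id != obj->id || p.offset >= obj->size)
        return -1;
    *out = obj->base[p.offset];
    return 0;
}
```

Pointer arithmetic then only ever touches the offset half, so an overflow past the end of an object is caught at the access, not silently turned into access to a neighboring object.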
Okay, so these were just some ideas. To convince you that this is not just a total pipe dream, there has been some prior work that points in this direction. For example, there is a paper from 2005 that evaluated offloading some microkernel operations to hardware. They implemented some modifications on an FPGA-based softcore, and the evaluation showed something like a 15 to 27 percent performance improvement. This was based on offloading, I would say, coarse-grained functions, such as thread creation, to the hardware; I'm talking about much finer-grained ideas here. And hardware message passing has actually been implemented in some hardware devices, so why not push it into the mainstream? Regarding hardware support for different address spaces, many of you have probably seen the paper about the SpaceJMP programming model, which showed, on the Barrelfish multikernel and on DragonFly BSD, that this could be useful for different kinds of applications, for example data-centric applications. And if you are as old as I am, you might remember the task state segment on IA-32, which was a kind of hardware context switching mechanism. It actually still is there, in the chip that most of us have in our laptops, and it was even used by Linux. The performance was not poor; it was quite comparable with the software-based mechanism at that time, and the reason it was removed from the code base was portability, reasons other than performance. So maybe there is still a good chance to revisit this idea.
About cross-address-space calls: again, there is prior art that has actually been available in most of the x86 chips we have, called VM functions (VMFUNC), which is basically the very same idea. It provides a way to make efficient calls from one VM in the VT-x domain to another VM, with some fixed number of entries that can be used so far. And it has been shown by some of my colleagues that this can really be used to take a monolithic binary, like, I don't know, a web server with an SSL library, and split it into very fine-grained components, I'm really speaking about very fine-grained components, like individual functions, and put them into separate VMs. So, for example, the business logic of the web server would stay in the original VM, the encryption and the cryptographic key management would be pushed out to a different VM, and you would connect them using this VMFUNC instruction. The cost of this separation is comparable to a single syscall, so it's not terribly bad for performance. And again, this is just repurposing a mechanism that wasn't designed for the purpose I have been talking about, so what about a mechanism that would be specifically designed to help microkernels? I'm looking forward to a new paper about the SkyBridge mechanism by my colleagues, which should appear at EuroSys in a few months. About the capabilities: there's not much, but there is again a paper about hardware-based bounds checking, implemented as an extension to the 64-bit MIPS ISA, where there were basically 32 bounds registers, or capability registers, which contained the information about where an object's boundary is, and this was checked by the hardware. Again, the performance, evaluated on an FPGA softcore, was very nice. The limitation in this case was the inflexible design; if this were done more flexibly, let's say thanks to the fact that we could have
128-bit pointers, so that we wouldn't have to have dedicated bounds or capability registers, but the bounds would be encoded in the pointer itself, then why not use it? And actually, Intel MPX also goes more or less in the same direction. Okay, so that's probably all from me. I might work on the RISC-V port of HelenOS in the future, but if you are interested, feel free to drop me a message. I would say that this is really a great opportunity for everybody in the RISC-V community to help software move from, you know, the poor, flawed monolithic architecture to the microkernel multi-server architecture, and get all the benefits that we know are there, without the performance penalty. And a final note: thanks to some of my colleagues who have contributed their ideas. Also, if you would like to work on this practically, we are opening a new R&D lab in Dresden. The lab will focus primarily on microkernel development, but we would like to have a very well-balanced mix between, you know, basic research, which is the topic I have spoken about, and, let's say, more practical stuff. We are starting from scratch, so we are like in a startup mode within a company. And since Huawei owns HiSilicon, which is one of the ARM chip producers, we will actually be able to talk to the hardware guys, and maybe there will be actual, physical, tangible results out of it. Thank you, and if there are any questions, I will be happy to answer them. I mean, we are trying to do it ourselves, yes. Yes, that's the summary of my talk: we can do it nowadays. So the question was, if I can rephrase it, why aren't we doing it already, right? We are trying to do it right now, because I think this is exactly the right moment in time to do it. Thank you. Yes, go ahead. I believe we can manage. So the question is whether it wouldn't be helpful to rephrase those ideas and questions in terms of virtualization and stuff like
that. Definitely, it's possible, because I believe there is essentially no difference between the microkernel architecture and the hypervisor. Most of the microkernels that are being used practically, seL4, QNX, PikeOS, also act as hypervisors, so that would definitely make sense, and we are actually doing the same in our company. Okay, so let's talk afterwards. Thank you very much again.