So, shall we start? I already did this in the RISC-V devroom, actually. Yes, thanks for the introduction, and thank you all for coming. Most of this talk will be about the same topic I have already spoken about in the RISC-V devroom, so if you have seen that talk, there is no reason to see it again, unless you would like to discuss it. And again, as usual, this is more or less an opinion piece meant to spark some discussion, so please go ahead and disagree with me. For those who don't know me: I have been working in the operating systems domain for some time. I have been working on the HelenOS project since 2004, I have worked at Charles University in Prague on the formal verification of HelenOS, and quite recently, not longer than two years ago, I joined Huawei, where I am doing the same kind of work: microkernels, formal verification, and so on. So, I would like to tell you that microkernel multiserver systems are better than monolithic systems. That's it, thank you. No, really, seriously.
This has been, I would say, an informed opinion of many people for many, many years, and gradually we got some qualitative evidence that the statement is true. But qualitative evidence is still basically just a form of opinion. Now we are gradually also getting quantitative evidence that the monolithic operating system design is flawed. You have probably noticed the paper co-authored by our friend Gernot Heiser, whom we had the privilege of seeing today. This paper looked at several critical vulnerabilities in the Linux kernel and tried to estimate whether these vulnerabilities would have been mitigated by a state-of-the-art microkernel design such as seL4. You can read the summary here, or the entire paper, and it is pretty convincing. So this is one piece of evidence that we are going in the right direction with microkernels. But, and this brings me back to my original talk in the FOSDEM microkernel devroom in 2012, we are paying a price: the performance overhead. If you were there, you may remember that I said it is a fair price, in my opinion. There is no free lunch; the safety, security, availability, and other guarantees that a microkernel provides are counterweighted by some performance overhead that we need to pay. But is it really necessary? Is it really unavoidable? Let's look at this. The microkernel ideas are not particularly new.
The earliest incarnations appeared as early as 1969, but there was, and still is, some disconnect between software design and hardware design. Designing hardware used to be complicated and expensive; it usually required a huge company to back it. Therefore, operating systems were written for CPUs only after the CPUs were out, not before; even powerful emulation tools like QEMU were not always available. So hardware designs got stuck in certain ways. Really, try to think of something revolutionary in hardware design that was not already in the IBM System/370: memory management was there, virtualization was there, the IOMMU was there, offloading computation to dedicated hardware devices, the so-called data channels (or whatever the exact name was, I don't remember), was there already. There is nothing new under the sun, and the problem is that microkernels suffer because current hardware is designed with monolithic kernels in mind. Therefore we pay performance penalties due to the fine-grained design, due to the need to cross address space boundaries, due to the IPC mechanisms, and so on. So let's try to change that.
Let's try to design the hardware in a better way, or at least in a way that is more suitable for the requirements of microkernel systems. I really think there is a vicious cycle here: CPUs currently do not support microkernels properly; therefore microkernels suffer performance penalties when running on them, compared to monolithic systems; therefore microkernels are still not considered true mainstream. Yes, we have them in safety-critical and mission-critical devices, we have them in some embedded devices, and you can have Mach running on a Mac, but just as a single-server microkernel. Still, we keep talking about Linux, and about making Linux more secure, which is crazy; the only way to make Linux more secure is to throw it out. And since microkernels are still not mainstream, there is no strong push on the hardware manufacturers to actually provide CPUs with proper microkernel support, and that closes the vicious cycle. Well, a lot of effort has been spent on one part of this cycle already: for the past 25 years, people have been trying to squeeze every single CPU cycle out of their microkernel code to make the IPC run as smoothly and as quickly as possible, given the limitations of the hardware they had. We have been trying this too; this is the reason why we are meeting here. So now let's focus on the other part: on the requirements on the hardware, and on creating better hardware to support our microkernels, to finally get rid of this trade-off between safety and performance, security and performance. I have some ideas.
I have to say these are very rough ideas, and again, this is something where I would like to spark a discussion, to inspire people to think about the actual mechanisms that could make this happen. The ideas target the obvious culprits: IPC, context switching, and so on. First, the problem with IPC in microkernel multiserver systems: the finer-grained the architecture, the better for safety, security, availability, and dependability, but the more we pay due to address space crossings and the need to move data between address spaces. I probably don't need to explain the problem in much detail. Compared to monolithic kernels, where communication between subsystems is just a function call, in a microkernel multiserver system the same communication is implemented via IPC. That means we cannot use all the registers for passing arguments, because some registers are reserved for something else; we need to switch to the kernel privilege mode, switch the address space, and then switch back. We potentially need to do some scheduling in between if the IPC is asynchronous, although that is not always necessary. And if we are moving larger amounts of data, we either need to copy it between address spaces or establish some kind of memory sharing, which again can be a little costly. So what can we do about this?
One thing that would probably be quite simple is implementing richer call or jump instructions: instructions that switch the address space by themselves, so that we save at least a single kernel round trip in the case where the only thing the kernel actually does is change the address space. This could be done by the CPU, and it could be as simple as switching the current address space identifier. Of course, this still needs to be just a generic, basic mechanism; I am not proposing to move policy from the operating system into the CPU, that would be crazy. How could it be implemented? By having something like a call gate that is cached in some kind of hardware cache, something like a TLB. The first time the call happens, it obviously traps into the kernel; the kernel checks the permissions, capabilities, and so on, and sets up an entry in this hardware cache, and subsequent calls are then handled purely by the CPU. I believe this could be really very simple. Regarding asynchronous IPC, where there is some need for buffering of messages, I also think this could be optimized. Even nowadays, as somebody already mentioned at my talk yesterday, message buffering and message passing can be optimized by making sure that the messages do not get thrashed out of the cache lines. That's fine, but I would imagine that the CPU could do it even more intelligently.
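To make the call-gate idea concrete, here is a minimal software model of it. This is only an illustrative sketch, not a proposed ISA: the table layout, the slot count, and all the names (`gate_call`, `gate_fill`, the ASID widths) are assumptions of mine. The point is just the division of labor: a miss traps to the kernel, which validates the call and fills an entry; a hit switches the address space identifier without involving the kernel at all.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a call-gate cache: a small, TLB-like table
 * mapping (caller ASID, entry point) to a target address space.
 * On a hit the CPU switches the current ASID itself; on a miss it
 * would trap to the kernel, which checks its capability tables and
 * installs an entry. All sizes and names are illustrative. */

#define GATE_CACHE_SLOTS 16

typedef struct {
    bool     valid;
    uint16_t caller_asid;   /* address space allowed to use this gate */
    uint64_t entry_point;   /* target virtual address of the gate     */
    uint16_t target_asid;   /* address space to switch into           */
} gate_entry_t;

typedef struct {
    gate_entry_t slots[GATE_CACHE_SLOTS];
    uint16_t     current_asid;
    unsigned     hits, misses;  /* fast path vs. kernel traps */
} gate_cache_t;

/* Fast path: what the hardware would do on every cross-space call. */
bool gate_call(gate_cache_t *gc, uint64_t entry_point)
{
    for (size_t i = 0; i < GATE_CACHE_SLOTS; i++) {
        gate_entry_t *e = &gc->slots[i];
        if (e->valid && e->caller_asid == gc->current_asid &&
            e->entry_point == entry_point) {
            gc->current_asid = e->target_asid;  /* switch, no kernel */
            gc->hits++;
            return true;
        }
    }
    gc->misses++;   /* here the real CPU would trap into the kernel */
    return false;
}

/* Slow path: the kernel, after checking capabilities, installs a gate. */
void gate_fill(gate_cache_t *gc, size_t slot, uint16_t caller,
               uint64_t entry, uint16_t target)
{
    gc->slots[slot] = (gate_entry_t){ true, caller, entry, target };
}
```

Note that the policy (which gate entries may exist at all) stays entirely in `gate_fill`, i.e. in the kernel; the hardware only caches its decisions, which is exactly the mechanism/policy split argued for above.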
Basically, the CPU could use the cache lines themselves as fixed-size buffers for the messages. And it is not a problem that they are fixed size, because in most microkernels I have seen that use asynchronous IPC, the kernel buffers are also fixed size, for the obvious reason that user space must not be able to exhaust kernel memory. So again, there is a clear separation between the mechanism, which could be very efficient, fast, and lean, and the policy, which obviously still stays in software. If you remember SPARC V9, this reminds me of the register stack engine they have there; or was the register stack engine IA-64, Itanium? Well, SPARC has something similar. Now, how about bulk data, if we really need to move a lot of data between processes or tasks? Currently the best optimization we have is memory sharing, which actually works quite fine. The only problem is that the sharing needs to be established and torn down, and if that happens too often, it causes a performance penalty. Also, the data needs to be page-aligned, so it is not really useful for sharing scattered data structures. It is fine when you need to share blocks of data to be written to or read from a block device driver, but it is not very useful for graph structures, trees, and so on. So again, an idea: something that could be done is a new, simple layer of hardware-based memory management that maps virtual addresses to cache lines, because a cache line is usually something like 64 or 128 bytes, which is a much more reasonable granularity for these scattered data structures. And of course, again, we need to sit down and create a model.
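A model of that line-granular translation layer might start out as simple as the sketch below. Everything here is an assumption to be evaluated, which is precisely the point: the slot count, the fully associative lookup, the 64-byte line size, and the names (`line_map`, `line_translate`) are all placeholders for parameters the evaluation would have to pin down.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an extra translation layer that maps virtual
 * addresses to cache lines rather than pages, so sharing can be
 * established at the granularity of a single node of a scattered
 * data structure. A small, fully associative buffer; illustrative only. */

#define LINE_BITS   6           /* 64-byte cache lines */
#define XLATE_SLOTS 8

typedef struct {
    bool     valid;
    uint16_t asid;              /* address space owning the mapping */
    uint64_t vline;             /* virtual address >> LINE_BITS */
    uint64_t pline;             /* backing physical line number */
} line_map_t;

static line_map_t xlate[XLATE_SLOTS];

/* Install a line-granular mapping (the kernel would do this on a miss). */
void line_map(size_t slot, uint16_t asid, uint64_t vaddr, uint64_t pline)
{
    xlate[slot] = (line_map_t){ true, asid, vaddr >> LINE_BITS, pline };
}

/* Translate one access; false means a miss, i.e. fall back to the
 * ordinary page-based translation or trap to the kernel. */
bool line_translate(uint16_t asid, uint64_t vaddr, uint64_t *paddr)
{
    for (size_t i = 0; i < XLATE_SLOTS; i++) {
        line_map_t *m = &xlate[i];
        if (m->valid && m->asid == asid && m->vline == (vaddr >> LINE_BITS)) {
            /* keep the offset within the line, swap the line number */
            *paddr = (m->pline << LINE_BITS) |
                     (vaddr & ((UINT64_C(1) << LINE_BITS) - 1));
            return true;
        }
    }
    return false;
}
```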
We would need to evaluate it, implement it in an emulator, to see how well it performs: what the parameters should be, what the size of this translation buffer should be, and so on. But I really believe there is a possibility to make this work.

Context switching: we have somehow avoided parts of the context switch in the case of IPC, but the problem remains that in a microkernel multiserver system there are more active processes, more active tasks, than in a monolithic system, so there will still be context switching. And all that our hardware currently does is basically mask latency. We have very efficient mechanisms for masking nanosecond-scale latency: the caches. We have quite efficient mechanisms for masking millisecond-scale latencies: the I/O buffers. But the context switch sits precisely in the middle, on the order of microseconds, and we have really nothing to mask that latency. Well, I wouldn't say we have nothing: there is a hardware mechanism quite often used, which is multithreading. This is precisely why we have hardware multithreading, to make sure that the ALUs, or other parts of the CPU, always have something to do despite data dependencies or waiting for data. But this does not scale to many threads; CPUs usually have just a couple of hardware threads. And we can do context switching in software. So how about combining the two: having something like hardware support for an unlimited number of execution contexts, where the most frequently used ones would be cached in some hardware cache, with dedicated instructions, ISA extensions, to operate on these hardware contexts efficiently? Again, this would keep the scheduling policies mostly in the operating system, but the physical mechanism of quickly switching to a different workload, when the current workload is blocked at the hardware level because it is waiting for some data, could be done autonomously. We could even think about connecting other external event triggers to this, like interrupts or exceptions, and that would allow us to do even more. I believe I have a slide about this here. Yes: that would allow us to do, very simply and very elegantly, purely user-space interrupt processing. Currently, interrupts always trap into the kernel, and in a microkernel environment the kernel generates some kind of IPC message that is then forwarded to the user-space driver. If we can do fast context switching in hardware, it is just a single step further to extend it to interrupt delivery directly to user-space drivers. That would not only make some things faster; it would also allow us to get rid of polling when dealing with very latency-sensitive devices, where even in a monolithic system the interrupt processing can be so expensive that polling, despite being stupid, is more efficient. And it would also resolve the final compromise regarding the elegance of the microkernel design: the fact that we still need some device drivers in the microkernel, like the timer driver. With direct delivery of timer interrupts to a user-space timer driver, we would not need any timer driver in the microkernel, and we could possibly even move the scheduler out of the microkernel, which is something like a holy grail for many people. Yes, something would need to be done about level-triggered interrupts, the usual pain point; again, I would say there is some possibility of integration with the platform interrupt controller, so that the source of a level-triggered interrupt is autonomously masked when it fires and there is no issue with endless reassertion.

Capabilities. I have to admit this is really a stretch: I did not find many useful ideas in my head about what could be done with capabilities at the hardware level, but at least something. If we just consider the narrow use case of capabilities as object identifiers: the microkernel would always need to be in charge of making sure that the methods called on a capability are permitted for the holder of that capability, and I was not able to think of any elegant hardware mechanism to avoid that. But at least for the actual access to the object, the capability ID, the capability reference, could be embedded within the pointer itself, and the hardware could then autonomously check whether access to the given object is allowed in the current context. If you think about the 128-bit variant of RISC-V, you could wonder what the actual use of 128-bit-long pointers would be; I am not sure that a flat 128-bit address space is really so useful. Maybe it is, maybe I am wrong, but we could easily divide the pointer into 64 bits for the object offset and 64 bits for the capability reference, and this could work quite elegantly. By the way, this would probably be even more useful for managed languages, Java, .NET, and so on, because the VMs running this managed code always have to do a lot of bounds checking on objects; if that could be offloaded to the hardware, it would probably help them a lot. Okay, those are some ideas; do you have something to add to this? Yes, please. Yes, and I believe I have it here.
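The 64/64 pointer split suggested above can be sketched in a few lines. This is a hypothetical model, not any existing ISA: the structure names, the capability table, and the check performed in `cap_access_ok` are my assumptions about what the hardware would do on each access, namely resolve the capability reference and verify bounds and permissions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of a 128-bit pointer split into a 64-bit
 * capability reference and a 64-bit offset into the object, as
 * suggested for a 128-bit RISC-V variant. The hardware would do
 * this check autonomously on every access; here it is plain C. */

typedef struct {
    uint64_t cap_ref;  /* upper 64 bits: capability identifier    */
    uint64_t offset;   /* lower 64 bits: offset within the object */
} cap_ptr_t;

typedef struct {
    uint64_t length;   /* object size in bytes */
    bool     writable;
} capability_t;

#define CAP_TABLE_SIZE 4

/* The per-context capability table; filled by the kernel. */
static capability_t cap_table[CAP_TABLE_SIZE];

/* Check an access the way the hardware would: resolve the capability
 * reference, check permissions, check bounds. */
bool cap_access_ok(cap_ptr_t p, uint64_t access_len, bool write)
{
    if (p.cap_ref >= CAP_TABLE_SIZE)
        return false;                           /* dangling reference */
    const capability_t *c = &cap_table[p.cap_ref];
    if (write && !c->writable)
        return false;                           /* permission check */
    return p.offset + access_len <= c->length;  /* bounds check */
}
```

The same check is what a managed-language VM does in software on every array access today, which is why offloading it could help there as well.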
So, you might come to me and say these are just wet dreams, this is not going to happen. What I am now going to present, unless there are other comments or objections, are some cases of prior art that lead in the same direction. I would like to convince you that it is possible to do something like this; it is even possible to do some of it with the hardware we currently have. So imagine the possibilities if we can actually change the hardware; let's think out of the box. The first reference is a basic, rather old paper about offloading some microkernel functionality to hardware. This was done by modifying a soft-core FPGA CPU, and they moved complete operations, like thread creation and context switching, into hardware. The context switching is more or less in line with what I have suggested; thread creation is probably too heavyweight, I would say. Nevertheless, they measured a reasonable performance improvement, something like 15 to 27 percent. And speaking about ways the hardware could optimize IPC, this has also been done in practice, in the wild, on massively parallel architectures: a lean hardware mechanism for efficient message passing, connected to a reasonable software abstraction. So this could probably work. About address space switching, there is an interesting paper from the Barrelfish people about SpaceJMP, which is basically a programming model where a single process uses multiple address spaces at once. In that case they were not targeting a performance improvement; they were exploring what the possibilities and benefits of such a programming model would be for, let's say, data-centric applications.
So this is not directly relevant to what I have been talking about, but it shows such approaches exist, and they were able to implement this on Barrelfish and also in DragonFly BSD, so it did not require huge modifications to the kernel abstractions. And if you are old enough, like me, you might remember that if you run an x86 CPU in 32-bit mode, or even in 16-bit mode, you can still have the task state segment, which is basically hardware-based context switching. It does not have a dedicated hardware cache; it just uses regular memory for storing the context, but still, performance-wise it was competitive with the software-based approach. Even the Linux kernel used this mechanism previously, and they stopped using it not because of the performance, but because they wanted a more portable approach. Yes, but that limit was probably just artificial: the mechanism was based on selectors, so the global descriptor table could only hold a certain number of entries, something like 16K or so. But that is a technicality. I am not mentioning it as the way we should do it; I am mentioning it because it has been done, so let's have a look at it and improve it. Thank you. Okay, about cross address space calls: there has been a quite nice paper by some of my colleagues from Huawei, who used the VMFUNC instruction, the VM functions extension to Intel VT-x.
It is a mechanism that allows you to basically do cross-VM calls: you set up some call gates and then use a single instruction to pass the registers from one VM to the other. It does the whole address space switch, meaning it switches the extended page tables, at the hardware level. The paper contains an evaluation where they took a rather complex application, a web server using the OpenSSL library, separated some of the encryption functions, individual function calls, from the rest of the binary into a dedicated VM, and used the VMFUNC instruction to turn a normal function call into a cross-VM call. Their performance evaluation was quite interesting, because the VMFUNC call was only about as costly as a single system call. It was more costly than a plain jump or call, but not hugely so, and definitely cheaper than going to the kernel or the hypervisor and making the address space and VM switch in software. So again, if we think about this mechanism in more detail and try to improve it, maybe it could really be helpful for microkernels. Actually, my colleagues are working on a proposal in that area, and they have a paper accepted at EuroSys this year, so if you are interested, have a look.
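The shape of the VMFUNC mechanism described above can be modeled very simply. As far as I know, VMFUNC function 0 is EPTP switching: the hypervisor pre-approves a list of extended-page-table pointers, and the guest can switch among them with one instruction, without a VM exit. The sketch below models only that shape; the struct, the list size, and the function names are illustrative assumptions, not Intel's interface.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of VMFUNC-style EPTP switching: the hypervisor
 * pre-approves a list of address spaces a domain may switch to, and
 * the guest then switches among them with a single "instruction",
 * never exiting to the hypervisor. Names and sizes are illustrative. */

#define EPTP_LIST_SIZE 4

typedef struct {
    uint64_t eptp_list[EPTP_LIST_SIZE];  /* filled by the hypervisor */
    size_t   list_len;
    uint64_t current_eptp;               /* active address space */
    unsigned vm_exits;                   /* falls back to the hypervisor */
} domain_t;

/* Modeled fast path: an in-range index switches the address space
 * directly; an invalid one causes a VM exit. */
bool vmfunc_switch(domain_t *d, size_t index)
{
    if (index >= d->list_len) {
        d->vm_exits++;            /* invalid index: trap to hypervisor */
        return false;
    }
    d->current_eptp = d->eptp_list[index];
    return true;                  /* switched without hypervisor help */
}
```

Structurally this is the same pattern as the call-gate cache earlier in the talk: privileged software pre-approves a small table, and the hardware consults it on the fast path.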
This is what was mentioned earlier: the CHERI capability model, an evaluation of how capabilities could be implemented at the hardware level. Again, this was evaluated on an FPGA soft core, and the performance evaluation was very positive; it allowed them to have basically byte-granularity memory protection. The limitation is that they used 64-bit MIPS, so they had to squeeze the bounds and starting addresses in somewhere; their obvious decision was to have dedicated capability registers as an extension to the MIPS ISA, which they themselves admit is not so flexible. So how about using 128-bit pointers and embedding the capability identifiers directly in them? Actually, if you look at Intel MPX, it is a similar idea that has already been implemented in some of the newest Intel CPUs. According to what I have read, the implementation is not so great: the performance benefit compared to software-based bounds checking is very minor, and the overhead of setting the whole thing up is not good. But if even Intel does it, why not try harder? Okay, so to sum up: I really think that we, as the microkernel community, have done a lot of work, first explaining to people that software dependability, computer system dependability, safety, security, and things like that are important, and that those goals cannot be achieved with a poor software architecture like a monolithic one. This applies not only to operating systems.
It obviously applies everywhere, see microservices. But we have always struggled to explain to people that they have to pay some price for these assurances. It is funny that when there are vulnerabilities such as Spectre or Meltdown, suddenly everybody accepts a five, ten, fifteen percent performance slowdown just to get back the assurances we thought we already had. But when we propose that we can have more assurances, safer systems, if we just pay a small price for it, we are routinely rejected. So let's think out of the box and design our hardware in such a way that nobody can complain anymore. I also need to mention my colleagues from Huawei who have contributed to the ideas I have presented. And if you would really like to do something about this practically: I am opening a new R&D lab at Huawei, located in Dresden, and obviously the location was not chosen randomly. We would like to have a balanced mix of basic research, something like what I have presented here, about 40 percent, and practical development. We won't be making products; we will remain an R&D unit, but we will obviously try to contribute to our product lines, which is also good, because we have, or should have, clear requirements coming from our products, and our company produces a lot of hardware. If you would be interested in working on this, please contact me by any means. One side note: we own HiSilicon, which is one of the major ARM chip producers, so we have the possibility to actually change the hardware. That's it.
Thank you. And if there are any questions, yes, please. [Audience comment:] There were earlier attempts, from Berkeley and others, to try object support in hardware, and most of that stuff had been implemented in microcode, so it was rather slow, but flexible. Now, why did these approaches fail? I think Moore's law killed them, because regular general-purpose processors got fast enough. We are no longer in that situation, so this might be the point in time where it is actually worth it. Yes, thank you for the comment. Just to quickly summarize for the stream: the idea is that we are at precisely the right moment in time to do something like this, because Moore's law no longer applies, so we need to do something else to improve performance, generally speaking. And I would add to your comment that we have RISC-V now, and this is a huge opportunity to create a totally new, open, modular hardware architecture that might actually gain industrial traction. So let's take the opportunity. Not that I would be against it otherwise, on principle; on the other hand, we have a full ARM architecture license, so we could even change ARM if we wanted. Thank you.