So please let me welcome our last speaker for today. Welcome. It's my pleasure to present the last talk of this year's microkernel devroom.

Thank you. My name is Matthias Lange, and I work for a German company called Kernkonzept. Today I want to make a case for why hardware engineers and operating systems engineers should talk more to each other. When I came up with the idea for this talk, we didn't know about, or didn't expect, the Meltdown and Spectre disclosures in early January, so this talk is not going to be about those. Instead, I want to share three stories we experienced during the development of our microkernel system, the L4Re microkernel, and give you a glimpse of the different issues you may run into in the interaction between hardware and operating systems, and especially of properties implemented in hardware that hit microkernel systems hard.

The first story is about really broken hardware design. The example here is the i.MX6, an ARM SoC developed by Freescale. It's a pretty old one by now, a Cortex-A9 design, and you can have it in dual- and quad-core configurations. One of the nice features of this SoC is a built-in CAN controller, which is mainly useful in an automotive context. So one day a customer came and asked us: please write us a driver for this CAN controller. You do the typical thing: go to the internet and download the reference manual of roughly 6,000 pages. You look in the table of contents, go to the CAN controller's initialization and application sections, and you start reading. At some point, of course, you stumble over the register definitions, because that is the interface you, as a driver developer, need to speak to actually configure the controller and to get data to and from the CAN bus.
The first register is the so-called Module Configuration Register. Sounds good, doesn't it? You go through all the bits defined there and try to make up your mind about what is actually useful and what you probably need to set or read to configure the controller properly from your driver. Then you see bit 23, which is called SUPV. So you wonder: what is that? You look into the table where all these bits are, luckily, described. It says: this bit configures some of the FlexCAN registers to be accessible in either supervisor or user mode. The reset value of this bit is one. And there is a little more text saying that access to certain registers may be restricted according to the setting of this bit, but you never find anything that defines which registers those are, or how access to them is restricted.

So you say, OK, maybe we start reading again, but we really cannot make anything of it. Let's try a hack: hack something into the kernel, which is running in supervisor mode on this platform, and see what we find. We disabled the bit in the kernel and made sure that our driver read back the SUPV bit as zero, so now we should be safe, right? We should have access to all the bits and pieces on the controller, and we can develop our driver. So we try to read the next register, and it completely hangs the system. You try again, because maybe you've done something wrong; you add debug code, try a lot of things, mostly trial and error. And at some point you discover: what you are getting here is an asynchronous external data abort, which is an indicator that you are unable to read from this device. We then had an unfortunate situation with our system configuration, which converted this external data abort into a page fault that was sent to our driver component. The region manager then thought: OK, there is something mapped there, so everything is fine.
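The kernel hack can be sketched roughly like this. The bit position comes from the reference manual, but the function name and the way the base address is passed in are made up for illustration:

```c
#include <stdint.h>

/* Minimal sketch: clear the SUPV bit (bit 23) of the FlexCAN Module
 * Configuration Register so that the controller's registers are, in
 * theory, accessible from user mode. MCR_OFFSET and the bit position
 * follow the i.MX6 reference manual; everything else is illustrative. */

#define MCR_OFFSET  0x00u
#define MCR_SUPV    (1u << 23)   /* reset value 1: supervisor-only access */

void flexcan_allow_user_access(uintptr_t base)
{
    volatile uint32_t *mcr = (volatile uint32_t *)(base + MCR_OFFSET);
    *mcr &= ~MCR_SUPV;           /* request user-mode accessibility */
    /* On the i.MX6 this alone is not enough: unless the CSU also grants
     * non-secure user-mode access, the next register read still raises
     * an asynchronous external data abort. */
}
```

As the story shows, clearing this bit and reading it back as zero told us nothing about the access control that actually mattered.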
It resumed operation, the external data abort immediately hit again and was converted into a page fault, so we were basically in an endless loop, and the system was hanging. Once we had found that out, we tried to see whether we had access to all the registers from the kernel, and no, we were out of luck there as well. So back into the manual. You start reading from chapter one, trying to understand what it is telling you, and you read something about a security architecture, and then you finally stumble upon an acronym called CSU. For the Germans among us: it's not the famous Bavarian party, it stands for Central Security Unit. You wonder: what is the Central Security Unit? Sounds heavy. You go to the corresponding chapter, and it says: "Instances of hackers and pirates breaking into portable devices and stealing private information and copyright content are becoming increasingly common." End of citation. And then, unfortunately, there is only one page of CSU documentation in the reference manual.

So what do you do? There is seemingly no publicly available document for this Central Security Unit. You go to the Freescale/NXP website, create an account, log in, and try to see whether you can download the document there. Unfortunately, you see it's a restricted document. Of course, it's about security, so you have to make it restricted and not publicly available. You kindly send an email, and you never hear anything back. You start searching the internet, and at some point we finally found it, let's say, on the dark net. So we got it from somewhere, and we started reading. And we learned: there is an access matrix, with access rights for each of the devices on this SoC.
We also learned that this access matrix, or rather the permissions in it, can only be changed in secure supervisor mode. For those of you not familiar with the ARM architecture: there is a technology called ARM TrustZone, which divides the processor into a secure and a non-secure world. Secure supervisor mode is the most privileged mode in the system, if you want to put it that way, and only this mode is allowed to change bits in the access matrix for the devices. It basically looks like a table: for each privilege level there is an entry that says whether you have the right to read or write a certain device. We learned that from non-secure user mode, which our driver was actually running in, you have no access at all, and that's why we were getting an asynchronous external data abort. Even from non-secure supervisor mode, where our kernel was running, we only had read access.

So what do you do? We had to set the bits in the bootloader, because our kernel already runs in non-secure supervisor mode and so cannot change anything in the CSU access matrix. This time we were lucky, because U-Boot was running on the secure side in supervisor mode and then loaded our kernel into non-secure supervisor mode. Once you have dropped to the non-secure world on an ARM CPU, you cannot simply go back up to a more privileged level. So: yay. Finally we were able to write our driver, and we were able to read and write from non-secure user mode, even with this famous bit, which had caused all the hassle in the first place, still set. So this was clearly a hardware design problem.

What is the takeaway from this experience? We experienced a really annoying and even useless security feature, because the bit's setting was not relevant anymore once we had configured read and write access for non-secure user mode in the CSU.
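The bootloader workaround boils down to a loop over the access matrix while still in secure supervisor mode. The register layout here is an assumption drawn from the restricted CSU document: one 32-bit config/security-level (CSL) register per pair of peripherals, with 0x00FF00FF granting read/write access to all modes for both halves. Treat the names and the value as illustrative:

```c
#include <stdint.h>

/* Sketch of the U-Boot hack: open up the CSU access matrix so that
 * peripherals become accessible from the non-secure world. Only code
 * running in secure supervisor mode can perform these writes. */

#define CSU_CSL_ALL_RW  0x00FF00FFu   /* assumed "full access" value */

void csu_open_all(volatile uint32_t *csl_base, unsigned num_csl)
{
    for (unsigned i = 0; i < num_csl; i++)
        csl_base[i] = CSU_CSL_ALL_RW;
}
```

In a real product you would not open everything up like this; you would partition the matrix per device, which is exactly the complexity discussed next.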
It was also a kind of security by obscurity, because the relevant document, which we needed to actually implement the driver in our system design, was deliberately obscured. Making such security documents publicly available would, in my opinion, be the better practice. And the hardware design makes the software implementation really complex. Remember, as a hack, we changed the access matrix in the bootloader. Of course, that's not the proper way to do it when you're building a real product or a proper system. You would have to install a secure operating system on the secure side, and then implement some API you can call from the non-secure side, so that the secure side does things for you and checks whether all the security permissions are OK, et cetera, et cetera. It makes everything very cumbersome, and I don't see any obvious benefit from this hardware design that actually gives you more security.

OK, on to story number two, which is about the MIPS cache architecture. This is an example where a decision of a hardware implementer makes software implementations at least a little bit more complex. Back in 2015, L4Re and the Fiasco microkernel were ported to the MIPS architecture, and so we learned a lot about MIPS. One of the special things we learned about was, of course, the cache architecture, which is pretty much defined by the MIPS architecture itself. At the first level you have separate data and instruction caches. One of their properties is that they are virtually indexed and physically tagged; this is the VIPT architecture. They have a fixed L1 line size of 32 bytes. They support four ways and between 64 and 512 lines, which makes for cache sizes between 8 kilobytes and 64 kilobytes. Let me give you an example. For the sake of simplicity, assume a way size of four kilobytes.
And we are using a page size of four kilobytes as well, which is the usual page size in almost every system. So here is our cache: we have these 32 bytes of data per line, and of course we need the tag, because once we have identified a cache line, we need to check whether it actually belongs to our virtual address and whether we have the data in the cache. Then we have a virtual address of 32 bits. To find the right byte in one of the cache lines (each line is 32 bytes), we need five bits to index into the line and get the right byte out. Then we have four kilobytes of cache per way; with 32 bytes of data per line, that makes 4,096 divided by 32, which is 128 lines per way. So we need seven bits for that, and we take the next seven bits of the virtual address to find the right cache line in the way. And because the cache is physically tagged, we take the upper 20 bits of the virtual address, the virtual page number, translate it via the TLB to get the physical page number, and compare that with the tag of the cache line we have identified. When there is a match, we have found the cache line containing our data, and we can use the byte index to get it.

So what happens if you increase the way size by a factor of two? Now we have eight kilobytes per way, and we still have a four-kilobyte page size. Again, we have our cache and our virtual address, and we still only need five bits to index into the cache line. But because we now have double the amount of cache, we need double the number of cache lines, and so we need one bit more: eight bits instead of seven. So instead of going up to bit 11, we need the bits up to bit 12. And of course we still need to find our tag, and there we have an overlap, because we still have 20 bits of page number identifying the physical page frame.
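The bit arithmetic above can be written down in a few lines of C. The helper names are made up; the code only restates the numbers from the slides (32-byte lines, 4 KiB pages):

```c
#include <stdint.h>

/* VIPT index arithmetic from the talk: a way of `way_size` bytes with
 * 32-byte lines is indexed by log2(way_size / 32) virtual-address bits
 * starting at bit 5. */

enum { LINE_SIZE = 32 };

static unsigned log2u(unsigned x)
{
    unsigned n = 0;
    while (x >>= 1)
        n++;
    return n;
}

/* Number of virtual-address bits used to pick a line within one way. */
unsigned index_bits(unsigned way_size)
{
    return log2u(way_size / LINE_SIZE);
}

/* Index bits that lie above the page offset: these come from the virtual
 * page number rather than the page offset, so two mappings of the same
 * physical page can land on different cache lines whenever this is
 * non-zero. */
unsigned alias_bits(unsigned way_size, unsigned page_size)
{
    unsigned top = 5 + index_bits(way_size);  /* one past the highest index bit */
    unsigned off = log2u(page_size);          /* page offset bits, 12 for 4 KiB */
    return top > off ? top - off : 0;
}
```

With a 4 KiB way and 4 KiB pages, `alias_bits` is 0 and there is no problem; with an 8 KiB way it becomes 1, which is exactly the overlap described above.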
So there is an overlap between the cache line index bits and the bits of the page number. Let me give you an example where this becomes a problem. Imagine we have physical page 0 mapped at two different virtual addresses. The first virtual address, for simplicity, is 0. We take our eight bits of cache line index, and because the address is 0, we are at index 0. Then we take the upper 20 bits of our virtual address to find the tag, and because it's 0, the tag is 0 as well. Now we have a second virtual address, again backed by physical page 0, so all 12 lower bits are 0, but it is mapped at a different virtual address. Again we use the eight bits for the index, and this time the index is 128. We translate the virtual address, and, as I told you, we are mapping the same physical page, so we also get the tag 0. What does that mean? It means we now have the same data twice in the cache: there is an aliasing problem. The problem becomes even more pronounced if you increase the way size further.

So what can we do as an operating system, or as an implementer of software? The first solution is that MIPS luckily supports page sizes beyond 4 kilobytes; we can have 8-kilobyte or even 16-kilobyte pages. Another solution would be to go through the cache on each context switch and flush the aliases. Or you need to make sure in your operating system that when you map pages to virtual addresses, you never create these kinds of aliases, which in the end comes down to having larger page sizes anyway. And that is what we actually did in the L4Re system: on this particular MIPS configuration, you are only allowed to use 16k pages.
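The two lookups in this example come down to one line of bit arithmetic, shown here with the numbers from the talk (8 KiB way, 32-byte lines); the function name is made up:

```c
#include <stdint.h>

/* Line index for an 8 KiB way: 256 lines of 32 bytes, so the index is
 * VA bits [12:5]. Physical page 0 mapped at VA 0x0 and at VA 0x1000
 * lands at indices 0 and 128 respectively, so the same data sits in the
 * cache twice. With 16 KiB pages, VA and PA agree in bits [13:0], so
 * bit 12 is the same for every mapping and the alias cannot occur. */

unsigned line_index_8k(uint32_t va)
{
    return (va >> 5) & 0xFFu;   /* 5 offset bits, 8 index bits */
}
```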
But then, before talking about the takeaways of this story: you have the problem that the system implementers of a SoC still assign the MMIO devices on 4k boundaries. That means, with a 16k page size, you are not able to separate or virtualize devices lying on the same 16k page if they are assigned on 4k boundaries.

In the MIPS hardware implementer's recommendations there is actually a sentence about this, so they were aware of the issue. They say: as an implementer of the CPU, you can choose between having this aliasing problem and reaching higher clock speeds, or you essentially switch on a bit in the synthesis of your hardware, with the result that you don't have the aliasing problem, but then you only run at lower clock speeds. So you are possibly sacrificing performance. Basically, this is one example where a hardware implementation choice, rather than an architecture design choice, has an effect on the software at the top.

The last story is about graphics virtualization. This is an example where the hardware turned out to be pretty OK, but in the end the good architecture and a poor software implementation didn't fit together. We had a use case for a driving system with two virtual machines, one GPU, and two connected displays. The requirement was that each of the virtual machines can use the GPU for hardware-accelerated graphics and also for GPU-accelerated computation. This particular hardware featured an Imagination PowerVR GPU. This is the coarse-grained architecture of the GPU, and the interesting thing to keep in mind is that it features multiple shading units; these are the compute units in the GPU that render or calculate the jobs sent to it. And in front of this unified shader array, you have a scheduler.
What you can do as a driver is send jobs to this GPU, and the scheduler takes care of assigning those jobs to the available shader units, which then finally produce the result for you. From a high level, the architecture looked like this: you have a sort of master virtual machine which does all the boilerplate to set everything up, and then you can spawn slave VMs, which each get a piece of shared memory. They get their own ring buffer and kick register on the GPU where they can store their data and their shader programs, and by writing the kick register they can kick the scheduler to actually pick up the work. These are somewhat the same working mechanisms we learned about this morning in Sebastian's talk on the Intel GPU architecture.

This is a picture of the SoC architecture used on this board; you can see the GPU as a block. The hardware was also designed such that there were several independent display output units, and each of them had its own frame buffer. So we could assign each VM one of the frame buffers, and they could render their stuff into it.

OK, let's do it. What you usually do in a project like this is take the established software you get from the vendor and make it run, so that you at least understand the workflow and the problems that are there. Even getting native Linux to run on this board wasn't that easy, because you were bound to a very specific version of Linux and you needed a couple of patches. The graphics stack, too, was very tightly constrained in its versions and also required patches.
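The slave-VM submission path can be sketched roughly like this. All names and the layout are hypothetical (PowerVR internals are not publicly documented), but the mechanism is the generic one: commands go into a ring buffer in shared memory, and a write to a per-VM kick register tells the scheduler to pick them up.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of the master/slave submission mechanism: each
 * slave VM owns a ring buffer in shared memory plus a kick register.
 * None of these names come from the real PowerVR interface. */

struct gpu_ring {
    uint32_t *buf;     /* command words, shared with the GPU */
    size_t    size;    /* number of entries, power of two */
    size_t    head;    /* producer position */
};

/* Queue one command word. */
void ring_push(struct gpu_ring *r, uint32_t cmd)
{
    r->buf[r->head & (r->size - 1)] = cmd;
    r->head++;
}

/* The "kick": tell the GPU scheduler how far it may consume. */
void ring_kick(const struct gpu_ring *r, volatile uint32_t *kick_reg)
{
    *kick_reg = (uint32_t)r->head;
}
```

The point of the per-VM kick register is that a slave VM never needs to talk to the master on the submission fast path; only setup goes through the master.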
Once this had been sorted out, the next natural step was to run one Linux instance in a virtual machine and pass through all of the devices, so that you at least know you can run it in a virtual machine at all. This also worked quite well. Then the idea was to run the setup with one master and one slave VM, and that is where the fun actually started. Here is a curated list of the problems we encountered. One of the nice things was that we have a daily stand-up meeting in our company, and the colleague working on this started his report every day with: "and I'm still working on the graphics virtualization on my Renesas board."

The first big problem he had to overcome was assigning a dedicated display unit to a VM. If you remember the architecture picture of this board, there are three independent display units, and it should be as easy as modifying the device tree for the Linux kernel so that only one display unit is assigned, then booting Linux, and you should be fine; that's exactly what device trees are for. Actually, Linux still expected to have access to all of the display units and simply crashed when it was initializing the non-assigned ones. OK, that's something you can work around, but it was still a poor software implementation. And then there was a hardware issue: the clock controller, which controls clock speeds, sleep states, et cetera, cannot be securely partitioned in the way it was implemented on this board.
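Why the clock controller cannot be partitioned is just page arithmetic: MMU-based isolation works at page granularity, so two registers can only be handed to different VMs if they lie on different pages. A small sketch (the function is illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Two MMIO registers can be assigned to different VMs only if they lie
 * on different pages; page_size must be a power of two. */
bool separable(uintptr_t reg_a, uintptr_t reg_b, uintptr_t page_size)
{
    return (reg_a & ~(page_size - 1)) != (reg_b & ~(page_size - 1));
}
```

The same arithmetic explains the MIPS story: with a 16k page size, two devices placed on 4k boundaries within the same 16k page are never separable.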
What this essentially means is that the clock and power control registers for the different IP blocks on this SoC are on the same four-kilobyte page, so you cannot assign them securely to different virtual machines. You can either map this 4k page into both of them, and then you have to rely on well-behaved clients, or you have to find something else. And there was another problem. As a first try, because we had to get the demo going, we said: let's give the clock module to both of the VMs and see what happens. What we saw was that one of the Linuxes was shutting down devices belonging to the other Linux, because it thought: I don't see these devices in my device tree, so let's shut them down to save power. So we had to think of something else here.

Another issue, which was also a limitation of the hardware design of the board, was that the graphics stack required a significant amount of memory below 4 gigabytes. I didn't mention that this board was actually a 64-bit architecture, so memory shouldn't be a problem, and it did feature 4 gigabytes of memory. But only 512 megabytes of it were physically attached below the 4-gigabyte boundary; the rest was above it, which made it a little bit difficult to allocate memory below 4 gigabytes for both of the VMs. So we had to do some tricks there as well.

What is the takeaway from this story? At first glance, it looked like many things were done right in the hardware: they had properly thought about how the GPU can be virtualized, and it was a nice architecture. But the clock module was probably developed in another division of the company, which hadn't thought about the same problem.
And of course there are also the memory constraints: imagine you have three VMs in this scenario, and it becomes even more difficult to get memory below 4 gigabytes for all the virtual machines. And we have to admit that the software, the Linux drivers, were very poorly implemented. There were a lot of issues, and you needed to catch and debug a lot of things, even in the user-space applications and the whole graphics stack. There were a lot of bugs and problems that needed fixing even when we ran this natively on the hardware; it was in a poor state at that point in time. And probably other people working on GPUs can chime in here: there is very, very poor documentation with regard to GPUs, how they work, what needs to be done, and everything.

OK, I'm now coming to the end of my talk. I hope I could make at least a small point that there are issues between operating systems and hardware that could have been avoided if the hardware designers better understood the requirements and use cases of operating systems engineers, and, the other way around, if operating systems engineers and driver developers had a better sense of the use cases actually implemented on top. So that it doesn't happen that, for example, a running Linux kernel expects all the hardware to be available, even though there are technologies like device trees which are specifically designed to make the Linux kernel more runtime-configurable.

To conclude my talk, I have a question for you: how many programmers does it take to change a light bulb? Are there any guesses? [Audience: Zero.] Zero is correct, because it's a hardware problem. Thank you. Questions? Or if you want to share another story with us, we still have some minutes left.

[Audience member] I have a similar story, and it's actually connected to a question, because there is RISC-V.
And probably there won't be another chance in 15 years for us software engineers and operating system designers to influence an important hardware architecture. So my question would be whether you as a company, or any of the other microkernel companies present here, have tried to directly influence the RISC-V specification. And my story is connected to RISC-V. From my point of view, the specification is fine, but the RISC-V reference simulator has a very strange input/output model for basic character input and output. It's designed to be usable for performance work, but not from a microkernel; it's horrible. The input device, the keyboard, has to be driven in a very strange way. It's not microkernel-friendly. So that's my story.

[Matthias] OK, let me comment on the question part of this. The question was: RISC-V is an upcoming architecture, and there is a window of opportunity where we, as operating systems developers, can maybe influence the architecture decisions made for this ISA. And yes, we have had a look at RISC-V, or at least we submitted a topic for the Google Summer of Code this year. If we are selected as a mentoring organization for this year's Google Summer of Code, and if we find a student, then there will be work done on RISC-V. But we haven't had any detailed look into the architecture so far.

[Audience member] OK, we have this story with a quad-core i.MX6 port; there were four cores on it. We enabled three of them and it worked, then we enabled the fourth one, and the whole system crashed with totally garbage values. After some digging, it turned out these values were big-endian. On ARM you are able to switch data accesses between little-endian and big-endian, and the first three cores were configured little-endian, but not the fourth one.
And you were like: why is this thing not working, you know?

[Matthias] OK, so this is actually a nice story, again about the i.MX6, this time a quad-core system. The first three cores booted up just fine, and the fourth core was just producing garbage. It turned out that on ARM you can run a core in either big-endian or little-endian mode. The first three cores were running in little-endian mode, but the fourth core was starting up in big-endian mode. Was that actually an issue which could be fixed in the firmware, or was it just configured wrong on the hardware side?

[Audience member] We fixed it in the firmware, but I don't know whether the hardware startup could have been configured differently.

[Another audience member] My personal perspective is that hardware and software engineers actually do talk to each other, but only when they sit in the same company, like, for example, Qualcomm. There are software engineers developing the Linux kernel for Qualcomm's processors, and of course they talk to the hardware engineers; they have short communication paths. But we as outsiders of these companies don't have this opportunity. So I think what's really missing is a more transparent way for these companies to interact with the whole community. And especially, you and we are sitting in the same boat: we are basically outsiders. Actually, the whole Linux movement is making things more complicated for us, because in the past there was public pressure on these companies to publish information. But now those companies can say: well, we've got a Linux kernel with all the drivers, so we are fine, we are the nice players. And we are stuck in a situation where we don't have the documentation, and we have to more or less reverse-engineer the documentation out of the Linux kernel drivers, for example.
And that's the one point I want to make: it's not just missing communication. The people inside a company can communicate, just not with us.

[Matthias] So to summarize your comment, and I agree with you: it's probably not only that hardware and software engineers need to talk more, but also that the companies selling hardware and software, or at least the base software, should talk more with the community; there should be better communication between the different entities here. Because not every company can have a look at all of the use cases that are out there.

And with that, I would say we conclude this year's microkernel devroom. That's it for this year. Thank you, everyone, and a special thanks to Jakub for organizing everything. Maybe we are looking for volunteers to organize it next year. And if some of you would like to join, we will meet at 20:30 and try to find a restaurant, where we can share even more stories from the hard days of software development. Thank you.