So our next speaker is Martin Děcký, and the stage is yours, Martin. Thank you. Thank you all for coming to my talk. Before I start, let me spend maybe a minute or two on myself. Who am I? In case you don't know me at all: I have been a passionate programmer for as long as I can remember, with a special inclination towards, let's say, system-level stuff and microkernels. I have been working on the HelenOS microkernel multiserver project since 2004, but this talk won't be about HelenOS this year. I also changed roles quite recently. I spent more than 10 years in academia, and last year I decided that I should also look at the operating system development landscape in industry, so I switched to Huawei Technologies. You probably know Huawei as a producer of smartphones. Some of you might know Huawei as a supplier to telco operators and enterprise companies. But we are a large company, so we also have our own microkernel, partially formally verified. We also have our own Unix kernel, actually two of them, but I won't be talking about those either, because they are closed-source software so far. So maybe another time.

So what am I going to talk about, in case you don't know the keywords, or buzzwords, however you would call them, from the title of the talk? This is the primary motivation of my talk: something called the memory wall. You have probably noticed that the issue with current computer hardware and architectures, since at least the beginning of the 1980s, is that the relative speed of the CPU grows much faster than the relative speed of memory. This graph shows the comparison up to 2005, but believe me, there has been no positive change in this regard since. So basically we have very fast, powerful CPUs which are being starved of data. The RAM, and of course also persistent storage, is not able to keep up performance-wise. There is a textbook answer, one way to mitigate this problem, and that is caches. Our CPUs currently have multiple layers of caches that try to mitigate this problem, to allow the CPU to run at its top speed by caching the data it needs to use. And of course this needs to be taken into account.

Here is, as I have said, a textbook example. Imagine you are comparing two sorting algorithms. One is classical quicksort, a comparison-based sorting algorithm, so it runs in O(n log n), where n is the number of elements you are sorting. And if you know something about your elements, you can use a specialized sorting algorithm like radix sort, which can run in linear time. So you implement both, you benchmark them, and the first benchmark is quite reasonable: depending on the number of elements you are sorting, you count the number of instructions per item. And you get what you expect. The radix sort initially has a higher number of instructions per item, but then of course it wins, because it is a linear algorithm. Nothing very surprising; that is what you expect from the theoretical computational complexity. However, with just a straightforward implementation of these algorithms, you might get into this. Now you are comparing not the number of instructions, but the actual number of CPU cycles taken per element by the implementation of the algorithm. And suddenly this is a totally different picture: suddenly the linear algorithm is not winning, for some reason.
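To see the same instructions-versus-cycles gap yourself, here is a minimal sketch, not from the talk: the two loops below execute the same number of instructions per element, yet the random traversal can be an order of magnitude slower once the array no longer fits in the cache.

```c
/* Minimal sketch: identical work per element, very different cache behaviour. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                      /* 16M ints, far larger than typical caches */

static uint64_t rng = 88172645463325252ULL;
static uint64_t xrand(void)              /* xorshift64: a simple, fast PRNG */
{
    rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
    return rng;
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    size_t *idx = malloc(N * sizeof *idx);
    if (!data || !idx) return 1;
    for (size_t i = 0; i < N; i++) { data[i] = (int)i; idx[i] = i; }

    /* Fisher-Yates shuffle of the index array to force random access. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)(xrand() % (i + 1));
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    long long sum = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < N; i++) sum += data[i];        /* sequential: cache lines fully used */
    clock_t t1 = clock();
    for (size_t i = 0; i < N; i++) sum += data[idx[i]];   /* random: nearly every access misses */
    clock_t t2 = clock();

    printf("sequential %.3f s, random %.3f s (sum=%lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(data); free(idx);
    return 0;
}
```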
So there is not only some additive constant hidden by the big-O notation, but obviously also some multiplicative constant, and thanks to it the quicksort is surprisingly still beating the radix sort. And if you dig deeper, and of course you can read this textbook example yourself, you find out that the issue is precisely the incorrect, inefficient use of the caches. The way you would most straightforwardly implement those two algorithms makes the quicksort much more cache-friendly than the radix sort. So although radix sort should beat quicksort, it probably won't, because you don't properly exploit the spatial and temporal locality of the data you are accessing. Thus the caches cannot help you, thus you end up reading from memory, and the memory wall I spoke about in the beginning kills you. So one way to deal with this is to implement your algorithm in a cache-aware way, to make it cache-friendly. Okay, I will stop here. I could obviously talk about this topic for four hours, but let's switch to something different.

Just one small observation that you should take home, if nothing else, from this talk. We usually consider accessing memory to be a constant operation, an operation with constant complexity. It is true that a random access to memory takes a constant number of operations, one, but it does not necessarily take a constant number of time units, precisely because of these caching effects, which can make your algorithm run 10 or 100 times slower if you don't fit into the cache. So this assumption is not true. What you should actually consider is that a memory access into today's RAM behaves like O(√n), where n is the size of your data, your working set. Generally speaking, the more data your algorithm is working with, the slower it will be, obviously, because you cannot fit all the data into your fast caches and you have to access the RAM.
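A rough sketch of how you can observe this scaling yourself, not from the talk: the loop below performs a fixed number of random accesses over working sets of growing size, so the operation count never changes, yet the time per access jumps each time the working set outgrows another cache level.

```c
/* Rough sketch: same number of random accesses, growing working set. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ACCESSES (1 << 24)                    /* 16M accesses for every size */

int main(void)
{
    for (size_t n = 1 << 12; n <= (1 << 26); n <<= 2) {  /* 16 KiB .. 256 MiB */
        int *a = malloc(n * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < n; i++) a[i] = (int)i;

        size_t pos = 0;
        long long sum = 0;
        clock_t t0 = clock();
        for (long i = 0; i < ACCESSES; i++) {
            pos = (pos * 1103515245u + 12345u) % n;      /* cheap pseudo-random walk */
            sum += a[pos];
        }
        clock_t t1 = clock();

        printf("working set %9zu ints: %.3f s (sum=%lld)\n",
               n, (double)(t1 - t0) / CLOCKS_PER_SEC, sum);
        free(a);
    }
    return 0;
}
```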
And there are some ways, or proposed ways, to break this memory wall, to get rid of this troublesome issue. Because it is troublesome: it violates our primary assumption that accessing a random piece of memory should always take the same constant time. One of them is to somehow rethink the entire hardware architecture of our machines. Of course, if I show you this basic picture of the von Neumann architecture, it is not a completely accurate fit anymore. We have more complex machines than the ones built and designed in the 1940s: more CPUs, not just one CPU; our peripherals are usually combined, not strictly input or output, they can be input-output, and so on. But generally speaking, our machines are still von Neumann machines. There is this clear separation between the memory, the computing unit that does the calculations with the data, and, let's say, the peripherals that store the data persistently. And this could be changed. For example, you might have heard about new emerging memory technologies that try to solve the problem of the memory wall I have spoken about: they strive to be as fast as we can make the CPUs, and ideally also to remove the split between persistent and non-persistent, that is, volatile and non-volatile, memory. So, to have persistent memory that would be as fast as the CPUs. That would obviously solve all our problems, and it would also reshape the way our software is built.

If you have heard about projects like The Machine from HP, or if you were to listen to the parallel talk by Liam Proven that is running right now, he might give you some historical perspective on these different computer architectures, basically architectures with single-level memory, or universal memory. I have listed some of the technologies currently in development that try to make this work. As you can probably tell by looking at your smartphones and your laptops, we don't have these technologies in them yet. They are promising, they might solve all our problems in the near future, let's hope, but they are not doing it yet. So I won't be talking about this either.

Let me switch the topic for the third time and speak about, let's say, more evolutionary, less radical solutions to the problem of the memory wall, and this is called near-data processing: reshaping the architecture of our computers by moving some parts of the computations closer to where the data actually is. For example, moving the computation partially into the memory, or into the storage. Again, this is not a completely new idea, which is why this is an evolutionary, not a revolutionary approach: spatial locality of data is something we have been exploiting in all the other approaches for a long time. GPUs, for example, have basically been doing the same for the past 15 years: very specialized circuitry, very specialized processors, working on the data stored close to them and not burdening the primary, general-purpose CPU with these tasks. So it is about breaking the monopoly of the primary CPU on working with the data. Our CPUs are fast, but they are also power-hungry. If you can have dedicated circuitry that solves part of the data-processing problem closer to the data, you can gain performance, and you can potentially also save energy, because you have less data to move around the machine.

This has been shown to be true; I am really not just inventing this stuff. You can read academic publications from people from Samsung, for example, and others, who show that this near-data processing approach really can work in specific cases. For example, this is one paper that demonstrates near-data processing on SSD storage, on SSD controllers, where you can really offload part of, let's say, database queries or big-data queries onto the SSD controller itself, and it will perform better. Immediately you can say, okay, this might not work in all cases, and of course it doesn't. It won't work where the data processing is computationally heavy. In that case the poor ARM cores on the controllers cannot possibly beat the beefy multi-core CPU you have at the center of your computer architecture. But think about a different scenario. Think about when your CPU is already under load, loaded to 90% or whatever. Then that little help, that little push that you can get from the embedded cores in the controllers, although they are not so fast, can still help you. Think of them as co-processors, as additional computational units that might get you more computational power. So that really works.
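To make the offloading idea concrete, here is a hypothetical sketch of the kind of filter kernel a controller core could run instead of the host; the record layout and the function are invented for illustration, not taken from the talk or from the cited paper.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One fixed-size record as it might be laid out inside a data block. */
typedef struct {
    uint32_t key;
    uint32_t value;
    char     payload[56];
} record_t;

/* Hypothetical device-side filter: scan a raw block and copy out only the
 * records whose key falls in [lo, hi]. The return value is how many bytes
 * are actually worth shipping to the host. */
static size_t ndp_filter_block(const uint8_t *block, size_t block_len,
                               uint32_t lo, uint32_t hi, uint8_t *out)
{
    size_t out_len = 0;
    for (size_t off = 0; off + sizeof(record_t) <= block_len; off += sizeof(record_t)) {
        record_t rec;
        memcpy(&rec, block + off, sizeof rec);   /* avoid alignment assumptions */
        if (rec.key >= lo && rec.key <= hi) {
            memcpy(out + out_len, &rec, sizeof rec);
            out_len += sizeof rec;
        }
    }
    return out_len;
}

int main(void)
{
    record_t block[64] = { 0 };
    uint8_t  out[sizeof block];
    for (uint32_t i = 0; i < 64; i++) block[i].key = i;
    /* Only 11 of the 64 records survive, so only ~17 % of the block moves. */
    size_t n = ndp_filter_block((const uint8_t *)block, sizeof block, 20, 30, out);
    printf("%zu of %zu bytes shipped to the host\n", n, sizeof block);
    return 0;
}
```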
And about the other benefit, the energy consumption: this has also been shown to help, because, again depending on the scenario, you can save actual energy spent when pre-computing or pre-filtering something close to the data, compared to the usual case where you just blindly move the data to the power-hungry, beefy primary CPU, where half of the data, or maybe even more, will be thrown away anyway. Of course, and this is very important, it is scenario-specific, workload-specific. The best case, where this really works, is when you have a large filtering ratio, a high selectivity, so you save on moving data that you would have filtered out anyway, or where you can do some, let's say, very basic pre-computations that help with the heavier ones later.

And this is where the two branches of near-data processing come in. The first is near-data processing in memory, really on the DRAM chips, and of course this is the problem we started with in the beginning: the DRAM is slower than the CPUs. The circuitry cannot be made to operate at the speed the CPUs do, and where you are able to do it, say using static RAM, it is much more complex and costly. But the DRAM chips still have a lot of parallelism in them. Imagine a regular DIMM, which you put into your machines. It has multiple chips, and these chips have independent memory matrices in them. This is a crude picture of how the DRAM might work: each time you need to access some word in this memory, the DRAM controller has to program the memory matrix to fetch a relatively long hardware word, maybe 256 bits, maybe even larger, and then you have to filter out the smaller units that you are actually trying to access. Even with the current caching approaches you do some optimizations: usually you are not interested in individual words but in entire cache lines, so that helps you pre-populate the cache; you might do some prefetching; you might do things like critical-word-first, which fetches the bits you are really interested in first and then prefetches the rest of the hardware word into the cache, and so on. But you can go further. You can imagine that, since you have this level of potential parallelism there, you can extend this gating logic to do some simple bitwise filtering: dropping the data that does not match the pattern you are interested in, or doing some very crude bitwise computations. This won't slow it down in any way, and again, it might save you from moving data that you would throw away later anyway. So this is one branch of near-data processing. I am currently not working on this, because Huawei unfortunately does not produce its own DRAM chips, so we are limited here.
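As an illustration of the kind of bitwise predicate such extended gating logic could evaluate, here is a small sketch in C; the function and its interface are invented for illustration, modelling a 256-bit hardware word as four 64-bit lanes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical gating-logic predicate: does a 256-bit hardware word
 * (modelled as four 64-bit lanes) match `pattern` on exactly the bit
 * positions selected by `mask`? */
static bool word_matches(const uint64_t lanes[4],
                         const uint64_t pattern[4],
                         const uint64_t mask[4])
{
    for (int i = 0; i < 4; i++)
        if ((lanes[i] ^ pattern[i]) & mask[i])
            return false;                      /* some selected bit differs */
    return true;
}

int main(void)
{
    uint64_t word[4]    = { 0xdeadbeef, 0, 0, 0 };
    uint64_t pattern[4] = { 0x0000beef, 0, 0, 0 };
    uint64_t mask[4]    = { 0x0000ffff, 0, 0, 0 };  /* only low 16 bits matter */
    printf("match: %d\n", word_matches(word, pattern, mask));
    return 0;
}
```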
But I do work on the other branch, which is in-storage processing: applying very similar principles to the persistent memory, to SSDs, to flash memory. You can use the same principle there. You can benefit from the fact that you have several flash chips that you can access in parallel, and the chips themselves offer another level of parallelism through the independent channels and ways they provide. Also, the SSD controllers inherently have more computational power than the DRAM chips, because you already need some computational power there from the beginning: you have a flash translation layer, you have to do garbage collection due to the way flash memory works, you have to do wear leveling. So it is not surprising to find, in flash controllers slightly above the consumer level, on the border between consumer and enterprise, let's say an eight-core ARM CPU. This already gives you more possibilities than doing just very simple static filtering.

And we have our own prototype where we try to test and benchmark these ideas. It should be open-sourced, probably this year, so anybody will be able to use it. We base it, of course, on an existing open source project, which is why I am going to spend a few minutes talking about it. If you don't know it, there is a very nice open source project called OpenSSD, which is basically a GPL implementation of an entire, let's say, real-world SSD controller. It has two parts. It has a specification of the hardware for a Xilinx FPGA platform, and you can actually attach real NAND chips to it; it has everything you would expect, like the ONFI interface, PCI Express, and the NVMe interface for communicating with the host. And it contains the firmware sources, which do all the things I have mentioned: the flash translation layer, garbage collection and so on. A very nice project. We are extending this project to provide some generic near-data processing capabilities on top of the NVMe protocol, and we would also like to push it eventually into the NVMe specification, so it wouldn't be a Huawei vendor-specific thing, but a general-purpose standard.

So what we can do, or what we should be able to do in the course of this year, is to offload some data-processing code to the controller, possibly as some safe bytecode. Because when we speak about offloading code somewhere, we always have to think about the potential security threats and issues; we don't take this lightly. This is connected to the data sets. Imagine you might have multiple tenants, multiple independent users, accessing the data. You don't want them to step on each other's toes, so you want to be able to isolate them the way the kernel would isolate them in the case of normal data access. Then we have the NDP read and write commands, which are equivalent to the usual read and write commands, but with this additional data processing, so they could do filtering, or some aggregation, or maybe other things. The computational model we are currently using is flow-based. Of course, we don't want arbitrary execution on the controller, calling some syscalls, whatever; that would not make much sense. It should really be tied to what this is supposed to do: data processing. And we are adding a totally new NVMe command for transforming the data. In the simplest case you can think of it as data copying without sending the data to the host and back. Currently, if you have any NVMe flash or SSD controller and you would like to implement something like copy-on-write file systems do, like Btrfs or ZFS, you unfortunately really have to read the metadata from the device and write it back to some other addresses. This command might save you that round trip, so even for this simple case it might be beneficial. Okay, so we would certainly like to demonstrate this not just on toy examples, but on some real-life scenarios.
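For a feeling of how such an NDP read command might look from the host side, here is a purely hypothetical sketch using the stock Linux NVMe passthrough interface. The ioctl and the struct are the standard Linux ones; the opcode and the meanings of the command dwords are invented for illustration and do not correspond to the actual prototype or any specification.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

#define NDP_OPC_READ 0x99   /* invented vendor-specific opcode, not in any spec */

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDWR);   /* needs root and NDP-capable firmware */
    if (fd < 0) { perror("open"); return 1; }

    void *buf = calloc(1, 4096);             /* buffer for the filtered result */
    if (!buf) return 1;

    struct nvme_passthru_cmd cmd = { 0 };
    cmd.opcode   = NDP_OPC_READ;
    cmd.nsid     = 1;
    cmd.addr     = (unsigned long long)(uintptr_t)buf;
    cmd.data_len = 4096;
    cmd.cdw10    = 0;    /* invented meaning: starting LBA of the data set */
    cmd.cdw12    = 42;   /* invented meaning: id of a previously uploaded filter */

    if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0)
        perror("NVMe passthrough");
    else
        printf("controller returned result 0x%x\n", cmd.result);

    free(buf);
    close(fd);
    return 0;
}
```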
So we are currently working on a custom storage engine for MySQL that would make use of this by means of operator pushdown. So if you have an SQL query, like SELECT something WHERE something, that WHERE part should be at least partially pushed down, offloaded, to the storage. And maybe some other scenarios. Just as a side note, we have also created an emulated setup in QEMU. It is an interesting setup: you have one QEMU instance running the ARM firmware of the SSD controller, another QEMU instance running the host, a usual x86 virtual machine, and you connect the two so that you can independently verify your extensions. Of course, this cannot be used for benchmarking purposes; it is just for speeding up development, because all those Xilinx tools are, let's say, interesting. They are not open source, and we are really dedicated to the open approach. So performance evaluation would need to be done on the actual hardware, but especially for people to be able to poke into this, having a QEMU model is nice.

Okay, so how do microkernels fit into this? Because we are in the microkernel devroom and I have hardly mentioned them. I think this is just the first step. This really calls for a totally different approach to programming our machines. Currently I would describe the approach as compute-centric: we still have this central CPU that does most of the heavy lifting, and from time to time we offload something to some offloading unit, be it a GPU, an SSD, smart memory, whatever. But we can go further. We can start thinking about our machines as massively distributed systems again, and not only distributed across the network, but as a combination of multiple heterogeneous computational units within the box you have. Not just the central CPU and some peripherals around it, but a combination of different CPUs, possibly with different instruction set architectures, with different ways of operating, but processing the data, running your programs, your applications, in combination.

And here is where microkernels come in. Microkernels are ideal for this, because on one side a microkernel creates a very simple, lean interface for the applications: basically just memory management, scheduling and IPC. And the IPC part is also important, because it creates an abstract and at the same time well-defined interface, as Norman basically showed in the previous lecture, for creating complex systems built out of fine-grained components. And yeah, this is why I like multiserver microkernels; hopefully many of you do too. But this can simply be extended. I mean, why think about just a single CPU running a single microkernel and a single composition of components on top of it? Why not think about this as a distributed system? Most of the building blocks will stay the same. You can run maybe different microkernels, but implementing a very similar API, on the primary CPU, on the embedded cores in your SSD, in your NIC, in your GPU, and you can spread the entire system among these different cores while keeping the communication interface the same. Of course, the transport below will be different, because it is a different thing to send a message from one CPU to another than to use shared memory in a memory-coherent system. But in principle the API can cover it; it does in many microkernels.
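A minimal sketch of that "same API, different transport" idea; the types and names here (ipc_msg_t, transport_t) are invented for illustration and are not from any particular microkernel. The application calls one send function, and only the transport behind it changes.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, deliberately tiny message-passing interface. */
typedef struct {
    uint64_t method;
    uint8_t  payload[56];
} ipc_msg_t;

typedef struct transport transport_t;
struct transport {
    int (*send)(transport_t *t, uint32_t dest, const ipc_msg_t *m);
};

/* The application-visible call stays the same whether `t` is backed by
 * shared memory between cores on one die, a PCIe doorbell ring to the
 * SSD controller's cores, or a network socket to another machine. */
static int ipc_send(transport_t *t, uint32_t dest, const ipc_msg_t *m)
{
    return t->send(t, dest, m);
}

/* A loopback transport, standing in for any of the real ones. */
static int loopback_send(transport_t *t, uint32_t dest, const ipc_msg_t *m)
{
    (void)t;
    printf("delivered method %llu to task %u\n",
           (unsigned long long)m->method, dest);
    return 0;
}

int main(void)
{
    transport_t loop = { loopback_send };
    ipc_msg_t msg = { .method = 7 };
    return ipc_send(&loop, 42, &msg);
}
```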
I mean, Samuel also said this in the morning: it is probably very easy in most microkernels, including the Hurd, to extend the communication between tasks from the regular local kind to communication over a network, going the distributed-system way. So yeah, this slide is just for the people who read the slides; it summarizes what I have just said: the IPC is the least common denominator, and the message-passing mechanism can be used to build distributed systems very easily. And a final point: we, as microkernel multiserver developers, are naturally used to thinking in terms of distributed systems, because we don't build those huge monoliths. If we do something, we know that we should take this small component and glue it to that small component and stick another small component in between, building something complex from LEGO bricks. And it's nice. So switching the perspective from building, let's say, a system that runs on a single CPU, on a single computational unit, to a distributed system is not a big stretch, I would say.

Okay, so aren't we actually talking about multikernels? Yes, why not? The approach I have outlined could also be implemented by multikernels. Actually, if you look at Barrelfish or similar projects, they are basically microkernels taken to the extreme: running a separate kernel on each individual core, even though you have memory coherency between them. So this is nothing against the idea I am proposing. What about unikernels? Okay, why not? If you insist on them: if you imagine a microkernel with a single static workload on top of it, where for some reason you don't care about the isolation between the kernel layer and the user-space layer, that is basically a unikernel. So it is also covered. I mean, we are all friends.

So what do I propose? Of course, I would be glad if you extended this idea in any direction you choose. We are doing some initial steps ourselves: we are already working on the in-storage workload offloading, and we might move to NICs maybe next year. And gradually, as we implement this in practice, measure it, benchmark it, extend it, we can move from our crude runtimes, which we obviously need, to some proto-microkernels, and then gradually to full-fledged microkernels. And the important thing that I am obviously skipping, and it is an open problem, is how to create some nice frameworks for writing these distributed applications. It is the same issue as with concurrency: people usually avoid writing concurrent code because it is conceptually more complex than writing sequential code, step one, step two, step three, then a loop, then step four; that is simple to grasp. Writing code in a parallelized way is simply more complex to think about, and with distributed systems it is the same issue. So at the end of the day, what we would like to have are some very nice frameworks with some very nice abstractions, something like an actor model or agent model, that would allow the end programmers to make use of this machinery in a very simple way, simple from the conceptual point of view.
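As a toy illustration of why an actor-style abstraction hides the distribution so well, here is a deliberately tiny sketch; everything in it is invented for illustration. The caller sends a message to an actor without knowing, or caring, which computational unit the actor lives on.

```c
#include <stdio.h>

typedef struct actor actor_t;
typedef void (*behavior_t)(actor_t *self, int msg);

struct actor {
    const char *name;
    behavior_t  behave;
};

/* In a real framework this would enqueue into the actor's mailbox, and the
 * actor might live on the host CPU, an SSD core or a NIC core; here we just
 * invoke the behaviour directly to show the programming model. */
static void actor_send(actor_t *to, int msg)
{
    to->behave(to, msg);
}

static void doubler(actor_t *self, int msg)
{
    printf("%s received %d, produces %d\n", self->name, msg, msg * 2);
}

int main(void)
{
    actor_t a = { "doubler-on-ssd-core", doubler };
    actor_send(&a, 21);   /* the caller does not care where `a` runs */
    return 0;
}
```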
Okay, so this is something between engineering and research, so of course there are many open questions. And at the end of the day this requires a lot of manpower, as usual. There is this elephant in the microkernel devroom every year: we all claim that microkernels are so great, so why aren't we using them every day, everywhere? Basically, we don't have the manpower to push them that far. Those are not problems of the idea; those are problems of the implementation. We are just running out of duct tape or WD-40 from time to time. That's the only problem, at least as I see it.

So, to sum up: I have shown you first that we have some issues with current computers; you probably knew it. The memory is too slow, or in other words, much slower than we would like it to be. There are some revolutionary approaches to solving this problem, like building a better memory. And there are some evolutionary ways to solve it, like moving some of the computations closer to the data, making use of the fact that the memory might have a large latency, but inherently works in a parallel way. I have shown you, by means of references, that this really works, and I have sketched how it connects to multiserver microkernels. I have to thank all my colleagues from our Huawei in-storage NDP team, because of course this is not just my work, so feel free to contact them. And if you have any questions, I would be happy to answer them. Thank you for your attention.

Any questions? Yes, please. I was considering your implementation of the MySQL processing. The first roadblock or hurdle that I would see for that is the data localization and data alignment problem: MySQL is layered on a file system, which is layered on a volume manager, which is layered on and on. So how are you planning to sidestep the data locality problem? Are you planning to implement an entire storage engine?

So the question is how we plan to manage the spatial locality of the data in the case of the MySQL storage engine we are implementing, and how we skip the file system layers and so on between the storage and the database. I will start with the last part. We first implement our storage engine as block-based, so we just avoid the file system completely. We also avoid the kernel completely in the first step, by accessing the NVMe SSD controller directly from user space. Eventually we will obviously need to include, at least for some scenarios, all these intermediate steps: we would have to keep the kernel there for arbitration, we would have to have the file system there for arbitration, and yes, we are thinking about this. This is exactly why we also think about the potential security problems and issues from the very beginning; we don't want to somehow stick or glue them onto the finished solution afterwards. So basically, the file system driver, in the kernel, or in user space if we are talking about microkernels, would have to instruct the NVMe driver, and in the end the SSD controller, which parts of the data belong to individual files, so that the NDP code running on the controller knows the boundaries. That is the solution from the security point of view. About the performance: yeah, you are basically saying that the file system layout can screw up the benefits we might get. Software-hardware co-design: the file system needs to be aware of it. Some of the current file systems are somehow aware, or can be made aware, of the internal configuration of the SSD drive. We can even think about implementing part of the file system itself as NDP code running on the controller.
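The boundary table mentioned in that answer could be as simple as the following hypothetical sketch, invented for illustration: the file-system driver registers the LBA extents a tenant's NDP code may touch, and the controller checks every access against them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One LBA extent the file-system driver has registered for a tenant. */
struct extent { uint64_t first_lba, last_lba; };

/* Controller-side check: is the requested range fully inside one of the
 * registered extents? If not, the NDP request is rejected. */
static bool ndp_access_allowed(const struct extent *ext, size_t n,
                               uint64_t lba, uint64_t count)
{
    for (size_t i = 0; i < n; i++)
        if (lba >= ext[i].first_lba && lba + count - 1 <= ext[i].last_lba)
            return true;
    return false;
}

int main(void)
{
    struct extent file[] = { { 100, 199 }, { 400, 479 } };
    printf("read 150..159: %d\n", ndp_access_allowed(file, 2, 150, 10)); /* 1: inside      */
    printf("read 190..209: %d\n", ndp_access_allowed(file, 2, 190, 20)); /* 0: crosses out */
    return 0;
}
```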
There are many possible approaches; we are not there yet, but we are thinking about it. I hope that answers your question.

The next question was again related to security: how do we handle the fact that the firmware itself can be buggy, if I understood it correctly? Frankly, this is out of scope of our work. The firmware can be buggy right now: if you look at any real firmware of any real SSD controller, a multitude of things can go wrong. It can simply access the wrong parts of the memory if it is implemented incorrectly. So on the very basic level, there has to be some contract and some level of trust between the OS and the firmware, of course. And once we establish this base of trust, we need strong assurances that the operating system, or the end user, is not offloading something dangerous to the storage, which can be done in at least two ways. One way is isolation on the hardware level, using the MMU. Of course, now everybody immediately thinks about Meltdown and Spectre; so yeah, we assume these issues will eventually be fixed by the CPU manufacturers. The other way is safety by definition: having bytecode, or some other form in which you upload the code, that is inherently safe, that can be checked statically by some static checker or verifier in the firmware to ensure it really does only what it is supposed to do, that it won't crash the firmware, and so on. These are the two components of the security.

The next question is how we solve the potential problem of having different transports between the different microkernels running on the different nodes of the distributed system. Again: evolutionarily. First, we assume that those systems can use, or can negotiate, some lowest common denominator as a transport; maybe it is an L4 protocol or something like that. That is basically possible. If it is not possible, or if we go far into the future, where we think about different communication mechanisms, that again could at least theoretically be solved by some connectors, some adapters, that adapt one transport to the other. Think about network bridges. Yeah, they are not popular, they are not used much, people try to avoid them as much as possible, but if you need to connect two networking technologies which are different, you can have a bridge between them. That is the approach, generally speaking, if that satisfies you. And the follow-up question is whether this would need to be transparent: yes, ideally, of course.

The last question is how application developers would make use of this code offloading. Well, again, when you look at it from the low-level implementation point of view, you would have some relatively small, reasonable functions that you would cross-compile to the bytecode or some other form, and you would be able to transport them, upload them to the device, and then just run them. But of course this is not very satisfactory; that is the last point I mentioned. Ideally there should be some high-level interface, some high-level abstraction for doing this. If we think about this as a data-processing problem, there are paradigms, computational models, such as data flow and so on, which could be used to make it more approachable.
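Tying the safety answer and the offloading answer together, here is a toy sketch of the "verify before you run" idea for uploaded code; the three-opcode bytecode is invented purely for illustration and has nothing to do with the actual prototype.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* An invented three-opcode bytecode, just to show the shape of the check. */
enum { OP_LOAD, OP_ADD, OP_HALT };
struct insn { uint8_t op; uint32_t arg; };

/* Static verifier: the program must end in HALT and every LOAD must stay
 * inside the input buffer, so the firmware never executes an unchecked
 * access. Anything else is rejected before the code ever runs. */
static bool ndp_verify(const struct insn *prog, size_t len, size_t input_size)
{
    if (len == 0 || prog[len - 1].op != OP_HALT)
        return false;
    for (size_t i = 0; i < len; i++) {
        switch (prog[i].op) {
        case OP_LOAD:
            if (prog[i].arg >= input_size)
                return false;                  /* out-of-bounds read */
            break;
        case OP_ADD:
        case OP_HALT:
            break;
        default:
            return false;                      /* unknown opcode */
        }
    }
    return true;
}

int main(void)
{
    struct insn ok[]  = { { OP_LOAD, 8 }, { OP_ADD, 0 }, { OP_HALT, 0 } };
    struct insn bad[] = { { OP_LOAD, 4096 }, { OP_HALT, 0 } };
    printf("ok:  %d\n", ndp_verify(ok, 3, 512));    /* 1: accepted      */
    printf("bad: %d\n", ndp_verify(bad, 2, 512));   /* 0: OOB load      */
    return 0;
}
```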
But again, software is complicated. If you are implementing a web application, just as an example, you don't really care how the browser makes use of the GPU to run your web application faster. There are layers of software, and different layers have different responsibilities. So ideally, the end programmer should be mostly oblivious to all of this. So thanks again for your attention.