Hello everyone, my name is Ofer Shinaar. I'm working at Western Digital as a next-generation platforms technologist in the CTO office. My team is leading firmware and toolchain efforts for RISC-V, and today I want to present a feature that we are open sourcing. It's called the Cacheable Overlay Manager for RISC-V; in short, we call it ComRV. It's a software paging mechanism, which I will introduce in a second. The agenda for today: I will talk a little bit about solving code space limitations with software, as a small introduction, then some basic concepts and use cases for this feature, then the building blocks we needed in order to make this happen in software and toolchains, and finally deployment.

Let's start with a short introduction. In the early days of computing there was a technique to load code at runtime, at the moment it was needed. One of the first examples we can see is NASA using this technique in the early days of the shuttle flight control systems, where the code itself was replaced: there was the launch sequence and then the orbit sequence, and the code was swapped between them. So this is a well-known technique; we're just reviving it. The technique was called overlays. It gave an easy interface to the user and the software engineer, there was no need for complex extra IPs like an MMU, and it was tied together with the toolchain and the firmware itself. Today, IoT devices have the same problem with memory: they have a very small memory footprint and they need a solution. Together with RISC-V's issues with code density, this brings up the idea: let's use overlays.

Some basic concepts. The basic concept is that there is an engine in the middle of the diagram here. This engine runs in the fast memory and is responsible for deciding what gets loaded, and what does not, from the storage device on the right, which is the slow memory; it can be whatever storage you wish. It decides what to load into the fast memory, the SRAM; we call that the cache area, or the heap, and this is where the code runs according to the execution flow. The main engine workload is to invoke the overlay calls and also to decide what to evict and what to load.

Now, let's look at how it works today for a normal function. For normal functions foo and bar, with foo calling bar, the toolchain will generate a jump to bar; on RISC-V this is a `jal` instruction, by the way. Now let's see how it looks when you're using an overlay. When you're using an overlay, all that is needed from the user is to put an attribute saying "my function is an overlay", and then write the function as it is: foo calls bar, and that's it. But beneath that, the toolchain will generate a token, or a descriptor as we call it here, and a jump to the entry point of the ComRV engine itself. Inside the engine, the engine will parse this token, understand where the function is and where it needs to be loaded, and so on, and decide what to load. Another very important thing to say: once we are inside an overlay function, all the calls to other functions, and all the returns back to the caller, have to go through the engine. That's because when you're coming back from a function, sometimes your caller has been evicted, it is not there anymore, or it got moved. That's why you go back through the engine, which decides whether the caller is still loaded; if it was evicted, it will load it again.
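To make that concrete, here is a minimal sketch in C of what marking an overlay function can look like. The attribute and macro names are hypothetical; the real ComRV toolchain may use different spellings, and the attribute is only meaningful with the modified compiler and linker flow described later.

```c
/* Hypothetical convenience macro; the actual ComRV attribute name may differ.
   Overlay functions must not be inlined, since every call has to go
   through the engine. */
#define _OVERLAY_ __attribute__((overlaycall, noinline))

/* bar is placed in an overlay group and stored in slow memory. */
_OVERLAY_ void bar(void)
{
    /* ... */
}

/* foo calls bar exactly like a normal function. For a normal call the
   toolchain emits "jal bar"; for an overlay call it instead materializes
   bar's descriptor (token) and jumps into the ComRV engine, which loads
   the group containing bar if needed and then transfers control to it. */
void foo(void)
{
    bar();
}
```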
Let's talk about grouping and what it means. Since we are loading from low-speed storage, we want to load as many functions as we can in the same context, meaning grouping. A group can contain overlay functions and read-only data of overlays. When the runtime model decides to load a function, it will actually load the entire group, and this saves us the painful storage load time. We can configure the group size between half a kilobyte and four kilobytes, and it's not fixed, meaning you can get group sizes anywhere between those bounds, in a resolution of half a kilobyte, so you can have all kinds of permutations of group sizes. It just makes sense to group related things together: when you're calling an SPI feature, or a sleep module, or whatever you're developing, you want to put all of those in one group in order to make just one load.

We provide several features to support that. There is the obvious one, which we call manual grouping: the user simply registers functions to a group; he knows what he needs, registers his functions to a specific group, and that group will hold them. There's another option, which we call the automatic option: it is triggered by a grouping tool invoked by the linker. You provide a profile file to the linker, the linker automatically invokes the grouping tool, and I will present the concept of how it works later on. And there is the trivial option where you don't do any grouping: if you don't care about performance at all, then each function will simply be in its own group.

Another concept, which we call multi-grouping, is quite unique to this feature, so let me explain it with an example. Sometimes different software scenarios run the same function. For example, let's assume the cache area, the heap we have, is very small, just big enough for one group. In this example, myFoo is in group A, it is used by function42 in that same group, and it is also needed by function1003, which is in group B. That means that when group B is running and function1003 calls myFoo, we need to evict group B and load group A, and when myFoo returns we need to evict group A and bring group B back. As a result, there are going to be too many loads; it will simply take too much time. The solution, multi-grouping, is that myFoo can live in both groups: it can be in group A and also in group B. Then, when function1003 calls myFoo, it will find it in its own group, with no need to evict anything or load group A.

A little bit about the logic flow of the engine, and it really is just a little bit. Every overlay call passes through the ComRV runtime engine. The engine itself is written in C and assembly, and that's why it is tied to the RISC-V target. The main engine does a few things, and this is its main workload: it loads and invokes overlay functions; it handles the eviction algorithm using cache concepts, like LRU and so on, working on whole groups; and it also has a defragmentation algorithm to close holes in memory when groups are evicted.
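As an illustration of those cache concepts, here is a toy model in C of a cache area managed at group granularity with LRU eviction. It is a sketch only: the names are mine, the slots are fixed-size, and the real ComRV engine additionally supports variable group sizes (512 B to 4 KB), multi-grouping, and defragmentation, none of which this toy handles.

```c
#include <stdint.h>

/* Toy cache area: a few equal-size slots, whole groups loaded and
   evicted with an LRU policy. Illustrative only, not ComRV's real
   data structures. Group id 0 means "slot is free"; real ids start at 1. */

#define NUM_SLOTS   4
#define GROUP_SIZE  512                /* minimum group granularity      */

typedef struct {
    int      group_id;                 /* 0 = free slot                  */
    uint32_t last_used;                /* LRU timestamp                  */
    uint8_t  mem[GROUP_SIZE];          /* fast-memory copy of the group  */
} slot_t;

static slot_t   cache[NUM_SLOTS];
static uint32_t tick;

/* User-supplied hook (see the Q&A): copy a group from slow storage
   (NAND, network, ...) into fast memory. Declared here as a stub. */
extern void load_group_from_storage(int group_id, uint8_t *dest, uint32_t size);

/* Return the fast-memory base of a group, loading it on a miss. */
uint8_t *resolve_group(int group_id)
{
    int victim = 0;

    for (int i = 0; i < NUM_SLOTS; i++) {
        if (cache[i].group_id == group_id) {   /* hit: group already loaded */
            cache[i].last_used = ++tick;
            return cache[i].mem;
        }
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;                        /* remember the LRU slot     */
    }

    /* Miss: evict the least recently used group and load the new one. */
    load_group_from_storage(group_id, cache[victim].mem, GROUP_SIZE);
    cache[victim].group_id  = group_id;
    cache[victim].last_used = ++tick;
    return cache[victim].mem;
}
```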
Now, the building blocks we needed to make this happen. Together with our collaborator Embecosm, we made a lot of changes to the toolchain. As the compiler we chose LLVM, with Clang as the frontend; the compiler simply creates the special calls for overlay functions, as we saw earlier. As the linker we chose GNU LD; the changes there are in BFD, creating the descriptors, the tokens as we saw, for the functions, plus an offset table for the overlay functions. Those tokens and tables are used by the engine itself at runtime. As the debugger we chose GDB, to provide an easy interface to the user for debugging overlay functions and overlay calls. Sometimes you want to step into a function and just land in your function, without going through the whole overlay mechanism; but sometimes you do want to debug the overlay engine itself, so we need awareness on the debugger side. There are other utilities as well: the grouping tool, some extensions to the map file, and other service utilities that we are going to provide. All of that is to make life easier for the engineer to develop and design, and it's all coming.

So, grouping. I want to show you the grouping tool technique, at a very high level, to give a concept of how it works. What you can see here is a histogram over time. Each color represents an overlay function which has been called, for example, 12 times here and 10 times there; that's the orange one, then there is the blue one, and so on. You can see there is a kind of stage with a lot of activity here, and then a lot of activity here, here, and here. What the grouping tool needs to do is follow and split all that activity and find the hot areas, in order to make a grouping recommendation. So let's separate it and see how it works. The grouping tool gets a profile file, as I mentioned earlier, for all of the functions, and then it starts building histograms for each function. You can see here, for example, in the brown one there is activity for function A, activity for function B, and activity for function C. If you merge everything together, one on top of the other, then obviously you can see something that recommends grouping the brown and the blue functions together. The green one is by itself, so it becomes a group by itself; but if it fits, of course, we would like to merge it with the first ones as well. Following what we saw earlier, this is the initialization stage, and this is part of the steady-state stage.

For deployment: first of all, ComRV is open source. It's already on GitHub; you can access it, you can see it, it's there. It's designed to fit bare-metal software and RTOS-based software; we're currently targeting FreeRTOS, so it will be integrated with FreeRTOS. The first source drop, for bare metal, is already open source, as I said, and the FreeRTOS support is on the way, so it will come soon. Also on GitHub you can see a draft of the toolchain itself; it's already built for Debian targets, and soon it's all going to be open sourced as well, so pay attention. By mid-2020 we're going to have a full deployment of ComRV running on a real hardware platform with the RTOS. Currently it's running on a few other platforms, and it runs on an ISS, an instruction set simulator, which is cool. And because it doesn't depend on any other solutions, it can run on any hardware, but it has to be justified.
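For a feel of the metadata the modified linker produces, here is a hedged sketch in C of what an overlay call token and an offset-table entry might contain. The field names and layouts are guesses for illustration; the real formats are defined by the ComRV toolchain.

```c
#include <stdint.h>

/* Per-call token: instead of emitting "jal bar", the toolchain
   materializes a token like this and jumps to the engine entry point.
   Layout is illustrative only. */
typedef struct {
    uint32_t overlay     : 1;   /* marks this as an overlay call          */
    uint32_t multigroup  : 1;   /* callee may live in several groups      */
    uint32_t group_id    : 14;  /* which overlay group holds the callee   */
    uint32_t func_offset : 16;  /* callee's offset within that group      */
} overlay_token_t;

/* Per-group entry in the overlay offset table: where the group sits in
   the storage image and how big it is (a multiple of 512 B, up to 4 KB). */
typedef struct {
    uint32_t storage_offset;    /* location in slow storage               */
    uint16_t size;              /* group size in bytes                    */
} group_table_entry_t;
```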
Coming next: more demos showing the usage of each ComRV API. The goal is something like an SDK for developers, with all the demos and all the APIs, so you can start designing based on ComRV. This is the GitHub repository where we are providing it; soon it's going to move to CHIPS Alliance, but so far it's here. We're also going to provide a toolchain branch to support it. That's it, thank you very much. Questions? I have some time.

So you would typically be loading your modules from NAND flash, and it's because you don't have NOR flash that you cannot run them directly? Or what is it? That's correct, yes. The question was which storage we are targeting to load from, NAND or something like that, right? The implementation itself doesn't care whether it's NAND flash; it can even be a network, because we provide a hook to the user saying: load this function for us. We give them the source address, the destination, and the mapping for the function, and they decide how to load it. But yes, you're correct: the observation is that it targets NAND when we are talking about Western Digital, right? Yes.

Sorry, can you repeat that? The question was whether this is a re-implementation of the loadable module mechanism in Linux. The answer is no. It's something we have been working with for many years; it's a very bare-metal, very embedded solution. I would not recommend using it on Linux, for example; there is a performance impact every time you go into the engine, right? It's doing almost the same thing, with a variety of changes, and the toolchain support, of course. Yes.

Okay, so the question was whether we are going to use this with Zephyr; the answer currently is no. It all depends on how much effort we put into FreeRTOS. If we keep it very modular, with very tiny changes in the kernel itself, then it will be very easy to move it to any other RTOS, right? So we can do it for ThreadX, for example, we can do it for Zephyr, we can do it for whatever; time will tell. But currently it's FreeRTOS. And the other question?

That's a long question, I will not repeat it, but if I got it right, you're asking whether the cache concepts work like a real cache, right? Like using LRU, and if functions are very hot, not evicting them? That's what you asked? Yes. So the answer is yes, and we're investigating other options to manage the cache itself, not just LRU; for example, we can use LFU, least frequently used. So yes, it's on the plan. More questions? I have time, right? Just a minute, so. A minute, okay.

You had to modify the toolchain, obviously, to do that? Yes. How does that interact with link-time optimization? Okay, that's a good question. It doesn't. And not only link-time optimization, also inlining, for example: each function which is an overlay will not be inlined, and also will not participate in LTO, because these are very unique functions with a very unique flow. Maybe you could still do LTO within a group that you define? That's maybe a good idea, right; we can examine that. That's it. Thank you very much.
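As a hedged illustration of the load hook mentioned in the first question, here is a sketch in C. The hook name, its signature, and the nand_read call are all hypothetical; the real ComRV hook interface is defined by the library. The point is only that the engine supplies a source location, a destination in the cache area, and a size, and the user chooses how to fetch the bytes.

```c
#include <stdint.h>

/* Assumed board-support call; replace with whatever your storage
   (NAND, NOR, network, ...) actually provides. */
extern int nand_read(uint32_t offset, void *dest, uint32_t size);

/* Hypothetical user-supplied ComRV load hook: copy one overlay group
   from slow storage into the fast-memory cache area. A networked
   device could fetch the same bytes over the wire instead. */
void my_comrv_load_hook(uint32_t storage_offset, void *cache_dest, uint32_t size)
{
    (void)nand_read(storage_offset, cache_dest, size);
}
```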