Hello everybody. Okay, hopefully you can all hear me; I'll adjust the microphone a little bit. So, again, hello everybody. My name is Alexey Brodkin, and today we are going to talk about how Zephyr RTOS can help you leverage all the benefits of a modern multicore design.

First, a couple of words about myself. I've been doing embedded software development for more than 10 years now, mostly on modern 32-bit microprocessors, or microcontrollers if you will, and I've been doing a lot of open source as well. Lately I started working with Zephyr RTOS.

So what are we going to talk about today? First, I'll try to convince you that there actually is a demand for multicore in deeply embedded systems, so hopefully you'll see the real need for it. Then we'll discuss two fundamentally different approaches to multicore design, and we'll see how those map onto Zephyr RTOS and how they can be used there. Then we'll spend a bit more time on SMP, because on the one hand it is the more standard approach, and therefore applicable to a wider variety of hardware, but on the other hand it is also quite tricky, because it requires you to deal with a couple of interesting things.

So why might we actually want multicore in embedded systems? It may sound strange, but we are getting there. The thing is, there is no such thing as enough performance: whenever we have more performance, we try to do more fancy stuff, and at some point we end up with way too much work to be done. So we try to achieve more. What do we need to do for that?
We need to execute more instructions per second. Setting aside improvements in CPU architecture, that means executing instructions faster, so we push our clock higher and higher, and at some point we hit one or both of two walls. Either we cannot scale the frequency any longer because our process technology doesn't allow it, or we are already dissipating so much power that our packages cannot take it, so again we cannot increase the frequency.

But there are other challenges too. For example, what if we have a critical task that we cannot afford to switch off in order to let another task, another application, run? Then we want that particular thread of execution to always have real hardware to run on, so that we can start it immediately, or never stop it at all. And here is another very interesting case: we may have a very specific workload, like a convolutional neural network or DSP of some kind, which, if implemented on a normal general-purpose CPU, will ruin the overall performance, because it will consume so much compute time that you won't be able to do anything else, even with a modest amount of DSP or CNN work.

So what can we do? We can simply use more CPUs, either more of the same CPU cores or different CPUs altogether, and with that we get scaling of megahertz.
We used to have one CPU, and now we have two: if our megahertz budget used to be, say, one thousand megahertz, now we effectively have two, three, four times that. So we can obviously do more, or we can keep one core dedicated to one thing and leave the rest for everything else. That's how we gain real parallel execution, which also helps. Or we may want to use special accelerator cores which do one thing very efficiently, like that CNN or DSP workload.

Here are a couple of real-life examples. LTE modems are interesting because the amount of computation needed to implement an LTE stack is tremendous, and it is not only the control stack: there are a lot of DSP calculations. So typical designs consist of a couple of DSPs, or even ASIPs, plus a general-purpose CPU. And why multicore? Because we may have a telephone call, a voice call, in progress, and we want to process that without interruption while we are also transferring data. Another good example is audio/video DSP: you want your cell phone to play music without draining your battery in an hour. Or, as I mentioned already, AI and video recognition: you want a separate core which may have a couple of thousand MAC units, each doing a multiply-accumulate per cycle, compared to the one or two MACs you have on a normal CPU; that can be very efficient. And then you may have a user interface, which consumes a lot of compute resources, and you still want it to stay smooth while everything else is going on.

That's the typical design evolution we have seen lately: we used to have only one core, then we went to multiple cores, and it still looks quite simple. Take your cell phone. You want to play music on the go and keep your battery from draining in an hour, so you add a DSP core which does only one thing: it decodes audio and feeds your DAC, for example. That core communicates with your CPU through a special mailbox, say. It doesn't have to be a mailbox, but let's assume so, to illustrate the complexity we may face. Then you want something else: you want your phone to consume almost no power in standby mode, but still react to things, counting your steps while you walk, or waking up when you say "hello, my device". So you add a separate, very scaled-down CPU core which only talks to the sensors and sends an interrupt to the main core whenever it decides it's the right time. And then you want to recognize objects with your camera, and that's how you end up with a vision processor, which again has a couple of thousand multiplication units and can do that recognition in a snap without consuming a lot of power.

What do we see here? The first part is an AMP cluster, where AMP stands for asymmetric multiprocessing, which means we have two completely different CPUs, and we have to deal with that somehow. And there is something more generic, an SMP cluster, where SMP stands for symmetric multiprocessing, which means we have exactly the same CPUs. The picture as a whole is a quite good example of a modern heterogeneous SoC, and most SoCs these days look like that: quite complex, with different processors and different means of communication between them.

Now let's take a deeper look at those different types of multiprocessing. AMP first, because in some sense it is the simplest: you may take completely different CPUs, which differ even in instruction set architecture, which have access to different peripherals, different memories and all that, and still put them into the same design, into the same SoC, and it will work. They may use various communication channels, or no communication channels at all. You have a lot of flexibility here, but you have to pay for it. It's hard to implement something which works for any design, and it's hard to add yet another core, because you need to think about software partitioning before you actually deploy onto your SoC, onto your device. It's also hard to update later: you need to update the entire firmware, and if you have an FPGA as one of the members, you have an even harder time. So it's not scalable, and you need to think a lot in advance; but it is simple.

Then you have SMP, symmetric multiprocessing, which is easier and harder at the same time. It's easier because you may, for example, run the same software, recompiled or even in non-recompiled form, and it may accommodate from one to, say, four cores and scale automatically across the execution units you have, with the help of its built-in scheduler. What is more complex is that it has quite strong requirements. If you want to run the same binary on all the cores, you obviously need the same memory, which they share, and if you have caches, you even need those caches to be coherent, which by itself adds hardware problems. And you need to think about scheduling at runtime, which is both good and bad. It's good because you don't need to plan everything in advance before you deploy; but then you need to implement the scheduler well enough that it actually lets you use the performance of your hardware in the best way, not wasting time doing something useless.

You also need to think about load balancing, and about the ability to pin your tasks to a particular core; there are quite a lot of things to consider. So why are we talking about all this in conjunction with Zephyr RTOS? An operating system simplifies development a lot, because it lets us use already-implemented abstractions. This is especially true for multicore designs: if you want to do it from scratch, entirely yourself, you need to think about scheduling, interfaces for communication, drivers and all that. But when you use an existing operating system, which at least provides you drivers and some subsystems, it's a completely different situation, much easier. In the best case, when your board is already supported, all you need to do is create a simple application which prints "hello world", and it gets printed on your console. That's the benefit, and that's why we are implementing special support for both AMP and SMP in Zephyr.

As for AMP in Zephyr: it was there from the very first commit. There was a platform, now discontinued, called Arduino/Genuino 101, produced by Intel, where we had two different cores, an Intel x86 and an ARC EM, working together on the same SoC. They used only shared memory and some control signals for communication with each other. In fact, only the x86 core was able to signal anything to the ARC core, not the other way around; but they shared memory, so they had a channel for data exchange. Now we have more such platforms: NXP boards, ST boards and some others.
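The shared-memory data channel just described is the essence of such an AMP design. As a rough illustration (not the Arduino 101's actual protocol; all names and the memory layout here are invented for the sketch, and the "shared region" is modelled as a static object so the code can run as plain threads in one process), a one-way mailbox over shared memory might look like this:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of a one-way mailbox. On real hardware this would
 * live at a fixed shared address agreed on by both firmwares. */
struct mailbox {
    atomic_uint ready;      /* 0 = slot empty, 1 = message present */
    uint32_t    len;
    uint8_t     payload[64];
};

static struct mailbox mbox; /* stands in for the shared-memory region */

/* Producer side (the "main" core): copy data in, then publish it by
 * setting the flag with release ordering, so the payload writes become
 * visible before the flag flips. */
int mbox_send(const void *data, uint32_t len)
{
    if (len > sizeof(mbox.payload))
        return -1;
    if (atomic_load_explicit(&mbox.ready, memory_order_acquire))
        return -2;              /* previous message not consumed yet */
    memcpy(mbox.payload, data, len);
    mbox.len = len;
    atomic_store_explicit(&mbox.ready, 1, memory_order_release);
    return 0;
}

/* Consumer side (the "other" core): poll the flag, take the message,
 * then mark the slot free again. */
int mbox_recv(void *out, uint32_t max)
{
    if (!atomic_load_explicit(&mbox.ready, memory_order_acquire))
        return -1;              /* nothing there yet */
    uint32_t n = mbox.len < max ? mbox.len : max;
    memcpy(out, mbox.payload, n);
    atomic_store_explicit(&mbox.ready, 0, memory_order_release);
    return (int)n;
}
```

The release/acquire ordering on the flag is what guarantees the consumer sees a complete payload before it sees `ready` set; on a real two-core system, plain loads and stores would not be enough.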
I think in the future we'll see more. Here you can see how the communication between cores used to work on that board: on power-on, the x86 core just put a value into a control register, which generated the signal to actually start the ARC core, which then signalled back whether it was able to start or not. It was that simple, we didn't need to implement any other fancy stuff, and still it was a true multicore system.

Also in Zephyr we have OpenAMP, the de facto standard for AMP systems. It allows a lot of flexibility, using VirtIO as a transport interface, so it can be used in completely different situations, for different designs. If you're interested, I recommend the presentation given at Linaro Connect last year, I think; you have a link here, so download my slides and you'll find much more information there. Besides full OpenAMP, we have a sort of slimmed-down relative of it which only does RPMsg messaging: with it you cannot control the execution or life cycle of your software, but you can exchange data between cores. It is very tiny, small and nice, and very suitable for really deeply embedded things. Again, if you want more details, there is another link to another presentation which covers that in much more depth.

Now, about SMP in Zephyr: it started to appear much later, and you'll probably understand why. It was first implemented in February 2018 for ESP32 boards. That was interesting because that SoC actually has two cores, but the only means of communication between them is shared memory; they don't even have cross-core interrupts, which I'll touch on a little later. Then, almost exactly one year later, SMP support appeared for x86 in QEMU. One of the reasons to support a 64-bit version of x86 was precisely to exercise SMP in simulation: before that you had to use those particular boards, which you may not have on your desk, and with QEMU you can play with SMP on any computer, which is quite convenient. And finally, a couple of months ago, we introduced SMP support for the ARC architecture. That's interesting because it is the first real hardware fully supported for SMP: we have not only shared memory but also cross-core interrupts, which really helps to implement SMP efficiently. And we support both a simulation platform and a real board.

Speaking about SMP in a bit more detail: there are still things we might improve, because these are early days. But anyway, at this point we use shared memory; we use cross-core interrupts to inform another core that it needs to drop the task currently being executed and switch to something more important; we use a cluster-wide clock, so every core can tell when it's time to do something else, for example to run the scheduler again; and we use atomic instructions, because otherwise it's hard to implement synchronization primitives.

Another interesting feature, introduced not long ago, is the ability to pin a task to a particular CPU core. As I mentioned, sometimes we want to do that, because migrating a task from one CPU core to another can be quite costly. It might not be visible to the software developer, but internally we lose a lot of cached state: obviously we have instruction caches, data caches, multiple levels of caches, but that's not all. We have branch predictor state; and if we used an MMU, which we don't use in Zephyr so far, we would have the TLB as well. When a task moves from one core to another, all that cached information is lost, and it has to be warmed up again before the task can go full steam. That matters a lot, especially for Zephyr, which targets very deeply embedded things where every cycle counts; we cannot spend a thousand cycles just on a task switch.

Anyway, there is a way to pin a task, though only with one type of scheduler, the so-called "dumb" scheduler, which is just a simple queue of threads. Why don't we have it for the other scheduler types? Because with those, a task of higher priority by definition won't be made to wait; it executes until it is done, so a pinned task of higher priority that keeps running would occupy that core indefinitely. That's why pinning is limited to the simple scheduler for now.

So what else are we going to do? Obviously, we need to add more platforms, because so far we have, as I mentioned, Xtensa, we have x86, which doesn't make much sense in the deeply embedded world, and ARC.
Obviously we need to add ARM there, RISC-V, and whatever else: MIPS, probably, if it is still alive and of any interest. Then we want to add more benchmarks and tests, so you can get a feeling for how your SMP implementation really behaves, and there are quite some quirks. We have a very basic test so far, which I used personally for development, but we need more; we'll get to that a little later. At some point we may want to support more cores in the cluster: there is no real technical limitation behind the current four cores, it's just that we have never used more, and it's easy to add more when needed. And obviously we want to think about a more sophisticated, smarter scheduling mechanism, which will take into account exactly the things we need to care about, like the migration penalty and the peculiarities of a particular CPU or a particular design.

Now, speaking about the things we had to do in Zephyr to support SMP: we had to do quite a lot, but interestingly, most of it was architecture-independent, which means that once it was done, all architectures could benefit from it.
The generic parts can be reused as-is, and each architecture only needs to implement one tiny thing: the functionality that actually switches threads. That's pretty much the only thing required from the architecture.

So what did we have to do? We had to add initialization of the slave cores: previously, with only one core, one execution unit, we just initialized that one, but now we need to initialize the other cores too. They might start halted, or they might start running, in which case we need to halt them, initialize them, and then let them run again, executing useful work.

We also had to rework all the locking primitives. When you have only one execution unit, the only problem you can face is an interrupt, because that's the only way you can re-enter code you were already executing. With multiple cores, the same code can simply be entered by another core, so now we need spinlocks, so that a critical section cannot be entered by anybody else while one executor is already inside. That obviously adds complexity, and we pay for it: whenever two cores want to enter the same critical code path, one of them waits, just wasting time, which is not good, but we have to do it.

We also had to improve the scheduler so that it knows about multiple execution units. Before, we had one execution unit and a long list of threads, and we just executed them one by one; now we have several execution units and need to decide how to schedule those tasks on different CPUs simultaneously. There were quite a few complexities there, and in the end we had to implement a somewhat different task-switching mechanism: before, there were a couple of locks in code implemented in low-level assembly, and we didn't want any locks implemented in assembly. So we moved most of the code to the generic part and left only a very minimal amount of architecture-specific implementation, which is the part written in assembly. Quite a lot of things were done, and it works quite well now.

Now let's talk about the hardware peculiarities you need to think about in a true SMP system; this applies not only to Zephyr but to any other operating system, or to no operating system at all. Here you see a block diagram of, again, a fairly modern SoC, which consists of two cores and some other things. What is important for SMP? First, we need exactly the same instruction set architecture, so that the same binary can be executed by all execution units. Second, we need shared memory, and with shared memory it's not that easy. I mentioned already that if you have caches, they have to be coherent. But there is another thing: a lot of embedded systems have very fast on-chip memory, which is nice, because instead of a couple of hundred cycles of latency you get one or two. However, since we came from single-core designs, a lot of those memories are so-called private, meaning they are accessible from one core only. For SMP that doesn't work: if we use the same variable, supposed to be mapped to address X, one core may write something there, but the other core won't read what the first core wrote. So we can only use memory which is shared between all the CPUs and visible in exactly the same way from each of them. It might be possible to reach that private memory from another core through some debug interface, for example, but that won't work for us.
You need to access exactly the same variable, at exactly the same address, and read the same value that was written there by the other core, so you have to be careful with that.

It's also important to have the ability to implement interrupts between cores, so that one core may signal another. Otherwise, what happens? Without cross-core interrupts, a core starts to execute a task and keeps executing it until it decides, on its own, to run the scheduler and pick up another thread, another application. But what if we already know that it should drop what it's doing, because something of higher priority has arrived? Since we have no way to inform that core from the outside, we just have to wait until it gets an interrupt from the timer, for example, runs the scheduler, and only then understands: okay, I need to do something else. If instead we can inform it from the outside, force it to do something, or at least trigger execution of its scheduler, that lowers latencies significantly.
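Several of these mechanisms rest on atomic instructions. The spinlocks mentioned earlier, for example, can be sketched in portable C11; this is a simplified illustration of the idea, not Zephyr's actual implementation (Zephyr's real `k_spin_lock()` additionally masks interrupts and returns a key):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

/* A minimal test-and-set spinlock: roughly the primitive an SMP port
 * has to provide. Interrupt masking is omitted for brevity. */
typedef struct {
    atomic_flag locked;
} spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* Atomically set the flag; if it was already set, another CPU
     * holds the lock, so burn cycles until it is released. */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire)) {
        /* busy-wait */
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Demo: two threads bump a plain (non-atomic) counter under the lock.
 * Without the lock the final value would be unpredictable. */
static spinlock_t lock = { ATOMIC_FLAG_INIT };
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lock);
        counter++;              /* the critical section */
        spin_unlock(&lock);
    }
    return NULL;
}

long run_demo(void)
{
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;             /* with the lock: exactly 200000 */
}
```

Primitives like this cover mutual exclusion, but the kernel still needs the cross-core interrupt described above to nudge a remote core out of whatever it is running.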
So cross-core interrupts are really a hard requirement. And a cluster-wide clock, as I mentioned, is also important, because it makes it much easier to track time and to know when something needs to happen. That's how it all works.

Now, about the challenges we face when developing software for an SMP system. We need to think about scheduling, because we do it at runtime, and if we don't do it right we lose performance for nothing. We need to think about the cost of migration between CPUs, and we need benchmarks for that, because otherwise it's hard to understand what's really going on. And we have to take care of shared resources: we cannot use the same peripheral from two cores simultaneously, so we need to implement all that locking, and that's another place where we may lose some performance.

Now I wanted to show how easy it is to use. If your platform and your board are supported, then in the configuration utility you just say: okay, I want to use SMP, I have this many cores; you rebuild, and it runs. It is that simple. As for tests, which we don't yet have in sufficient quantity to actually measure what kind of scaling we get with multiple cores, I had to implement my own application, which is yet to be accepted, even though there are no more review comments, so it will probably be pulled in any day now. What the application does is create multiple threads, each of which just computes pi to a certain precision, a certain number of digits. When I compiled and ran it, I used two different precisions: one was 120 digits and the other was twice that. Here you can see how the performance actually scales, and it's really nice to see that when each task consumes quite a lot of time, I mean it is not finished very rapidly, you can get almost a 3.5x performance bump, which is good.
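In spirit, that test application is just a set of identical CPU-bound worker threads. Here is a much-simplified portable sketch, with POSIX threads standing in for Zephyr threads and a fixed-length series standing in for digit-exact pi; the thread counts and precision of the real sample are only as described above, and everything below is illustrative:

```c
#include <pthread.h>

#define NUM_WORKERS      4
#define TERMS_PER_WORKER 1000000L

/* Each worker computes the same pi approximation independently. The
 * point of the original benchmark was not to split one computation,
 * but to keep every core busy with identical CPU-bound work and then
 * compare wall-clock time against the single-core run. */
static double results[NUM_WORKERS];

static void *pi_worker(void *arg)
{
    long id = (long)arg;
    /* Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... */
    double sum = 0.0, sign = 1.0;
    for (long k = 0; k < TERMS_PER_WORKER; k++) {
        sum += sign / (2.0 * k + 1.0);
        sign = -sign;
    }
    results[id] = 4.0 * sum;
    return NULL;
}

double run_pi_workers(void)
{
    pthread_t tid[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, pi_worker, (void *)i);
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return results[0];  /* all workers compute the same value */
}
```

Timing a run with one worker against a run with four workers on a four-core machine should give the kind of scaling curve shown on the slide.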
It means that with four cores you get almost four times more work done. But when you start decreasing the amount of work done in each task, you can see how significantly performance may drop: even with four cores, you barely get past 2x. And that's not all. If I go back to results we got on Linux, from one of the tests of the EEMBC MultiBench suite, you can see that there, execution on four cores doesn't even give a 2x improvement. What that basically means is that, depending on your use case, you may either get quite a nice performance improvement or pretty much nothing. So it's important to do good profiling, not just estimation but preferably real profiling, to figure out what your workload is and how you can improve on it.

Getting to the end of this quite short talk, here is what I wanted to highlight. Zephyr provides you with enough capabilities to leverage multicore designs of different types, be it simple, ordinary SMP or a heterogeneous system, because alongside SMP you may have, as I showed on my block diagram earlier, different CPUs as well; so really SMP plus AMP at the same time is also possible, and that's what we typically have. It's always good to keep your hardware team and your software team in sync, so that the hardware you design is already supported by your software, and, from the software standpoint, you have all the hardware interfaces and mechanisms that you'd like to use. And I invite you to participate in Zephyr development: it's a very nice community, with all the development on GitHub. You are welcome with your bug reports and pull requests, and I'll be happy to see more people contributing. Thanks, that's pretty much it from my side.
I'll be happy to answer questions if you have any; we have about five minutes for that. Sure, are we going to provide a microphone, or do I need to turn something on? Okay, but it won't be captured on the recording.

Q: I noticed in your code example for calculating pi that you were using cooperative threads. Doesn't using cooperative threads in SMP, running them on multiple cores, break the cooperative contract?

A: Excuse me, could you please repeat that?

Q: In your code example you were using cooperative threads, unless I'm mistaken, because you used the macro to create threads of cooperative priority. And cooperative threads by definition execute sequentially; if you execute them on multiple cores, that means they're executing at the same time, in parallel.

A: Yeah, but that's the point.

Q: The point of a cooperative thread, in theory, is that you know you're not going to be preempted, by another cooperative thread at the very least, really only by an ISR. By doing that you're breaking the cooperative-thread contract.

A: Well, here I did it intentionally: I just made the priority the same for all those extra threads, and even higher than the main thread, so that all the computational power you have is used by the worker threads. Given they have the same priority, they get scheduled onto as many CPUs as you have, and whenever one task ends, another one gets scheduled onto that CPU. That's why I used 16 threads here, given I ran it on a four-core configuration.
It means we get plenty of reruns of pretty much the same thread on different CPUs.

Q: I do understand that, but I wonder if Zephyr should rethink whether two cooperative threads should be allowed to run concurrently on two cores. Certain subsystems that we have now rely on the fact that a cooperative thread cannot be preempted by, cannot run concurrently with, another cooperative thread; that's going to break some of those subsystems. You know what I mean?

A: Well, probably we'd better discuss it offline, because I don't quite understand yet, and I'd like to figure it out.

Q: I just wanted to raise it, because we saw that and immediately thought of some subsystems that rely on that fact. If nevertheless you compile them for an SMP system, and those cooperative threads actually get scheduled to run on multiple CPUs, that would break today.

A: Yeah, okay. Anybody else?

Q: I have one question. I've seen in the slides that you are introducing a global SMP lock. About 12 years ago there was huge work in all the big operating systems, Linux, BSD and so on, to get rid of that concept, because it slows down the whole kernel. Is this perceived as a temporary solution until per-subsystem locks appear, or, because of the simplicity of Zephyr, are we going to keep a giant lock?

A: I think we'll go the same way Linux development went; we are not planning that far in advance. As long as it works, that's good. I understand there are quite a few limitations to it, but if there is a better solution and somebody is willing to fix it, because it really hits us on the performance side, I'm pretty sure that will be done.
But so far, again, as I mentioned, we are in quite early days here, and I don't think there are any real products which use this yet. Whenever we start seeing people putting it into real products, essentially, it will get fixed.

Q: I think at the very beginning the giant lock is perceived as something easy to use that solves the problem. The question is whether we could avoid the mistake made by the others.

A: Well, probably. I don't have any ready answer. Anas may correct me, but I don't think we have anything filed in the issues about getting rid of it.

Q: Okay, I will file it then.

A: Sure, that's good.

Q: Just a question about OpenAMP. Last year there was a big step, an OpenAMP release that should improve...

A: Please talk a little bit louder, I can barely hear you.

Q: Last year there was an OpenAMP release that should fix part of the footprint concerns. I would like to know whether you compared RPMsg-Lite with OpenAMP based on the old release, or on, I would say, the release which is now integrated.

A: Oh, sorry, I cannot exactly understand your question; the audio is quite bad between here and there.

Q: In October 2018, OpenAMP was released with some improvements in terms of footprint, in terms of API decoupling of remoteproc usage. And I would like to know if the status you showed here was based on this new release or on the previous one.

A: Well, personally I haven't been dealing with OpenAMP that much. I think I see Marine there; probably she knows a little better what the status on that is.
A (from the audience): I just want to say that on our side we've created a benchmark to compare OpenAMP and RPMsg-Lite, and we've actually made it public on a branch, if that's where you're going. And yes, Zephyr integrates the latest version of OpenAMP, minus some commits at the head. So I don't know where those numbers came from, but if you're talking about benchmarking OpenAMP in the context of Zephyr: yes, we do have the latest version, and yes, we've run benchmarks to compare it to other AMP solutions. If you're interested, there's a branch there to actually run it on the LPCXpresso, I can't remember the exact number, the one which was mentioned in the slides.

Q: Okay, thank you.

A: Do we have any more time, or do we need to conclude? Okay, so thanks a lot; I'll be happy to answer any other questions.