So, this session will be a primer on asymmetric multiprocessing. Is there anyone here who has already worked with asymmetric multiprocessing or in that field? Zero? Okay, then I will keep things fairly high level. I will be supported by Laura, a colleague of mine. We work at Kynetics; we do embedded software, especially in the Android space, so what we do is heavily customized OS work on Android. That is the experience we built on top of asymmetric multiprocessing. Let me start with how we got involved with asymmetric multiprocessing. We were involved in an open source project called WaRP. WaRP was a wearable reference platform based on the i.MX6. Because it was addressing the wearable market, the question was: how can we make a wearable save battery? If you have a full operating system on your watch, because you want to display graphics, power consumption is really a heavy driver of your battery life. What we thought was interesting — and here we have the colleagues at Revolution Robotics, who developed the hardware for this community board — was a hybrid architecture. A hybrid architecture is really simple: we have one board with a microprocessor on it, and a UART-based interface that goes to a daughter board featuring a microcontroller. Hybrid architecture means we have two different operating systems working together — or bare-metal code on the microcontroller; we actually used a Kinetis KL16 from NXP — and a full operating system on the embedded board, featuring the i.MX6 SoloLite. The board in the picture is a very tiny board. The idea was to run MCU-driven tasks on the daughter board, like a pedometer.
You have a pedometer on the daughter board, you want to gather data from the daughter board and eventually send that data to the application processor for display, if you have an app. This is an old, very high-level slide of how we handled software in an architecture like that. Basically we invented something from scratch, and it was pretty effective, but not really optimized. It was based on a high-level language, Java. So we thought: say we have something written in Java and we want to gather data from the pedometer — how can we handle that case? Basically we have a sensor attached to the MCU, say a pedometer, and then code handling messages through JSON streaming, for example: you serialize all the information into JSON messages and send it over the UART, the UART is connected to the SoC, and then you have your JVM and whatever library on top. Because you are dealing with a UART, you probably need another library, like RXTX, on top of it, in order for Java to handle data coming from the UART. As you can see, this is an interesting solution. But when the i.MX7 arrived — these packages that carry both a microprocessor and a microcontroller — the use cases extended quite a bit. In other words, we were no longer talking only about battery savings, about how to build a wearable device with more longevity; but also: can we run software in two different, completely segregated domains? We have an MCU domain where we run an RTOS or bare-metal code, and on the CPU we run a full OS, which can be Linux or Android. You are probably very familiar with SMP. SMP is a way for a single OS to deal with multiple cores, because every single core has a limit in terms of speed.
The only way you can optimize time and performance is by splitting things. Your operating system takes care of distributing threads and such across the cores, and the apps on top don't know anything about this, but you gain performance because everything is spread across your multicore platform. Different things happen if you have an asymmetric architecture. Asymmetric means that different OSes run on different cores. This is not really something new: if you are familiar with frameworks like the Mentor Graphics one, or Green Hills, they do these things heavily in automotive, aerospace, and flight-control systems. So what happens if I have deterministic software that has to run on a microcontroller, and then a completely different domain where I, say, display something on the screen, and I don't want the two domains to be coupled? For example, if I have a kernel panic on Linux, it doesn't affect what is going on in the RTOS or in the bare-metal code. Basically, dealing with asymmetric multiprocessing is done through a special layer of APIs. You have two different operating systems — you may have two different full OSes running on different cores, or, if you have a microcontroller in the same package as a multicore CPU, you can run an RTOS or bare-metal code on it. But how do these two worlds actually communicate? We are here to talk exactly about how to handle this communication, which is driven by an API layer that does what is called inter-processor communication. The sweet thing is that it is a very segregated environment, so you can really take advantage of having systems that don't know anything about each other beyond what you explicitly want to communicate.
So you may have different choices. One of the basic differences between mission-critical systems and segregated but non-mission-critical environments is the use of a hypervisor. Again, if you go back to products like Mentor Graphics', they use a hypervisor to manage asymmetric multiprocessing. That means the two CPUs don't communicate with each other directly; they go through a hypervisor, which means that if something goes bad it is the hypervisor that takes care of a kernel panic — it can reboot a guest operating system — and the two CPUs are completely disconnected from each other. They don't have a direct connection between them; they are not sharing any interrupts. So you have strong isolation in that case, and a very secure and robust system, but you add a layer of software. Hypervisors in the embedded world — and there are probably people here who know more than me in that domain — can be divided into different categories. In the embedded world, monolithic hypervisors are heavily used. With monolithic virtualization, the hypervisor takes care of the drivers. When we install VMware on a Mac, we don't need special drivers: the guest OS drivers handle all the I/O operations in the guest OS. Xen does it like this too. But on the embedded side, if you want more performance, you may want a monolithic type of virtualization. That means yes, you have to write your own drivers for the hypervisor; but luckily — we were looking at some technology in that space — they share the same Linux headers for the drivers, so it's not so difficult to recompile a driver that you want to share with your guest OS. On the other side, we have unsupervised solutions, with no hypervisor. That is what we will see today.
So you can basically have a quite reliable system — not as isolated as the first one, but quite good anyway. With the unsupervised approach, with no virtualization, the CPUs actually share interrupts with each other when something happens and you want to exchange messages between the two. We will see a demo and examples of this specific case. What is very important about what we will be seeing — especially with Laura, who will talk about the details of what we developed and what we discovered lately — is how this inter-processor communication actually works. Think about it like an ISO/OSI layered model, like in networking. We have a physical layer, which in this case is shared memory. Then we have a media access control layer, layer two, the link layer, which in this case is provided by virtio. For those familiar with KVM and its paravirtualization, virtio is what takes care of taking memory buffers and putting them in a queue, which is a ring buffer; the other core takes those memory buffers from the queue, processes them, and sends an interrupt when it's done. And then we finally have the transport layer, where there is a driver and a device: it's called RPMsg, and it's a framework. Basically, OpenAMP — I don't know if you are familiar with it — provides frameworks for life cycle and message management between different cores in an asymmetric architecture. RPMsg is built on top of virtio and is essentially a C API. The Linux kernel already had virtio and RPMsg in the mainline kernel, but OpenAMP adds the same features to bare metal and real-time operating systems, where they were not available before.
So they did a great job. I guess now we get into the specific things we have done with the i.MX7, and I'll call up Laura, who recently did a lot of work on the topic. She will present the kind of work we have been doing, which framework we used, which tools we used for debugging, and finally the demo we built to show you how this inter-processor communication actually works in practice. Okay, thank you — Laura, please, welcome.

So, hello, everyone. Thank you for being here, and thank you, Nicola, for the introduction. I will guide you through the steps we took to develop an application on a hybrid system. We chose to develop on the NXP i.MX7 processor, which has a Cortex-A7 core alongside a Cortex-M4 core in the same SoC. We have a master/slave architecture where the Cortex-A7 is the master core and the M4 is the slave. This means the A7 is in charge of loading the firmware and releasing the Cortex-M4 from reset, so it is responsible for booting the M4. From that point on, they run completely independently of each other, and they actually run at different speeds. So we need an inter-processor communication protocol to ensure the two cores can communicate with each other. The hardware module on the i.MX7 board that enables this kind of communication is the Messaging Unit, while the software component that enables communication is the RPMsg component, which uses the OpenAMP framework. We will see this component in a bit more detail. The two cores in an asymmetric multiprocessing system access the same peripherals and memory areas, so we need to make sure there is a mechanism to avoid conflicts; we will go into the details of this. The RDC, the Resource Domain Controller, is a hardware module of the i.MX7 processor that allows you to create isolated resource domains.
So basically, each peripheral or memory area can be assigned to one or more domains. Here in the picture you can see the application core, which uses some peripherals exclusively, and then of course you can have shared peripherals. The RDC makes sure that access to shared resources is controlled and that there are no conflicts. The Messaging Unit is the hardware component on the i.MX7 processor that enables passing messages between the cores. We basically have a set of registers and interrupts: two sets of messaging registers, one on each side — and we need two because the two processors can run at different clock speeds, so we need a way to synchronize them when they are passing messages to each other. Of course, there are interrupts to signal the other core when new data is available. And there are also four general-purpose interrupts on each side that you can use for general-purpose signaling; we actually used some of these in our demo application, and we will see how later. As Nicola said before, the software component that enables inter-processor communication is RPMsg, and it sits on top of the virtio bus. This is a quick overview of how virtio is implemented in the RPMsg framework. In the virtio implementation we have the vring data structure, which has vring descriptors — basically pointers to shared buffers in memory, which represent our actual messages. The virtqueue that we see here on the left is an abstraction that provides access to this vring structure and provides some APIs as well. So our RPMsg component uses this virtqueue abstraction to send and receive messages in the form of buffers in shared memory. From a higher-level view, here is the design of the RPMsg component.
Each remote core in the RPMsg framework is represented as an RPMsg device and provides a communication channel to the master core. The RPMsg device is also known as an RPMsg channel. It has an ID, which is a textual name used to identify each RPMsg channel, plus a source address and a destination address. On top of the RPMsg channels we have logical connections, provided by endpoints, which are represented here as these pipes. Each endpoint has a local address and a callback function associated with that address. Every time a new message arrives on the channel with a destination address equal to the endpoint's local address, the callback function is called and can handle the payload of the message. One thing to keep in mind is that currently, in the RPMsg implementation in Linux, channels are only created dynamically: the remote service sends a special message to the RPMsg bus which says, "I want to instantiate a new channel"; the RPMsg bus handles this message and creates the channel. From that point on a default endpoint is created and communication can proceed. This is the actual implementation of the RPMsg component on Linux. The remote core is, of course, represented as a platform device, which sits on the platform bus. We have the i.MX RPMsg platform driver, a custom driver which takes care of this remote platform device and registers a virtio device. Then we have another driver, the virtio RPMsg bus driver, which acts as a bridge from the virtio bus to the RPMsg bus and registers an RPMsg device. And here we have one last driver, the rpmsg_char driver, which is quite recent — it was introduced in Linux 4.11.
And, of course, it registers itself as an RPMsg driver. As soon as it is probed, it creates a character device which represents our remote core. So it basically exports the remote core abstraction as a character device, which is really useful if you have to write a user-space application that communicates with this remote core. As soon as this device is available, the user-space application can do an ioctl operation on it and request the creation of an endpoint. When this happens, another character device is created, which represents the endpoint — this /dev/rpmsgX character device. From that point on, user space has access to the endpoint and can send and receive RPMsg messages just by writing to and reading from the character device. So, we developed a demo application on an i.MX7 baseboard, the Boundary Devices Nitrogen7. We are actually replicating the same demo on a Toradex carrier board with an i.MX7 SoM. The goal of our demo was to have FreeRTOS running on the Cortex-M4 and Linux — I think the version was 4.9 — on the Cortex-A7. On the Cortex-M4 side we are polling an IMU, an inertial measurement unit: we read data from the sensor and want to send it to the master core and visualize it somehow. One particular use case we wanted to address was safely recovering from a kernel panic on the master core; we will see all the details later. But first, I want to introduce how to prepare your environment for developing an application on an AMP system. The first thing you want to do is bring up your remote core. You have to download the FreeRTOS sources and the GNU GCC toolchain, and you're pretty much ready to go, because in the FreeRTOS sources there is an examples subdirectory with a bunch of applications already ready to be built for multiple boards.
You also have an armgcc subfolder with build scripts, so you can basically just launch those build scripts from a shell and have your binary right away. The second step is to actually deploy the application on the Cortex-M4. You have a couple of choices. I usually prefer to do it from U-Boot: I use the U-Boot USB mass storage gadget to mount the eMMC on my PC and just drag and drop my binary, and then I use the m4update U-Boot command to actually deploy the application to the Cortex-M4. You have a couple of other choices: you can use the remoteproc framework, which is covered by the OpenAMP documentation, or you can use the M4 firmware loader from NXP. Any of these methods will do the job; it basically depends on when you want your Cortex-M4 to be started. Then you can choose which memory you want to run your application from. The preferred memory is usually the TCM, the tightly coupled memory, because it's a small, fast memory dedicated entirely to the Cortex-M4. The drawback is that it's only 32 kilobytes, so you have to make sure your FreeRTOS application is optimized to fit into this small size. As I said before, you can build your applications from a shell very easily using the build scripts, but you can of course choose to use an IDE. There are a couple of commercial choices — here I listed ARM DS-5 and Sourcery CodeBench — which have a lot of tools tailored for AMP debugging. I decided to go the plain Eclipse way: I just took Eclipse and equipped it with a few tools that help with debugging on AMP systems. I was using a Segger J-Link debugger, so I needed the GNU MCU Eclipse plugins to add support for my probe. Then, of course, you will need GDB, and if you are also using a J-Link debugger, you will need J-Link scripts to debug both the Cortex-A7 core and the Cortex-M4 simultaneously.
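The U-Boot deployment path just described can be summarized in a few console commands. This is a sketch only: `ums` is a standard U-Boot command, while the firmware-update and auxiliary-boot command names (here `m4update` and `bootaux`) vary between U-Boot forks — check your board's BSP documentation for the exact names and load addresses:

```text
=> ums 0 mmc 0            # expose the eMMC to the host PC as USB mass storage,
                          # then copy the M4 binary (e.g. hello_world.bin) onto it
=> m4update               # board-specific command: write the M4 firmware
=> bootaux <load_addr>    # release the Cortex-M4 from reset at the given address
```

From this point the M4 runs independently, and Linux can be booted on the A7 as usual.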
Then I also used the FreeRTOS Kernel Awareness plugin from NXP, which is the only piece that is not open source, but it's really useful for debugging a FreeRTOS application: you can see the status of your tasks, stack usage, heap usage, and everything that's useful for this kind of debugging. After you set up your plain Eclipse IDE, it should look like this. You have, of course, the classic breakpoints view — right now I have only one breakpoint on the MCU, but you can have multiple breakpoints on the application processor as well. You have your FreeRTOS Kernel Awareness plugin for debugging the tasks, and then, of course, consoles for debugging both the Cortex-M4 and the Cortex-A7 simultaneously. So, back to the demo application we developed. On the remote core we have a FreeRTOS application, and we sample the inertial measurement unit every 10 milliseconds. We calculate an objective function — in our case, basically the magnitude of the accelerometer, magnetometer, and gyroscope vectors. Then we store these samples in a buffer, which in our case holds only 300 elements. The reason we used this size for our queue is that we wanted to keep our application in the TCM, so we only had 32 kilobytes; it's a pretty small buffer, but it's enough for the demo we are presenting. Items are then dequeued from this buffer and sent to the master core for visualization: we send 10 items at a time, every 100 milliseconds. On the master core we have a much simpler application: a user-space Linux application that just polls the character device representing the RPMsg endpoint and dumps the received data to a text file. Here we have an overview of the architecture of the demo application. On the right side we have the Cortex-M4, which is running FreeRTOS, with two asynchronous tasks.
We have the IMU polling task, which continuously polls the IMU and enqueues the samples in the shared buffer. And then we have the data sender task, which is in charge of the initialization of the RPMsg channel, and of dequeuing samples from the shared queue and sending them to the master core. Here we can see the RPMsg channel as well as the endpoints. On the left side we have the Linux side, our master core, which is running Linux. We can see the rpmsg_char kernel driver, which exports our endpoints as character devices. And finally, we have our user-space application, which interacts with the character devices to read the data and dump it to a text file. We also used three of the general-purpose interrupts provided by the Messaging Unit to signal some states of the master to the remote core; these are all triggered by the master towards the remote core. We call them startCMD, stopCMD, and heartbeat. startCMD is an interrupt used to signal that the master has created the endpoint and is ready to receive data. stopCMD is used to signal that the endpoint has been destroyed — this actually happens every time we close the rpmsg_char client application, since there's no need to keep receiving data if we're not using it. And the last one is the heartbeat. We needed a way for the Cortex-M4 to know whether the Cortex-A7 was still up and running or had crashed for some reason. So basically we generate an interrupt from the master periodically, and the Cortex-M4 just checks whether these interrupts arrive at the expected times; if one doesn't, it assumes that the master is dead for some reason. So yes, this was the previous image, and you can see we have pretty much the same structure. Here we have the control flow on the two cores. On the left side we have the rpmsg_char client on Linux, the user-space application.
We have a first transition from a state where the RPMsg channel is not created to a state where it's up. Then the user application opens the /dev/rpmsg0 character device, and with this action the endpoint is actually created. Then we stay in this S2 state, which just keeps reading data coming in on the RPMsg channel. When we want to close our application, we just close the /dev/rpmsg0 character device, the endpoint is destroyed, and that's it. On the right side we have the flow of the data sender task on the FreeRTOS side. Here too we have a first transition from a state where the channel is not there to the state where the channel is created. Then our task waits for the startCMD interrupt, goes into a state where it keeps sending the IMU samples to the master core, and checks for the master heartbeat. If a stopCMD interrupt arrives, it goes back to just reading samples from the IMU and buffering the data. And if the master heartbeat doesn't arrive at the expected time, the MCU assumes that the application processor has crashed, destroys the channel, and goes back to the first state. So we had to figure out what to do if the Linux kernel panics on the master side. Using a watchdog to just trigger a full reboot was not an option here, because in the i.MX7 architecture, if you reboot your Cortex-A7 core, the M4 core will be rebooted too. We wanted the application to keep running on the Cortex-M4 and just reboot the Linux kernel on the A7. The solution we came up with was to use kexec, which allows loading a new kernel from the currently running kernel. The reason we chose kexec is that it does not reset the hardware devices — it's just like a soft reboot — so it doesn't actually touch our application running on the Cortex-M4. This mechanism is usually used in conjunction with kdump, to also dump the crashed kernel's memory for further debugging.
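The kexec setup just described can be sketched as a command/config fragment. The `crashkernel=` reservation, the `-p` (load-on-panic) flag, and the `maxcpus=1 reset_devices` crash-kernel options are standard kexec/kdump usage; the paths, device tree name, and sizes here are illustrative and board-specific:

```shell
# In the normal kernel's command line, reserve memory for the crash
# kernel (size/offset are board-specific; 64M is illustrative):
#   crashkernel=64M

# From the running system, load the panic kernel; with -p it is only
# executed when the kernel panics:
kexec -p /boot/zImage --dtb=/boot/imx7d-board.dtb \
      --append="console=ttymxc0,115200 root=/dev/mmcblk0p2 maxcpus=1 reset_devices"

# Trigger a test panic to verify the flow:
echo c > /proc/sysrq-trigger
```

Because kexec skips the hardware reset path, the M4 firmware and the reserved shared memory survive the transition into the crash kernel.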
We actually didn't use kdump in this demo application; we only needed kexec. We used two kernel images: one that is booted every time, in a normal boot, and one that is executed only in case of a kernel panic. So we had to add some extra options both to the normal kernel command line and to the crash kernel command line, and then use the option that tells kexec the crash kernel has to be loaded only in case of a kernel panic. Keep in mind that kexec and kdump support on our platform is still experimental, so we had some problems with the demo. We solved some of them, and there are others we still need to look at in more detail; we will see which problems are still open in our demo. So, we have a video of the demo. On the left side we have the Cortex-A7, which is booting Linux, and on the right side we have our Cortex-M4 with FreeRTOS. You can see that the FreeRTOS application has already started, has initialized the sensors, and is actually already sampling the IMU and buffering our data. As soon as the Cortex-A7 is up, we log in and launch our rpmsg_char client application, the user-space application, and we can see the data starting to flow from the Cortex-M4 to the Cortex-A7. We are writing our samples to a text file, and you can see that the data is arriving correctly on the master side. Down here we are loading the crash kernel with the options we saw earlier, and here we are actually causing a kernel panic. You can see that the Cortex-M4 says "master is dead": it realizes that the master core is not up anymore, and it just keeps running, while the Cortex-A7 reboots into its crash kernel. As soon as the crash kernel is up and running, the RPMsg channel is recreated, and we can relaunch our user-space application to see that the data is recovered.
That's what we're doing here: the Cortex-M4 starts sending all the data it has sampled and buffered to the Cortex-A7, and the data arrives on the master side. So that's it — back to our presentation. Okay, so here are the problems we encountered during the development of this demo application. First of all, before announcing the remote service, the remote core checks whether the master core is up or not. We saw that if this notification arrives too early in the boot process of the crash kernel, the system might hang. We did a kind of workaround by adding a delay between the moment the remote core realizes the master core is dead and the moment it tries to re-initiate the RPMsg channel. But we didn't understand very well why the system sometimes hangs when this notification arrives too early. The other problem, which is a bigger one, is that kexec itself still hangs sometimes, and it happens randomly — it's really hard to debug. But we saw that it happens more frequently when the flow of data from the remote core to the master core is continuous. So I think something goes wrong with the interrupts that are arriving: if there are too many interrupts, the system might hang. We will need to do some further investigation on this problem, because it's hard to debug and we didn't solve it in time. So that's it — if you have any questions? Oh yeah, sorry. Questions, yep.

[Audience question] Yes, actually, on the remote core side you use the RDC to say which peripherals and which memory areas are dedicated to the real-time core and which are not, so they're completely independent and access to resources is safe.

[Audience question] Yes, this is true: we need a way to make the real-time core aware of the status of the master core. For that, we used the general-purpose interrupts provided by the Messaging Unit.
So I just assumed it was kind of safe to send those interrupts — the Messaging Unit should take care of this, actually. I'm sorry, can you repeat?

[Audience question] No, there is a packet structure, but it's only data written to shared memory; it's not like a real network packet.

[Audience question] It depends on when you want your Cortex-M4 to be started — whether you want it started before the Cortex-A7 or later. I didn't actually study the performance, and I don't really know if there are differences in performance between the methods; I always used the U-Boot method.

[Audience question] No, we're using OpenAMP — I mean, FreeRTOS has the whole OpenAMP middleware already integrated. No, we used the plain OpenAMP implementation. And I think it can also work with other combinations, like one RTOS with another RTOS: the OpenAMP implementation covers a bunch of use cases, and you can also use an FPGA. There are multiple scenarios where you can use this framework, and it's all documented in the OpenAMP documentation.

[Audience question] We didn't actually measure the throughput, no.

[Audience question] Is the RPMsg data buffered in the driver? Yep. Sorry, I can't hear you — yeah, actually, the rpmsg_char driver uses socket buffers to store incoming data and export it to user space. No, it just uses socket buffers.

[Audience question] Did you gather any performance data, like latency? We wanted to, but we didn't, because we had a hard time getting the whole system to work with kexec, so we didn't make it in time, sorry.
[Audience question] You mean you don't want to go through the hypervisor? Yeah. So the question was how much effort it took, and how the memory separation is handled by the framework — and the same question regarding the Messaging Unit compared to the i.MX6. No effort at all, really: in our case we were using the Boundary Devices board, and they already provide a BSP for FreeRTOS as well as kernel and U-Boot sources, so everything is pretty much ready. The effort is basically just deploying the kernel and the application — it's straightforward. Of course, the more you want to customize things, the more you have to write your own code from scratch; other than that, everything is pretty much ready in the distribution for the board.

[Audience question] No, you actually have different power states available on the two. You can also put the Cortex-A7 into a power-save mode and wake it up from the Cortex-M4 as you want, so power management from that point of view is independent. So yeah, you can do that.

[Audience question] Well, you have options for how to start the MCU: you start the MCU from U-Boot, or you start the MCU from Linux. That's your choice. In our experience the question was: what if the kernel panics? If I reboot everything — if I unplug the power — the MCU state is gone, right.

[Audience question] So you have four cores — you want to know about more cores? Does it still scale?
Actually, in principle you can have more cores, but here we have just one remote core; the RPMsg framework itself would work the same way. On the i.MX7 you have just the one Cortex-M4, so there is just one answer for that. Anything else? Okay, thank you. Thank you.