Okay, so it's 10 o'clock, I think we should start. Welcome everybody, and thank you for joining this talk. My name is Andreas Ehmanns. I'm working as a technical advisor for embedded software systems at MBDA Germany, and today I'm giving you a talk about real-time Linux on embedded multi-core processors. Some of you may say: real-time Linux, that's nothing new; multi-core, that's also nothing new for Linux. So what's the content of this talk? The talk is about the combination of both. Around one year ago I was in the situation that I had real-time Linux running on a single-core processor. Hardware becomes more and more obsolete; every new processor nowadays is a multi-core processor. So the question was: how can I get to the right side of this slide, running my existing software on new multi-core hardware? Is this possible, and if so, how could it be done? A few words about the agenda. I already started with the motivation. At the beginning I have to say a few words about Linux and real-time and about latency measurements, but then we come to the second part of the presentation, the interesting one: the migration from single-core to multi-core. You will see a lot of histograms there. You will find that there are a lot of effects on multi-core processors which you should be aware of, and that it's mandatory to know a little bit about the processor hardware architecture. I'm sorry, my laser pointer only works on one side of the screen here, so maybe I'll switch sides during the talk. And of course, at the end there will be a short summary. So, the combination of a vanilla Linux kernel and the applied PREEMPT_RT patch, that's nothing new; this is well established in the embedded area. But on the other side, the semiconductor industry is driving the evolution to multi-core. Nowadays it's a big problem to get a single-core CPU for your system.
So there's one solution you could propose: take a multi-core processor, use one core, and switch off all the other cores. Sure, that works, but you waste all the other cores you have. So the question is: can we use the other cores? In this presentation I will outline one possible way to migrate to multi-core. Of course, this is not the one and only way, the golden way; it's not an answer to every problem. Your system may look different; your hardware and your software may look different. Okay, so when I started one year ago, I had a single-core processor system. It was a 6U VME PowerPC board, a G4 PowerPC running at 1 GHz — a well-known processor in the embedded area. The hardware becomes more and more obsolete, and the question was: can I replace one or several boards with a new multi-core board? Is this possible? Of course, multi-core has a lot of advantages. You have more computing power on each board, you have less power consumption for the same CPU performance, and you have less heat dissipation. These three points are always interesting in the embedded area: you need more computing power, and you want less heat dissipation and power consumption. If you think about a smartphone, for example, these are typical concerns. Of course, there are some disadvantages compared to multi-processor systems. In a multi-processor system, each processor has its own resources: caches, memory, I/O. On a multi-core processor, these resources are shared, so what you can expect are interferences. So that was the hardware side where I started. The software running on the boards is just a vanilla Linux kernel 4.3 from kernel.org with the corresponding PREEMPT_RT patch. Of course, we did a lot of kernel configuration. It's a fully preemptible RT system, configured to be tickless and to use high-resolution timers. We switched off everything we didn't need or which might have a negative impact on real-time behavior, for example CPU hotplugging.
Also power management, dynamic frequency scaling, and similar things. Additionally, we used the tool cyclictest. Cyclictest is part of the rt-tests package and can be used for latency measurements. One general note: I have to give you a little information about this tool, but the tool itself is not the topic of this presentation. So, cyclictest. Who knows the tool, or who has experience with cyclictest? Three, four... there are some. Okay, so maybe it's necessary to say a few words about it. I call this slide "cyclictest simplified" because it's a big tool with a lot of parameters; I just want to give you the idea of what it does. In general, it measures the latency of a response to a stimulus. How is this done? The tool sleeps for a defined time, measures the actual time when it's woken up, and calculates the difference between the expected and the actual wake-up time. In a perfect world you would expect the difference to be zero. But of course, from an interrupt, from a timer event, until the application is scheduled, some time always passes — so you have at least an offset, and sometimes it takes a little longer. This is the main loop, taken from the source code and slightly simplified. You see what I told you before: it sleeps for a time, measures the time when it's woken up, calculates the difference, and then iterates again and again. At the end, cyclictest generates an output: a kind of histogram, in a format which you can feed directly into gnuplot. For more information there is a lot of material on the web; just search around a little. For example, there is a very comprehensive presentation about cyclictest, its parameters, and so on. But for this talk it's only necessary that you understand the idea behind cyclictest. So now we start cyclictest on the old single-core processor system, and that's what you see here: the histogram generated with gnuplot.
On the x-axis you have the time in microseconds; on the y-axis the number of samples. Please note that this is a logarithmic scale, not linear. With a linear scale you have a big peak at the beginning and then there's nothing to see; to see the curve it's necessary to use the logarithmic scale. We put the system under high load. That's necessary if you want to do latency measurements — who wants to deploy an idle system to a customer? And of course it's necessary to do long-term measurements. It makes no sense to run for a few minutes; we are talking about days, weeks, or better months. Sometimes there are infrequent outliers, and those are what you want to see. The red curve is the kernel with the PREEMPT_RT patch, and the blue one is the kernel without it. The red curve means you have a maximum latency of around 25 microseconds. That's not bad. Without the PREEMPT_RT patch you have maybe 70 microseconds, and you could say: fine, that's sufficient for my system. What's the problem? The problem is that the blue curve doesn't end here; it continues to the right. There are outliers up to 5 milliseconds. If I made a plot out to 5 milliseconds you couldn't see the curve anymore, so I just cut it at 100 microseconds; that's the interesting part. And of course, this curve is totally hardware dependent. If you have a different processor — for example, I did some tests with an ARM Cortex-A9 — then you have a bit more of a plateau here, going up to 250 microseconds. Okay, that's the old hardware. Now the interesting part: we want to migrate to new hardware. Since we had a PowerPC in the past and the legacy software uses some features of the PowerPC, especially the AltiVec vector unit, the idea was to buy a new PowerPC — the new PowerPCs from Freescale. Now it's NXP; in a few months it will be Qualcomm.
It's not called PowerPC anymore; it's called QorIQ — that's the PowerPC family now. Freescale — you see, it was acquired by NXP around one and a half years ago — provides boards with the QorIQ processors and a lot of interfaces on them. This is called a reference design board; that's what the RDB stands for. The T denotes the newest QorIQ family. There are the T1, T2, and T4 families for the low-, mid-, and high-end processing area. The next two digits denote how many cores there are — so this is an 8-core system, this is the 24-core system — and the fourth digit is just for special variants: more or fewer interfaces, different interfaces, and so on. So we bought this hardware, and just to give you a short overview: this is the board, and this is the processor on it. It's the e6500 PowerPC core. You have your 8 or 24 cores, it's running at 1.8 GHz, and you have 4 or 12 gigabytes of memory. If you look at the PowerPC family history, that's a step of more than three PowerPC generations compared to the old G4. From the software point of view, NXP delivers a software development kit. It's based on Yocto, and you can create your boot images, your kernel, your initramfs, your DTB file, and so on. The kernel is a little older — it's a 4.1.8 — with the corresponding PREEMPT_RT patch applied. So the first attempt was to just move the kernel to the new hardware. We let the kernel handle all cores in the SMP configuration, and we only changed a few settings in the kernel — it's a new CPU core and different hardware — but in the general setup we didn't change anything, just to see how it runs on the multi-core hardware. Cyclictest in this case is started with an additional parameter telling it to start one thread on each core and bind each thread to its core using affinity. And then let's see what cyclictest reports. That's the histogram here.
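A plausible invocation for this kind of per-core measurement — the exact options used in the talk are not shown, so the flags here are my assumption; `-S` is rt-tests' SMP mode, which starts one pinned measurement thread per CPU:

```shell
# -S      : SMP mode, one thread per CPU, each pinned via affinity
# -p 90   : SCHED_FIFO priority 90
# -m      : lock memory to avoid page-fault latencies
# -D 60   : run for 60 seconds (real measurements run much longer)
# -h 400  : dump a histogram up to 400 us, suitable for gnuplot
# -q      : quiet, only print the summary at the end
CMD="cyclictest -S -p 90 -m -D 60 -h 400 -q"
echo "would run: $CMD"
# Only run it where the tool is actually installed (and you are root):
if command -v cyclictest >/dev/null 2>&1; then
    $CMD > histogram.txt
fi
```

The histogram lines in `histogram.txt` can then be plotted directly with gnuplot, as in the slides.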
The interesting thing: this is an idle system; we put no load on it. It's not necessary that you can see every curve here — I know the yellow color is a little hard to see. The interesting thing is that you have this peak at the beginning, as before, but you also have a lot of entries here, from up to four cores. So you could say: are we done? The maximum value is around 80 microseconds; maybe that's sufficient for your system. Of course we are not done — this is an idle system. Okay, the next test is under load, and now you see that it's different. Again around four cores show up. The system is under load: we have CPU load on each core, and we generate heavy traffic on the Ethernet and on the serial line. And if you start a new run, you will see that the colors change — meaning that in the new run, different cores generate the same kind of peaks. The previous histogram ended around 100 microseconds, with the highest entry at 80 microseconds; now we have values above 200 microseconds. But that's expected, because the system is under load now. Again you could say 200 microseconds is okay for my system, but especially if you are an engineer, you want to understand what's going on there — and maybe your system has tighter real-time requirements. Oops, sorry. Of course we did the same measurements on the 24-core system, but you can imagine it's really hard to put 24 cores into one histogram. So I'll just show you the idle case here, and for the discussion on the next slides I only use the T2080, because it's easier to see; the effects are the same. We also did a lot of investigation on the T4240 hardware, and if there are differences to the T2080, I will mention them in the presentation. Okay. As you saw on one of the first slides, we configured the system to let the Linux kernel handle all cores, and that's what Linux is doing.
The scheduler decides on which core a task has to run, and tasks can be migrated dynamically. So the idea is: we bind all tasks to one core, for example to core 0. This can be done with a simple bash script: just loop over the entries in /proc and use the taskset command to assign the processes to core 0. If you do so, you get this picture. You see that there are two cores, instead of four, which have entries in this higher region — but there are still two. And again, the results vary from run to run. You will always see the red one, core 0; in this case core 2 is the blue one, and in the next run it's core 4, 5, 6, or 7. It changes from time to time. Okay, next thing: have a look at the interrupts. Just cat /proc/interrupts. You get a long list; I picked out the serial interface here. You see this interrupt, number 36, the eight cores, and how often each core handled the interrupt. So the interrupts are handled by different cores. The idea is to do the same as we did before for applications: we migrate all IRQ handling to one core, core 0. That's what this bash script does, and we also set the default affinity to core 0 for new IRQs. If you do so, you get this histogram. It looks similar to the previous one, but you have fewer entries in this area — in the previous histogram there was also a curve like this here. So something changed, but we are not completely happy with this: there are still two cores involved, and it changes from run to run which cores they are. So instead of continuing this way, let's try a different approach: we are going to isolate cores from the kernel scheduler. Instead of starting the system and then migrating tasks and IRQ handling back to one core, we tell the kernel at the very beginning to take cores out of the scheduler. There's a boot parameter for the kernel called isolcpus.
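The two bash scripts described here might look roughly like this. The /proc loops, taskset, and the smp_affinity files are standard Linux interfaces; the guard and the error suppression are my additions so the sketch can be dry-run without root:

```shell
#!/bin/bash
# Sketch of the two affinity scripts from the talk. Needs root to take
# full effect; failures are silently ignored here so an unprivileged
# dry run still completes.
command -v taskset >/dev/null 2>&1 || { echo "taskset not found"; exit 0; }

# 1) Bind all existing tasks to core 0: loop over the PID entries
#    in /proc and set each task's CPU affinity mask to 0x1 (core 0).
for pid in /proc/[0-9]*; do
    taskset -p 0x1 "${pid#/proc/}" >/dev/null 2>&1
done

# 2) Migrate all IRQ handling to core 0 by writing to each IRQ's
#    smp_affinity mask, and set the default for newly requested IRQs.
for irq in /proc/irq/[0-9]*; do
    echo 1 > "$irq/smp_affinity" 2>/dev/null
done
echo 1 > /proc/irq/default_smp_affinity 2>/dev/null
exit 0
```

Note that some per-CPU kernel threads cannot be moved this way, which is one reason the later isolcpus approach works better.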
You give it a list of CPU (core) numbers. The man page says: remove the specified CPUs from the general kernel SMP balancing and scheduler algorithms. If you do so, you then move your real-time application, your tasks, to the isolated cores; you can again use the tool taskset to deploy your application to a dedicated core. So the idea now is: let core 0 handle all the kernel and OS stuff, and reserve cores 1 to 7 of the T2080 for the user application. You use this kernel parameter — oh, there's a typo on the slide, it's missing the 's' at the end. Let's have a look at what happens then. You get this histogram. You see core 0 has a lot to do, up to 245 microseconds — that's the highest value here; there is no outlier beyond it in this histogram — and all the other cores are hidden somewhere in the first peak. At the end of each run, cyclictest generates a summary; part of it can be seen here. For each core — one column per core — you get the minimum latency, the maximum latency, and the average latency. You see core 0, that's this entry here, and the maximum value for all the other cores is 13 microseconds. So, we talked a lot about latency measurements. This is necessary and important, to understand your system a little. But of course you don't want to ship a system where only cyclictest is running; you have a real application. If you are in the situation that you have existing software on single-core and you can migrate it to the multi-core system, then you are a happy guy: you can just run your existing real application on the new hardware and watch what happens. If you start a new project, of course, you normally don't have the software at the beginning.
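As a sketch, the boot-time isolation plus manual placement might look like this — the U-Boot `setenv` line and the `rt_application` name are assumptions for illustration; the core list matches the T2080 split described above:

```shell
# Kernel command line (e.g. appended to U-Boot's bootargs): isolate
# cores 1-7 from the scheduler, leaving core 0 for kernel/OS housekeeping.
setenv bootargs "${bootargs} isolcpus=1,2,3,4,5,6,7"

# After boot, nothing lands on the isolated cores automatically; the
# real-time application must be placed there explicitly:
taskset -c 1 ./rt_application &
```

The key difference from the earlier migration scripts is that the scheduler never considers the isolated cores at all, so nothing has to be moved away afterwards.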
In that case, I recommend trying to do a reference implementation, at least of the critical code sections, so that you have something to test on your multi-core hardware and can see how it behaves. But this is not so easy. First of all, you should have an application that can run in parallel; if it's completely sequential, it makes no sense to use multi-core hardware. Then you need time measurement, and time measurement is a topic for a complete dedicated talk — take care there, especially in combination with caching effects. Make sure that your application behaves on the test system as it would on the final target. Simulate or implement messaging if necessary: normally your application gets some data in, has to do something, and sends data out, and this communication is typically much slower than a pure computing algorithm, so it's necessary to include it. Also, if your application does I/O: hardware I/O is typically much slower than memory, so it makes a difference whether you write your results into memory or into real hardware. Oops — and of course you should do long-term measurements to find infrequent outliers. Last but not least, you should check that your application does what it should do. It makes no sense to optimize your algorithm if you are not sure that your application works correctly; as long as you don't know that, every time measurement is worthless. Okay, we were in the happy situation of having existing software. So I took out some code, a real-time-critical algorithm, which was available in two versions: a pure C/C++ implementation, and a highly optimized version using the AltiVec single-instruction-multiple-data vector unit of the PowerPC. This is a very powerful unit; depending on your application you can gain more than a factor of 4, 10, or sometimes up to 15.
The algorithm can run nearly 100% in parallel. It uses big lookup tables — big in our case means 5 to 10 megabytes, so big enough that they do not fit into the level 2 cache; we will see caching effects here. And unfortunately we have to simulate the storage of data in hardware, because the original interfaces are not available on the new hardware. So now we run the algorithm on one core, two cores, three cores, and so on, up to eight cores. You see the number of cores on the x-axis and, on the y-axis, the number of microseconds one iteration of the algorithm needs to execute. As expected, the curves go down. You have around a factor of 10 between one and eight cores — more than a factor of eight, which is the influence of pipelining and additional parts of the processor architecture that give a little extra speed-up. And if you compare, for example, four and eight cores, there's around a factor of two between them. So this is what you might expect if you say: yes, I have an eight-core system. Now the AltiVec version. We get a good performance improvement up to five cores, but then we are stuck: it's nearly the same speed for six and seven cores, and reduced performance for eight cores. Between five and eight cores we have a factor of one. So what's the reason? We have an eight-core system, but there is no speed-up anymore. What could be the reason? — Can you hear me? No? Okay, sorry. — What I showed you before was an average: the algorithm ran many iterations, and that was just the average value. Here we plot the individual iterations against the time the algorithm took. The red line is the algorithm using one core: a really stable line, always the same execution time. It's the same for two, three, and four cores — the times go down, sure, but the lines stay very flat. If you look at the same thing for five cores, it's different.
Oops — this is five cores here. You see a lot of jitter, and the jitter becomes bigger and bigger as you move to six, seven, or eight cores. We have less than two percent jitter up to four cores, but for eight cores we have outliers and a jitter of around 50 to 100 percent of the execution time. So from this picture I would say: we can use four cores — even if we still get a speed-up with five — but not more. But again: we have an eight-core system. Why can we only use four cores? If you look at /proc/cpuinfo, or look at top, your system tells you we have eight cores. If you look at the manuals from NXP, they tell you the T2080 has four physical cores and eight virtual cores. What does this mean? They call it dual-threaded cores. To understand the meaning of dual-threaded cores, you should look at the core block diagram. It's not necessary that you can read everything here; these are the typical elements of a processor core — for example, you have an instruction unit here, various queues and execution units, and so on. The interesting thing is the coloring: this is thread 0, and the yellow one is thread 1. You see that a lot of hardware elements are available twice; this is what NXP calls a dual-threaded core. But if you look here, you see gray-shaded boxes. This one, for example, is the vector unit. This means our system has one AltiVec unit per physical core. In total, the T2080 has four AltiVec units — and that explains what we saw before. So if you think about deploying your application to cores, it's necessary to understand what the hardware looks like. It makes no sense to run eight threads with AltiVec instructions on the T2080. And furthermore, you should check your OS numbering scheme. If you do this, you see eight cores here, and they are grouped.
Core 0 and core 1 are one physical core, cores 2 and 3 are the next physical core, and so on — but check whether your system uses this numbering or a different one. If so, it makes sense to run one AltiVec thread here, one here, one here, and one here: for example on cores 0, 2, 4, 6. Or, if you say "I use core 0 for the OS", then of course you use core 1 and one core from each of the remaining pairs. But there are more things to think about. This is the T2080 block diagram; what you saw before was the core block diagram, which was just this single unit here. You see you have four physical cores, and the L1 cache is here — there is one L1 cache per physical core, not per virtual core. So if you have two applications running on core 0 and core 1, they share one level 1 cache. The level 2 cache is here: one L2 cache for all cores. So you might get interference when your applications do a lot of memory accesses. Furthermore, you have a lot of I/O here: high-speed I/O, SerDes lanes, Serial RapidIO, PCI Express, 1 and 10 gigabit Ethernet, and so on. You have a two-channel DMA here, and in the middle a switching fabric — the CoreNet fabric — which interconnects everything. The details of this fabric are not public; even if you have an NDA with NXP, you don't get the information. But this is what you would need to understand whether you get interference or not. So in this case the only thing you can do is run your application and do tests — a lot of tests — trying combinations of interfaces, I/O, and so on. Additionally, you have a Queue Manager, a Buffer Manager, a Frame Manager: elements which should reduce interference effects and increase throughput. That's what the datasheet says, but I don't know how they work internally. And now, if you go to the T4240, it looks similar to before.
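The virtual-to-physical grouping mentioned above can be checked from sysfs; a small sketch using the standard topology files (the taskset line in the comment repeats the core choice discussed above and assumes a hypothetical `altivec_worker` binary):

```shell
# Print, for each logical CPU, which sibling CPUs share its physical
# core. On a dual-threaded part like the T2080, pairs such as "0-1"
# confirm that cpu0 and cpu1 are two hardware threads of one core.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    if [ -r "$cpu/topology/thread_siblings_list" ]; then
        echo "${cpu##*/}: siblings $(cat "$cpu/topology/thread_siblings_list")"
    fi
done
# With the grouping confirmed, one AltiVec-heavy thread per physical
# core could then be placed on, e.g., cores 0, 2, 4, 6:
#   taskset -c 0 ./altivec_worker &
#   taskset -c 2 ./altivec_worker &   # ...and so on
```

On kernels where `thread_siblings_list` is absent, the `core_id` files under the same topology directory give the same information.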
You have the switching fabric here, a number of I/Os, and so on. And here you have again the four-core block you saw for the T2080 — but you have three of them. So the T4240 is basically a combination of three T2080-like clusters on the same die. Another interesting thing is that you now have three level 2 caches: one L2 cache per four-core cluster. And this interference can be measured. The algorithm we had before: if you run it on core 0 and core 2 — on this core here and this core here, in the same cluster — you get this execution time. If you deploy it to core 0 and core 8, which is in the cluster behind, then you get a speed-up of around 50%. These are the effects of the shared level 2 cache. Caching is a really big and complicated topic, and I don't want to dive into the details here; I just want to give you the idea of doing some basic tests. Here we have the tool RAMspeed, which tests the speed of your memory. On the x-axis you see the block size in kilobytes — these vertical lines are the L1 and L2 cache sizes — and on the y-axis the throughput in megabytes per second, again on a logarithmic scale. If you start with one core, you get this line: high throughput within the level 1 cache, dropping for the level 2 cache, and dropping again for main memory. That's what you would expect. If you use two cores — of course not on the same physical core — you get the blue line. It's constant for level 1 and level 2, and then it drops a little earlier than the red curve, because the level 2 cache is shared between all cores. You can go to four cores — again the same, the curve drops a little earlier still. If you do the same with eight threads, you cannot deploy them to separate physical cores anymore; you have to use all virtual cores. And then you see this curve.
You see that you already have interference in the level 1 cache, because two virtual cores share one level 1 cache. Of course, you can do a lot more caching tests — you can also run the tests on the T4240 board, and so on. This is just to give you an idea of what can happen. And of course there are more things to consider; it becomes even more complicated for the T4240. You do not know, as mentioned, the internals of the CoreNet architecture, the buffer managers, or any of these interface blocks, and you have the DMA channels. So it's necessary to know how your application works: what communication it does, which I/O is needed, how much memory access it makes — try to do a reference implementation and measure it. I know a colleague did this: he wrote some bare-metal applications, below Linux, and tried every combination of interfaces to see the interference. You get a big matrix of all the interfaces, and there are interesting things: sometimes the throughput goes down, but you can't understand why, because you do not know the internal CoreNet architecture. It's interesting, but it's a lot of work, and in the end it doesn't fully help you: you have to measure it, because you will not get the information you would need to understand the internals. Maybe if you are a developer at NXP you have the information, but for me, unfortunately, it's not available. And of course there is more interference: you have three DMA channels, and so on. So my recommendation, if you want to go to multi-core hardware: first of all, you should know the functional and non-functional requirements of your system, and you should understand your interface requirements. Then take the datasheet and everything you have, and try to understand the architecture of your processor hardware.
I used this hardware here because it's a little more complicated than others, but even if you have a small ARM system, the internal processor architecture is not that simple. If you don't have software yet, try to do a reference implementation; depending on the application, that could be the algorithms or the communication behavior — you should know what your application does. Then think about how you deploy your applications, your tasks, your threads, to dedicated cores. You have seen that it's important to know the architecture and to say: okay, I deploy it to this core and not to the others. This can have a great influence on your performance. And then test your application — test that it works as expected, the functional requirements — and of course do timing measurements. And if the results are unsatisfying and you still think your hardware suits your needs, then iterate: read your requirements again — did you understand everything correctly? — and take another good look at the processor architecture. It's an iterative process. So now I'm at the end of my presentation; a short summary. We have seen that, depending on the system requirements, a vanilla Linux kernel with the PREEMPT_RT patch applied can be used on multi-core systems, even if you have real-time requirements — not out of the box, but there are a lot of configuration parameters provided by Linux to adapt the system to your needs. Of course, the system and software designers need good knowledge of the hardware architecture of the processor. And depending on your processor architecture, you should really think about the deployment: which threads and tasks should run on which cores, on neighboring or non-neighboring cores, and so on. Some applications maybe just do a little messaging; others run heavy algorithms and need high memory throughput. That's your system, and you should know what your system needs.
I hope you have seen that RT Linux on an embedded multi-core system is not magic; you don't need to be a guru to do it. And of course, the setup shown here is an example for one dedicated combination of hardware and software; it's not an exhaustive analysis. Your system may look different or completely different, and you may have different real-time requirements. I hope the presentation encourages you to try this out — it's fun. Thank you for your attention, and now we have some time for questions and answers. Yes, please — one moment. Okay, so the question was whether we have tried to run real-time tasks and non-real-time tasks on the same multi-core hardware, on different cores. No, we didn't. The attempt here was just to say: we have only real-time requirements for the application, and we try to find out how to configure the system so we can run it. Of course, there are a lot of combinations, and there are also several real-time Linux variants; this is just the one we used here, and you could use something else. So there is a lot of work one could do, but we actually only did what I showed here. Some more questions? [Audience] You showed some plots of the jitter — did you use tracing tools? — Sorry? — [Audience] Have you used ftrace, like with cyclictest, to determine why the jitter was happening? Have you tried tracing this? With tracing you can tell exactly why; you don't have to guess and say "let's try this, let's try that" — you can find out exactly what the issue is. — Okay, so the question was — where I showed how big the jitter is for the different cores — whether we used tools like ftrace to understand what's going on. Yes, we used ftrace for some measurements. I didn't show them here because the time is limited. But the idea I wanted to give you is this:
We just saw the effects, and then we had a look at the processor hardware architecture to understand them. I wanted to give you the idea that it's necessary to understand what your hardware looks like. Of course, you can use ftrace and see what's happening behind the scenes and what is generating which peak in the latency measurements. We did a lot of work there, but the time here is a little limited. [Audience] But on multi-core — that's actually what I work on — the scheduling and the communication between the schedulers can cause latencies, lock contention, and things like that. If you find out whether it's lock contention and not hardware, then it's a software issue, not a hardware issue. — Okay, so we did not distinguish between hardware and software effects. The idea was: we take Linux and the PREEMPT_RT patch out of the box; we don't want to modify the scheduler or add additional software. We just look, on a higher level, at how our application behaves and what effects arise. Of course, you can dive into the details and understand which delay is caused by software and which by hardware, and so on. That's an interesting topic. [Audience] Well, it's more than just interesting. If you find something like that in the software, please report it to the linux-rt-users mailing list, and we will investigate it and fix it — because if you're seeing it, someone else is seeing it too. That's the type of feedback we want. So if you see something like that: the linux-rt-users mailing list at vger.kernel.org. — Okay. Good. Are there more questions? No questions? Yes? [Audience] Thank you very much for your presentation, it was really useful. My question is more general. We know that chip manufacturers like to pump up numbers — you have a lot of cores because they want to market their product. For your specific application, what was the real solution to the problem? Using physical cores?
[Audience] So really, are you just exploiting four physical cores — let's talk about the smallest hardware you used — and not eight? At the end of the day, if you want to use vector processing, what is the final judgment you want to give to a chip manufacturer? — So, in the end: we had a setup of a lot of single-core processors communicating with each other, so the complete system is much more complex. The algorithm I mentioned is just one small part. There are a number of applications; some of them just monitor a little hardware, some of them communicate, and so on. I only picked out this algorithm because it is time-critical compared to the other software, and it was important to see whether we can meet our real-time requirements or not. This is an ongoing investigation at the moment. We don't have a final solution where we say: okay, we need this CPU or that CPU, the T2080 or the T4240; how many CPUs can we put on one VME board; how many boards do we need; how can we feed in all the I/O interfaces; is there enough space on the front panel or on the back side; and so on. In the end, it's still ongoing research. I wanted to give you the information about what we did up to now: it's not that easy, but you can deal with multi-core processors — though not in every case. Maybe your system looks different and you have requirements of less than 10 microseconds; then you have a real problem, and you cannot do it this way. Okay, does this answer your question? Okay, thanks. Some more questions? Okay, if not — thank you very much for being here.