Can you guys hear me? OK, great. Thanks. Well, thank you all for coming today. My name is Paul Thomas, and we'll be talking today about some latency benchmarking of the Cortex-A53 processor from ARM. I'm from AMSC. We were founded in 1987 and are headquartered near Boston, in Ayer, Massachusetts, and we specialize in the design and manufacture of power systems and superconducting wire. I don't get too involved in the superconducting wire part, but it is very fascinating. As we go along, if you have questions, I know it's hard with the microphones, but feel free to raise your hand and interrupt, and I'll try to repeat the question or get you a microphone quickly. If you have a question, I'm sure other people have the same question, so just raise your hand, interrupt, and we'll talk about that topic for longer. So feel free to do that.

So today I'll go through the software and hardware setup for these benchmarks. Then we'll go through some basic latency tests using the cyclictest tool. It's a common tool for testing the baseline scheduling latency, but it doesn't involve any external interrupts. Then we'll look at some ping-pong latency tests between two separate boards going back and forth. And finally, we'll look at a real-world ADC interrupt latency. It's an analog-to-digital converter, but ultimately the interrupt is a DMA-complete interrupt; we'll take a look at that later on.

So why real-time Linux? A stable and supported code base, we all know that. That's why we're here. Deep APIs: in a real-time context you're often trading off between the breadth and depth of the APIs and the real-time performance. With PREEMPT_RT you still have good latency performance, but you also have all the Linux APIs that we know and love within one system. I think some of you will be at the Real-Time Summit on Thursday, where we'll get an update on the real-time Linux collaborative project, which ultimately aims to mainline PREEMPT_RT. I'm not involved in that effort, but I am looking at the real-time performance of it.

So the Cortex-A53, why is that an important part? We're interested in it because the Zynq UltraScale+ parts use that ARM core. It's a 64-bit ARMv8 core. And there are lots of other parts based on it, including the Raspberry Pi 3; I just listed a couple here. NXP, which used to be Freescale, has the i.MX 8, and there's the ODROID-C2. It's a very common ARMv8 core in the embedded marketplace. You're probably not going to see too many Cortex-A53s in your phones these days, but maybe a few of the older phones would have them.

So the hardware setup. This is comparing two different boards, one based on the ARM Cortex-A9 and the other based on the ARM Cortex-A53. The actual module we're looking at is the Enclustra Mercury XU5 module, based on the Xilinx Zynq UltraScale+ SoC. The Cortex-A53 has an 8-stage pipeline and a 1.3 gigahertz clock speed. Contrast that with the Cortex-A9, which is several years older. The board for that test is the ZedBoard, from Avnet and others, and the Cortex-A9 is the core used in the original Zynq parts from Xilinx. So this is essentially the evolution of the Xilinx parts, and it's interesting to look at the performance there. One thing to note when we get to the UDP ping-pong test: the ZedBoard does not have a second ethernet port, so a dedicated port couldn't be used for those tests.
The UDP traffic was just shared with the same ethernet port that was used for SSH and the rest of the system, so it had other traffic on it besides the UDP traffic. Also, the Cortex-A9 has a 10-plus stage pipeline and runs at 666 megahertz, so you can see just from the specs that the Cortex-A53 should be a much higher performance part.

The kernel. So the starting point for both of these was 4.18. The PREEMPT_RT patch came out shortly after that release, so it was the very first one, the rt1 patch, that was applied. You can see the URL there. For the ZynqMP, the kernel also had the firmware and clock driver patch applied. Xilinx is actively trying to get all the new features upstreamed, and some of those features are simply needed to run, so I had to pull in that patch. Beyond that, there were a few more hurdles to get the latest kernel running on the ZynqMP. You had to have the low-level firmware up to date: the first-stage bootloader, the PMU firmware, and the ARM Trusted Firmware all had to be current, so there was a little bit of work getting all of that in order, and then a few more things that the Xilinx guys helped with a little bit. But as those patches are mainlined and the regular software releases from Xilinx come out, that should become a seamless process in the future.

OK, so here's the first result. This is the basic cyclictest result. You can see the Cortex-A9 had a maximum latency of 54 microseconds, and the Cortex-A53 had a maximum latency of 17 microseconds. Normally in real-time latency testing we're just concerned with the maximum, but I put the mode on here as a point of reference so you can see where the peaks are. Those peaks come in at 19 and 7 microseconds, respectively. This is very similar to results I got with a kernel several years ago for the Cortex-A9, but these are both with the 4.18 kernel. And you can see the Cortex-A53 is about three times faster.

So let's talk about the setup a little bit. For the setup, cpusets were used; that's a way to shield specific cores. The kernel portion of that is cpusets, and the user-space management is via cset, which is a Python application. It's an effective way to shield one or more cores from scheduling ordinary tasks. If you have a real-time system, the system still has to do all of its normal things: it has to have its SSH server, it has to do all of its housekeeping. But if you want to run real-time tasks as well, you may want to dedicate one or more cores to them and ensure that the scheduler isn't scheduling other things on those cores. You can use these cpusets for that.

So here's the test configuration. We basically set it up in a shielded way: CPU 3 was the user set, and CPUs 0, 1, and 2 were the system set. The Cortex-A9 is only a dual core, so its system set was just one core, so just one less there. And then the actual loading: the system set had a cyclictest running with a priority of 98 and a stress program. The stress program had eight CPU hogs and eight virtual memory threads, so those are trying to provide memory-intensive and CPU-intensive stress.
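To make the shielding and priority setup a bit more concrete, here is a minimal C sketch of what the measured real-time thread boils down to: pin itself to the shielded core and request a SCHED_FIFO priority. cyclictest does the equivalent internally through its affinity and priority options; the core number and priority here simply mirror the configuration described above, and the actual measurement loop is omitted.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = 99 };

    /* Run only on CPU 3, the core shielded by the cpuset. */
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* Highest real-time priority, like the measured cyclictest thread. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return EXIT_FAILURE;
    }

    /* ... periodic clock_nanosleep() loop and latency bookkeeping ... */
    return EXIT_SUCCESS;
}
```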
The user set you could set up different ways, depending on what the real-world load is. In this case, I only gave it a loading of a lower-priority cyclictest, so it had a cyclictest at priority 98, and then the actual test itself: the results in the graph are from the cyclictest with a priority of 99, running on the shielded CPU core. You can see the actual stress command there.

Then the UDP ping-pong test. This was run two different ways. One run was from a Cortex-A53 to another Cortex-A53, and that's the setup you see here; you can see the ethernet cable that connects them. That was a separate ethernet port from where all the SSH traffic was, so it only carried the UDP traffic of this ping-pong test, going board to board between the Cortex-A53s. For the ZedBoard I don't show that setup here, but it's just with a single ethernet cable. I ended up not using the cpusets because they were affecting the performance a little bit. That's something worth investigating; I wonder if there were kthreads from the networking stack that were affected by how the isolation was working. I'm not exactly sure, but I pulled out the cpuset stuff, while the stress loading and the cyclictest loading stayed the same. The processors were loaded and there was a cyclictest running, at a priority of 97 in this case, because the ethernet IRQ was given a priority of 99, the ping-pong threads were given a priority of 98, and the cyclictest, just for background loading, was given a priority of 97.

So you can see the performance here; the Cortex-A53 is still reasonably fast. That 168 microseconds is the round trip; if you want to think of the one-way path, it's roughly half of that, 84. I left it as the round trip because for the blue one here, one leg is the Cortex-A53 and the other is the Cortex-A9, so it's not really half of that. You could break it up like that, but I just left the raw full round trip for both tests. This is quite a bit longer than the cyclictest results, with a one-way maximum of around 84 there, but still very respectable. The Cortex-A9, you can see, has these stragglers, and I didn't really have time to investigate the cause. They go out above 800 microseconds; 800 microseconds was just the largest bin I used for the latency capture. I wonder if there isn't something going wrong for it to continually go out that high. That could be investigated, because the cyclictest results by themselves, even though they were higher than the A53's, were still bounded; you could still see a clear maximum at the top there.
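For reference, here is a rough sketch of what the "ping" side of a UDP ping-pong latency test looks like: send a small datagram, block until the peer echoes it back, and record the round-trip time. The peer address, port, packet size, and iteration count are illustrative, not the actual test parameters, and the real test also ran under the SCHED_FIFO priorities described above.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer = { 0 };
    char buf[64] = "ping";
    struct timespec t0, t1;

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                        /* example port */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr);  /* example peer */

    for (int i = 0; i < 100000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        sendto(fd, buf, sizeof(buf), 0,
               (struct sockaddr *)&peer, sizeof(peer));
        recv(fd, buf, sizeof(buf), 0);                  /* peer echoes it back */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* Round-trip time in microseconds for this iteration. */
        long rtt_us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                      (t1.tv_nsec - t0.tv_nsec) / 1000L;
        /* ... add rtt_us to a histogram, track the maximum ... */
        (void)rtt_us;
    }
    close(fd);
    return 0;
}
```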
So, a real-world test, or almost a real-world test. I'll show you the setup in a minute, but basically it's an analog-to-digital driver using the industrial IO subsystem of the kernel. I think there's a talk later today from the maintainer of that, so check that out if you're interested in the industrial IO subsystem. The test here is DMA-engine based, and the performance is captured using a hardware timer; I'll go into how we actually capture the latency performance in a minute, it's a little bit tricky.

Just a quick blurb on industrial IO. How many people have used industrial IO in the kernel? Okay, so maybe a third or a quarter. It's an extremely useful subsystem. There are many pre-existing drivers for analog-to-digital converters, digital-to-analog converters, accelerometers, a whole slew of different devices. It can provide text-formatted measurements, just a sysfs file where you can see the measurement, or there's a ring-buffered version where you can read the raw data from a C program. If you're dealing with a lot of data, you don't really want to go to a text string and then back to an integer or a float.

Okay, so here's the test setup. The ZynqMP parts, or MPSoC parts, have both programmable logic, an FPGA portion, and the processing system. You have both of these pieces, and you can do all sorts of things with the programmable logic, but one thing you can do is decouple the very fast, low-level housekeeping of, in this case, an SPI bus, or actually a whole bunch of SPI buses. In this setup I didn't have external hardware, so I was just using simulated hardware in the programmable logic. You have a simulated A-to-D converter, and that's connected over SPI to this controller, the A-to-D controller. The A-to-D controller takes the SPI data and spits out an AXI stream, and that AXI stream connects straight to the DMA block. This DMA block has a corresponding DMA engine driver in the Linux kernel; Xilinx has a DMA engine driver that connects directly to this DMA block in the programmable logic. That is the main data path; there's also a configuration section that's not shown here. When the DMA is complete, there is an interrupt that goes to the processing system, and I just paralleled that signal into a timer capture block. This timer capture block has a free-running timer, and when it sees that interrupt's rising edge, it saves off the free-running timer value for future use. I'll go into that more in the next few slides, but that's how you can use a hardware block to capture the latency of an external event. Even though this is all within the chip, it's all external to Linux; Linux doesn't really know anything about the programmable logic, so it's all an external system.

Okay, so what's the latency there? The maximum latency was 30 microseconds. Not quite as good as the cyclictest one, which was 17 microseconds or down in that range, but still very good. Let me go into more detail on that timer capture function. Hardware timers with a capture function are common in SoCs and microcontrollers. Often a timer will be used for something like a PWM output, but many times the same timer block in a microcontroller can also be used for this capture function, essentially time-stamping an external event very accurately. Upon the trigger event, the present value of the free-running timer is stored in the load register. Every microcontroller vendor implements the peripherals a tiny bit differently, so you have to look at how your specific timer is implemented, and then you can use it for a similar function. Then, on the kernel side, the DMA engine provides a completion callback.
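As a sketch of that kernel-side path, this is roughly how a driver registers a DMA-complete callback with the kernel dmaengine API. The driver state structure, channel, buffer address, and length here are placeholders, not the actual IIO driver code, and error handling is trimmed for brevity.

```c
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

struct my_adc_state {              /* hypothetical driver state */
	struct dma_chan *chan;     /* DMA channel requested at probe time */
	dma_addr_t dma_addr;       /* DMA-mapped sample buffer */
	size_t len;                /* transfer length in bytes */
};

/* Runs in the DMA-complete path; this is where the latency was measured. */
static void adc_dma_done(void *data)
{
	struct my_adc_state *st = data;

	/* Push the freshly DMA'd samples into the IIO buffer, re-arm, etc. */
	(void)st;
}

static int adc_start_dma(struct my_adc_state *st)
{
	struct dma_async_tx_descriptor *desc;

	desc = dmaengine_prep_slave_single(st->chan, st->dma_addr, st->len,
					   DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
	if (!desc)
		return -ENOMEM;

	/* Register the completion callback before submitting. */
	desc->callback = adc_dma_done;
	desc->callback_param = st;

	dmaengine_submit(desc);
	dma_async_issue_pending(st->chan);
	return 0;
}
```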
So in the IIO driver, the industrial IO driver, when you set up the DMA, you register the callback. The latency we're measuring is from when the hardware interrupt happened to when we're actually in the DMA callback within the IIO driver, so all the DMA handling before that callback has already been taken care of. You could push it upstream a little and try to measure the latency from when it very first enters the kernel, but this test was for when you're in the callback in the driver. Are there any questions on how that hardware capture works? Okay.
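To make the timer-capture readout concrete, here is an illustrative sketch of what the latency computation inside that callback could look like. The register offsets, the 100 MHz timer clock, and the function name are assumptions for the sake of the example, not the actual programmable-logic design.

```c
#include <linux/io.h>
#include <linux/types.h>

#define CAPTURE_REG   0x00        /* counter value latched on the IRQ edge (assumed offset) */
#define COUNTER_REG   0x04        /* free-running counter, still counting (assumed offset) */
#define TIMER_HZ      100000000UL /* assumed 100 MHz timer clock */

static u32 adc_read_latency_us(void __iomem *timer_base)
{
	u32 captured = ioread32(timer_base + CAPTURE_REG);
	u32 now      = ioread32(timer_base + COUNTER_REG);

	/*
	 * Ticks elapsed between the hardware IRQ and this point in the
	 * callback; unsigned arithmetic handles a counter wrap naturally.
	 */
	u32 ticks = now - captured;

	return ticks / (TIMER_HZ / 1000000UL);
}
```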
Conclusions. The Cortex-A53 is a very low-latency core. Using the programmable logic to decouple the SPI bus is very effective. Often, when you have an A-to-D converter like this, you have several different things going on. You can still do a DMA transfer, and the work the kernel is doing is still relatively minimal, but often, if you have an external analog-to-digital converter chip, it will have a signal that says the conversion is done. You could just use that conversion-done signal: say you're using the IIO driver, you get that interrupt within the IIO driver, you say, okay, the conversion is done, go grab the transfer, and then you issue the DMA command. For the SPI bus or some other buses you can still use DMA, and when that DMA is ultimately finished you get that interrupt within the driver and push the data to the ring buffers. But you're dealing with at least two interrupts there, when what you really want is a single interrupt once everything is transferred. So you can parallel a whole bunch of different converters, have the programmable logic deal with all the low-level stuff, and just notify Linux when everything is done and transferred to memory. It's very effective for keeping latencies minimal.

Future work. Investigate the UDP path latencies; those seem very high, and I'm guessing that has a lot to do with how much processing the TCP/IP stack within the kernel is doing, so maybe there are ways to either bypass that stack or speed it up. I think we heard Jonathan talk about AF_XDP on Monday, so that's one option. Also, investigate the difference between the cyclictest and ADC driver results. Those are slightly different interrupt paths, but they had different maximum times, so it might be interesting to dig a little deeper and see what's causing those differences. And track PREEMPT_RT across kernel versions. This is only one snapshot in time, and I think it would be interesting, for a specific setup, to track that as the kernel versions move along.

Just a quick thanks: Rajan Vaja from Xilinx, who helped us get up and running with the 4.18 kernel; Enclustra for the boards; and I want to thank my wife and family. So that's all I have. Are there questions?

So you've been working with a mainline kernel for your test. Can you recommend using mainline now for the Zynq UltraScale+ MPSoC? Because I'm always struggling: once in a while I try mainline, but then I go back to the Xilinx tree, which I really don't like. You seem to have been successful. Yeah, so I was successful getting the 4.18 kernel to work. It was a little bit painful at times, and I think without that extra help, I'm not sure I would have gotten there. They are actively incorporating those patches. The biggest one is that clock and firmware one; I haven't checked in the last week or two, but once that gets incorporated into mainline, that's the biggest one. The other big thing is that all the low-level firmware pieces have to be up to date; everything has to be at the Xilinx 2018.2 release to run. I would be really interested to build it like you did. If you email me, I can provide you the kernel I'm using; besides those two patches, the modifications are very subtle, there's not much going on. But if there's still a gap in time, say the next three or six months, when it's helpful to have that specific kernel out there, I can provide it to you or push it to GitHub or something. Okay, that's cool. Thanks.

Did you at any point run your ping-pong test on the A53 using a single ethernet port? Because I'm wondering if the long tail of packets could just be that you'd get caught behind a few 1500-byte frames, and that will add hundreds of microseconds straight away. No, I did not run it with a dedicated ethernet port. But I don't think I was running anything that would have put large packets on the ethernet; it was just SSH traffic, just terminal traffic, a few bytes at a time. So there wouldn't have been any large packets causing issues. It definitely could be something there, though not congestion, because it was a tiny fraction of the bandwidth, but there could be some dependency there for the ZedBoard. It might be an interesting test to run the A53 on one port and see if you get that same spread on the tail. Yeah, I did do that a little bit, and the performance wasn't as good, but it didn't have those unbounded tails.

So for the tests you did with the ZedBoard with the single ethernet port: when you showed the pictures with the A53, you were using a dedicated cable between the two secondary ethernet ports. Yes. For the test with the A9, did you run that directly to another device and SSH from that device in addition to running the test from it, or were you running it through some sort of hub or switch? Let me bring that up. What was on the network that had the A53? Just that and the device it was round-tripping with, or also some laptop you were SSHing from? So when it was the A53 to the A9, all three ports were plugged into a local gigabit switch that also had other stuff on it. It would be interesting to try it directly from the A9 to the A53 without any switch involved; you could always SSH from the A53 to the A9 in order to do the test, or just set up an automated test that doesn't require any activity. Yeah, there are several other ways to do that. Another one is I have USB ethernet ports. I mean, that would introduce more confounding factors; I was wondering whether you could have fewer confounding factors by just having the one ethernet link for the test. Yeah, and the main reason I didn't do that is because the boot on all of these was tied to a TFTP server. I see. The TFTP server needs to be reachable so the boards can grab their boot files, so that's why the setup was like that.

For the cyclictest measurements, you said you used two instances of the tool with different priorities.
Can you please explain the reason for that, or whether it's a usual way to do it? I just wanted a lot of interrupt traffic, and the timer interrupt traffic at a lower priority seemed like a good way to mimic a lot of stuff going on. So the cyclictest at priority 98 is just acting as a kind of stress tool in that case. The one at 98? Yeah, yeah, that one is basically another stressor for the processor. And another question: do you perform one long measurement with a lot of measurement points, or have you done short measurements, restarted multiple times, and accumulated the results? This was just one long measurement. And if you repeat that measurement, does it depend on the starting conditions, or have you only tried one long measurement? I've done it several times; it doesn't vary much. Okay. I mean, it's fun to watch, hold on, let me get that chart up here. Most of the time you don't get the highest point right away, so you can see the histogram fill in, and then it will bump up to the next microsecond bin, and then you run for another half hour and it bumps up to the next bin. But by the end it's not moving much; the maximum isn't moving at all. Okay, thanks.