Hello everyone, I am Vaishnav Achath from Texas Instruments. I would like to talk to you today about tuning device drivers to achieve real-time, deterministic performance in embedded systems using RT Linux. I work on the Texas Instruments Linux development team, primarily on Linux and U-Boot for TI devices, and I also maintain the TI platforms in Zephyr RTOS. TI has a strong history of open source collaboration, and we develop long-term sustainable products that focus on the open source ecosystem from the device architecture phase itself. I have with me Vignesh, a software engineer at Texas Instruments India. He maintains the TI AM64 SoC support in mainline, along with a few other device drivers. Also Kirti, who did a significant part of this work, is attending virtually; he is a software applications engineer at Texas Instruments India. So this is our view of the talk. RT Linux is being used more and more in embedded use cases, and customers and users expect MCU-like performance from it: traditionally, use cases where an external MCU was used are now being replaced by an application processor running RT Linux. We will first go through the problem we were facing on our DRA devices. We have multiple SPI controllers, and using SPI in DMA mode with RT Linux enabled, we were seeing poor performance in SPI target mode. SPI target mode is the mode where we are the receiving entity, that is, we receive the clock and the chip select; SPI host mode is the controller entity that initiates the transaction. In target mode, the RT Linux host has no control over when a transaction is initiated, so we need hard real-time capabilities to process it without introducing packet loss. We also saw another issue in SPI host mode, where we were continuously initiating transactions and observed latency spikes.
So almost all nominal transactions took 250 microseconds, but every few minutes some transaction spiked to five milliseconds. With this analysis we are trying to solve that problem and to make generic suggestions that others can extend to their own systems as well. So why do we care about these latencies? RT Linux is increasingly used by customers and users in embedded use cases where a traditional MCU used to be, and the expectation is MCU-like real-time performance: deterministic behavior, plus the versatility and flexibility provided by a high-level Linux OS. The demand for deterministic performance is high when interfacing with external peripherals in an embedded system, and it gets even higher when the RT Linux host is not the controlling entity for the external peripheral: the external peripheral starts the transaction, and the host has no control over when it happens, so we need real-time capabilities. The common embedded buses like CAN, SPI, and UART are the ones that require this deterministic performance, and of these, SPI is the simplest and the most popular because of its simplicity and low cost. These are some SPI use cases we came across. In industrial robotics, we saw lidars being interfaced over SPI. In this case the host processor running RT Linux is the controlling entity, the one that initiates the transaction, so we have full control over when the transaction starts; but since it is a robotics application, the system demands deterministic capture of that sensor data, which is critical for the system functionality, like taking the lidar data and performing motor control. Then we see use cases where the host processor emulates an SPI target.
And in those use cases, SPI is typically used as a device management interface: there is an external supervisor that controls the host running RT Linux, and the link is used for purposes like firmware management, or an external watchdog doing a ping-pong with the RT Linux host. So we know it is important to look into these latencies in typical embedded systems running RT Linux, and we will go through the steps we took to analyze the problem and reach our solution. The first thing to do, before going into the device drivers or analyzing an RT Linux latency issue, is to ensure that your platform can run RT Linux and to get your expectations right. So first, get your platform working with the PREEMPT_RT kernel; the RT Linux wiki has all the details for it, and this was also discussed in multiple talks in the RT Linux microconference and at ELC. The first test we run is cyclictest. It measures the delta between a thread's intended wakeup time and the time at which it actually wakes up. This is how we set our expectations and establish the scheduling overhead for future budgeting in our RT stack. This is the example command we run on one of our TI K3 platforms. This slide predates the next point, but we wanted to keep it as it is so that they do not overlap; and per Thomas' comment, that is actually fixed now, although, if we go to the next slide, the OSADL benchmarks still run with the same command. This is what we get when we plot the results: most of the thread wakeup times are between 30 and 40 microseconds, but there are still some excursions to 70 or 80 microseconds. From this plot you set your expectations and get an understanding of what your particular platform can deliver when running RT Linux.
And real-time performance is a system-level concept: we need to ensure the complete system is tuned for the particular workload. There is a tool called lmbench, which is used for DDR bandwidth and latency analysis and for identifying any bottlenecks in the hardware. All these tools have been covered in their own talks; I am just going through the generic steps you can take to ensure RT Linux runs on your system and to set your expectations. Then there is rtla timerlat, a front end to the timerlat tracer, which helps you identify the scheduling latency in your system. Once the benchmarking is done and your expectations are set, the next thing to check is your configuration. If you have, say, extensive power management configuration or extensive debug configs enabled on your RT Linux system, they will hinder your performance. So check those first, then the real-time policies; there are multiple tuning knobs available as part of RT Linux, and you can bump up the real-time priority of the application process as well. All these details are available in the RT Linux wiki. In our case, the fundamental issue was in our device drivers, and if you have a weak or slow path in your device drivers, then even if you tweak everything using all these knobs, you might not get the best performance. And it becomes more complex when more than one subsystem is interacting, like SPI in DMA mode, where you have the SPI core subsystem, the SPI device driver, the DMA engine subsystem, and the DMA controller driver.
And also, in this case, we were using user space to trigger the SPI transactions through the spidev driver, so the user-space application plays a role as well. The debugging becomes complex because you need the overall picture to understand where exactly the latency is. I just want to give a quick overview of SPI; almost everyone will know it. It is a simple synchronous embedded interface: you have the clock, the data lines, and a chip select. SPI can operate in host and target mode. The host is the entity that drives the chip select and starts the transaction, and it is also the one that provides the synchronous clock. Historically, the Linux kernel only supported SPI host (controller) mode; SPI target mode support was added later, and several SPI controllers now support target mode as well. Why do we single out SPI target mode? Because if you take UART or I2C, they have their own flow control mechanisms: UART has the RTS/CTS flow control signals, which introduce flow control between the sending entity and the receiving entity, and I2C has something similar with clock stretching, but for an SPI host/target pair there is no standard flow control mechanism at all. The next issue is that the transfer is full-duplex, so the RX and TX data are exchanged at the same time. If you are running in SPI target mode in Linux, you cannot put the response to an incoming packet in that same packet, because RX and TX happen in the same transfer; you need more than one transfer to deliver a proper response to the sending entity. These are some challenges with SPI target mode in Linux. I said earlier that we faced latency issues, and the first thing we did was analyze what happens in the SPI subsystem.
So, are there any work queues or tasks in the SPI core subsystem that could affect our performance? What happens in the SPI subsystem is that the SPI client driver pushes SPI transaction messages to the SPI message queue, which is a per-controller work queue. From there it goes to the SPI controller driver, which performs the setup, initiates things like the DMA setup, and waits for the completion events: in PIO mode that could be interrupts, and in DMA mode we also wait for the DMA completion. Then we looked at what actually happens when you perform an SPI transaction with DMA. When user space initiates the transaction through the spidev driver, it calls the SPI_IOC_MESSAGE ioctl, which goes to the SPI core subsystem and then to the controller driver, which performs the setup. It then sets up the DMA, the DMA engine configuration is called, the specific DMA controller driver is set up, and the transaction is queued in the DMA controller's queue. On the hardware side the setup is done and the transaction happens. That is the full path for a DMA-backed SPI transaction triggered from user space. Earlier I said we saw a large latency. We captured a trace using trace-cmd and analyzed it using KernelShark. We saw that the RX completed within 800 microseconds, but the TX took 850 milliseconds, which is enormous. And it was a full-duplex transfer: the RX and TX actually happened in the hardware at the same time, but we only learned of the TX completion after 850 milliseconds. This was one of the worst-case captures we had; some were around 100 milliseconds as well, but compared to the transaction size and the clock speed, this was way off our expectations. We then found that most of the delay was being introduced by the DMA controller driver.
Initially, we did not suspect the DMA controller driver, because in the SPI driver we were just queuing the DMA requests and waiting for the DMA completion; looking at the SPI driver alone, the issue was not visible. But looking at the traces, we could see that most of the time was being spent in the UDMA TX completion check work, and the SPI driver was waiting on that completion. Then we found another option: the SPI core already has a message queue for each controller, and there is a way to bump up the priority of that message queue, making the work queue real-time priority. In our case the delay was introduced only by the DMA driver, but I mention it in case someone wants to look at subsystem-level tweaking options for changing work queue priorities. Then we analyzed our SPI controller driver. There, the issue is not very evident, because we are just queuing the DMA request and waiting for the completion. So we took a look at the DMA engine and the DMA controller driver. The typical flow is: we submit the DMA request, it goes onto the issue-pending queue, then we wait for completion, and once completion happens the callback of the requesting entity, in this case the SPI controller driver, is called. So there is a two-level deferred work setup in this DMA engine driver, which is what we found, and we saw that most of the delay was in the UDMA TX completion check. What happened is that all our SoCs have something like a network-on-chip architecture with multiple DMA controllers: there is a central DMA controller, and there are small DMA controllers nearest to the peripherals.
What happened in this deferred tasklet was that we were polling the status of these peripheral DMA controllers to ensure the whole pipe was flushed before marking the operation as complete. So all the issues in our case were caused by this tasklet, these deferred work queues; in generic terms, we had a non-RT code path where we expected RT performance. Just to add: the way the DMA engine works is that once the DMA transaction is done, a tasklet gets scheduled, which then informs the client that submitted the DMA request that the transfer is done and runs its callback. That is one level of tasklet, it is per channel, it depends on the number of channels in use, and it is effectively shared: there are multiple threads, as you can imagine, all competing for the same set of resources in the system, I mean for scheduling time. And to add to that, as described, the architecture we were working on had an additional layer wherein we had to monitor multiple DMA endpoints to be able to say the transaction was done, using a combination of a hard IRQ and a threaded IRQ bottom half. So basically we had at least three threads in the system: the SPI message queue, where the data comes into the driver; the queue through which the completion is reported back to the driver; and a queue within the DMA driver itself, which also plays a role. So there are at least three or four threads competing here to get things done within the budget that we have. Sorry, go ahead. Yeah, so these are the three places that contributed to the latency graph we just mentioned.
What we did first was simply tune the priorities: make the SPI controller's message pump work queue real-time priority, make the DMA tasklet real-time and higher priority, and convert the work queue within the DMA driver to real-time priority as well. Even though this reduced the jitter and brought in some level of determinism, we still had these deferred tasklets and all the non-RT code in the path where we expected real-time performance. The best thing to do is to eliminate these non-RT paths and deferred tasklets from your code. So that is what we did to make the DMA driver's work queue real-time, and we saw how to make the SPI controller's work queue real-time as well; the SPI subsystem already provides an option for that. Then we discussed with the hardware engineers and found that our SPI controller already has a way to raise an event that gives us a local understanding of the completion, so we can use that to eliminate all the deferred tasklets and completions from the real-time path. Earlier we were relying on the DMA side, polling the small peripheral DMA controllers to learn about the completion; now we program the SPI controller to raise an event upon shifting out or shifting in a given number of words. This works for all fixed-size packets, which was always the case when we used SPI target mode. So, the new flow: earlier we set up the SPI controller, queued the DMA transaction, and waited for the DMA completion. But if you look at the DMA path, the data is in memory, the DMA controller moves it to the SPI controller, and the SPI controller shifts it out. So if the SPI controller could shift it out, we already know the data has reached the SPI controller.
So if the SPI controller could shift the data out and raise an event, then it is already understood that the DMA transaction is complete. In general, what you need to look for is whether there are deferred tasklets; the first thing would be to eliminate them, or else bump up their priority. Do you want to add something? So, just to add to that, the point here is that we are trying to minimize the amount of work done in the real-time path, and defer everything else to background tasks. The usual way a driver is written may not always be the best choice when you are working with an RT kernel. The usual way is that you wait for a signal from a subsystem to which you have submitted a request, indicating that it is done; but because of how the Linux driver subsystems interact across subsystem boundaries, there may be quite a bit of latency by the time you get the response back. The idea is that when you are creating an RT path, you should keep it as minimal as possible, so that you can achieve much more determinism in that path. That is the point we are trying to bring out: we had to partly rewrite the SPI driver to look at the signals, the indications available from the hardware at that point in time, to know whether we are ready to take up or start servicing the next request, without always waiting for the round trip from the DMA subsystem to indicate the same. You could ask whether we could entirely eliminate the other subsystems, such as DMA here, but that has its own challenges. For example, with an SPI target mode use case, the CPU just cannot be quick enough to feed data at the rate the SPI controller needs it; you have to buffer it up beforehand.
So that is one thing, and the CPU itself is also running other user-space applications to get the processing done and get the data out as well. So basically the interdependencies are still there, and you really have to make sure the path is as small as possible. Sorry, go ahead. Yeah, with all the new changes, when we rewrote the SPI driver to use its own event and not rely on the DMA callback, we saw a significant change in performance: where we previously saw the 850 millisecond delay, which was very poor, we now saw more realistic numbers, around 800 microseconds per transaction, well within our expectations. In RT Linux cases we can bound this latency to our expectations, but to increase the robustness of the system, remember we discussed earlier that there is an external entity sending messages to the RT Linux host and no standard flow control mechanism. So we used simple GPIO-based ready signaling, with which the target signals to the host that it is ready to receive the message. When we use the controller in target mode, we need the message already queued, because once the chip select is toggled by the host and the clock starts, we must start pushing the data out on the SPI lines. So we use this ready signaling from the client, the target, to indicate that it has queued the transaction and is ready. This increases robustness in an RT system, and it also makes the SPI target use case more realistically possible on non-RT systems as well. Once we checked the traces after making these changes, now that we are no longer relying on the DMA controller interrupt, we saw the completion happening within 800 microseconds, matching our realistic values.
And we saw earlier that the RX already completed at 800 microseconds; it was only on the TX path that we learned of the completion much later, due to those deferred tasklets. So this is what we changed: we eliminated the UDMA TX completion check path, and we made some changes in the DMA engine. Usually, when we perform the DMA transaction setup, we request a callback by default; we changed that to not request the callback and to rely instead on the local, controller-level understanding of completion in this use case. After making this change, we also saw an improvement in host mode. I did not say much about host mode, because in making this change for target mode, relying on the controller interrupt, we found that the root cause of the host mode latency spikes was the same deferred tasklets. Those spikes also went away, and we saw peak latency staying below 400 microseconds, which was realistic for the robotics control use cases. To summarize our learnings: the first thing to do for any system is to ensure that it can run RT workloads and to get your budgeting and expectations right on what delays you can expect from the scheduler; and keep the hierarchy of deferred work to a minimum. In our case, we had an RT task, the SPI transaction, waiting on low-priority work queues on the DMA controller side, and that was the root cause of all our problems. We should eliminate those kinds of situations: multiple levels of work queues always kill deterministic performance. And we should try to rely on the hardware state as much as possible to keep the RT path very simple; if such hardware features are not available, we should communicate that to the hardware designers as well.
These are the references we used. We also did some testing: usually with SPI, testing is done by looping back the MOSI/MISO signals and performing the test at the host; host-to-target loopback testing is not commonly done. I have a GitHub repository where we describe how we captured the traces and latencies and how we tested this on our system. Okay, thank you. We are open for questions.

Question: You said at the beginning that this SPI controller might be used without the DMA, right?

Yes, this can be used without the DMA, but the CPU usage goes up a bit, and some of our customers were not happy with that; for these particular use cases they wanted to use DMA. And as Vignesh mentioned, in target mode we cannot use it without DMA, because we need to be ready in advance with what we have to push through the SPI shift register, and the CPU cannot be that fast; the CPU also has other tasks to do. So, two things: in host mode we can always use the CPU, but customers prefer to offload to DMA and keep the CPU usage very low.

Okay, thanks, that was my second question.

Okay, I hope there are no other questions. Thank you everyone for your time. Please speak to Richard at the mic.