Hello all, hope everyone is doing well. My name is Shraddha and I am currently working in the system software team of the Samsung Foundry Business at Samsung Semiconductor India R&D. On behalf of my teammates, Padmanabhan and Pankaj, I welcome you all to this presentation, titled "Non-Invasive PCIe RAS DES Framework for Automotive SoCs". This presentation introduces the need for a software framework to detect and recover from any kind of PCIe-related error which can otherwise lead to an unstable system.

So let us quickly move to the agenda. First comes the introduction section where, as the name suggests, we introduce the presentation and explain the motivation behind it. Next, we cover how the RAS features are integrated into the PCIe base specification to provide link-level reliability. Then we take the example of the DesignWare PCIe controller and talk about how its DES features help in extending and enhancing the RAS features so that errors can be quickly detected, reported and corrected. Next comes the very important part, which explains what really is the need for a software framework and why what we already have isn't enough. Then I will explain where exactly this framework fits in the Linux PCI subsystem and how the implemented debugfs structure looks in user space. We will then move on to the different use cases, where I will show you experiments and results, and how sniffing hardware data and passing it up to the application level helped us in easier and faster debugging.

So let's begin. As you all know, PCIe is the industry-standard I/O interconnect. It is a high-performance, complex serial interface providing I/O connectivity across multiple platforms like mobile, desktop and automotive. Transistors are getting smaller and smaller, and SoCs are becoming subject to a higher risk of failures due to external conditions like temperature, EMI and power surges. The transition to faster PCIe speeds like PCIe 5.0 and now 6.0 is increasing the risk of errors, due to tighter timing budgets and shorter channel lengths inside the SoC, and electrical issues like attenuation and jitter outside the SoC. As PCI Express is further deployed into mission-critical applications in the automotive, AI and enterprise markets, the need for a higher level of reliability, availability and serviceability is increasing. Due to this, SoC designers and users are looking for advanced mechanisms to help in creating more reliable, stable and robust products. Some PCIe vendors and IPs do provide such features, but unfortunately they are not being used to their full potential due to the absence of a software system to support them. The framework implementation that we are going to present will decrease the debugging time for these fatal issues and provide mechanisms to detect, recover from and prevent potential hazards, without the use of any expensive hardware-based PCIe analyzers.

Before diving into the details of the developed framework, let us understand in a nutshell how reliability, availability and serviceability features are available in PCIe and how they make sure that underlying hardware failures do not cause interruptions in the overall system operation. Let us begin with reliability. Features in the PCIe protocol like parity or ECC data protection, and mechanisms like link CRC (LCRC) and end-to-end CRC (ECRC) error detection, help in ensuring data integrity.
The acknowledge/not-acknowledge, the ACK/NAK mechanism, handles seamless retransmission of errored packets and includes timeouts to ensure broken links do not go unnoticed. Basically, any mechanism that allows the PCIe interface to be more tolerant of external or internal conditions is considered a reliability feature, and as we just understood, PCIe has many of them.

Next is availability. Availability provides active information on whether the system is working as expected over a certain period of time. The implementation of event counters in the system, which can point out the number of times the system has entered an error state or any particular LTSSM state, is a good example of ensuring availability.

Serviceability describes how a device can recover from a fatal error without disturbing the operation of the system. For example, hot plug is a feature of PCIe which ensures that we can restart a PCIe interface by unplugging and replugging the device without affecting the running system. Error injection can be really important to analyze the PCIe response to errors and make sure that serviceability is high. Before putting any product into production, if we do enough error injection and check how the system behaves or responds to those errors, it is going to be a real benefit. Quick and easy identification of PCIe runtime issues or bugs is also considered a serviceability feature, which this framework can provide.

So now we are going to take the example of the Synopsys DesignWare controller, and over the next three slides I am going to explain how much data can be captured, analyzed and used in PCIe controllers to enhance the reliability, availability and serviceability mechanisms. DES stands for Debug, Error injection and Statistics. The debug feature that we see here, the D of DES, provides the status of the current PCIe system. In case of any failure, these status registers can be read, and the underlying software can understand where exactly the failure happened and implement the respective fix. The different kinds of debug features include the data link layer up and down indicator, which tells us when the link suddenly goes down; this is really important for understanding the reliability of the link. There are also the AXI slave and non-DBI transfer pending status, the DMA transfer pending status, and the receive request pending status. There is a pulse that indicates that the receive queues have overflowed. This kind of data can be read from the debug registers. There is also the state of the current link: what the LTSSM state is, which LTSSM state the RC or the EP is currently in, and the EQ status, that is, what kind of equalization has happened so far and which phases of equalization the link has gone through. All of this, plus details of the entry and exit latencies of the different ASPM link states: entering and exiting the ASPM low-power link states requires some amount of latency, and the question is what those latencies actually were. This data can be used to analyze and further retune the PHY parameters to make the system more reliable and better. All of these are debug features which can be used to understand the current system in detail and debug in a much faster and more efficient way.

Next we move on to the E, which is error injection. The DesignWare PCIe controller provides support for the application to inject the following errors.
CRC errors, sequence number errors, DLLP errors, sync header errors, FC credit update errors, TLP duplicate errors and specific TLP errors. With the help of such error injection mechanisms, we can inject common errors like ECRC, LCRC, unsupported request and completer abort, and we can check the error detection, error logging and error handling mechanisms. So basically, any sort of error injection tests the system on how fast and how efficiently the error is detected, how properly the error is logged, and how well the error is handled. Validating the system's error response mechanisms at an early stage is going to help reduce risk and increase the robustness of the system.

Next we move on to the S, which is statistics. One, the statistics feature keeps track of the various event counters throughout the system, and two, it keeps track of the time the controller spends in each of these various events. The following counters for time-based analysis are present: the time PCIe spends in each LTSSM state, for example the time it spends in L0, which is the normal working state of PCIe, the time it spends in L0s (Tx/Rx), which is the first power-saving state, and the time it spends in L1 and its substates, which have further, stricter power-saving mechanisms. It also measures the time the link spends in Recovery or in Configuration. This feature also captures the amount of data processed, by measuring the PCIe Tx TLP header plus TLP payload, and it does the same for PCIe Rx TLPs. So basically these debug counters help us understand the performance of the system by measuring the percentage of time the controller spends in each LTSSM state, and also obtain information about the Tx and Rx data throughput of PCIe in the system.

Next are the statistics event counters, which are grouped into different kinds of events. The first group is error events: statistics stores the different kinds of errors happening in the system. These include errors like buffer overflow, buffer underrun, decode errors, disparity errors, receiver errors, FTS timeout errors, framing errors, bad TLP errors, ECRC errors; basically any error that you can think of. The second group is state transition events. These count the entries and exits for each LTSSM state, which means the number of times the PCIe link entered L0, L0s, L1 and so on. They also include the speed change and width change counters, that is, the number of times the link downgraded or upgraded its speed or its width. The third group is response events, which records the number of times a Tx/Rx ACK DLLP was sent, a nullified TLP was sent, a NAK DLLP was sent, and so on. Finally there are transaction events, which count the number of Tx/Rx memory reads and writes, Tx/Rx config reads and writes, I/O reads and writes, etc. that PCIe is doing. So as we can see, the controller keeps track of almost all events happening in the PCIe system, and these counters can be read by the application at any time to help in better debugging, understanding and maintenance of the system.

Now we move to the need for the proposed RAS DES framework. The Linux kernel being used by the application has generic frameworks to log some of the RAS or AER events. We have all heard of the AER driver present in the current Linux PCI subsystem, right? Errors are reported on the RC, but first, there is no support for errors reported on the EP side, and second, a lot of the errors that are part of this RAS DES feature are not part of AER handling.
Third, features like error injection and statistical monitoring are not present. Another commonly used user-space utility is rasdaemon. It monitors the RAS events from the kernel trace log and reports them to the system log. But since PCIe IPs are supplied by multiple vendors, the usage of the extended capabilities for PCIe RAS, or of the more advanced RAS features, turns out to be cumbersome and not so useful. In short, the generic RAS features in Linux are not capable of providing complete debug and statistical information. Our framework is going to dig out this important data and bring it up to the application level in an easy and readable form.

SoC designers are looking for mechanisms that provide better visibility into the PCIe interface behavior and better control over its operation. For example, implementing programmable timers and timeouts inside the PCIe interface logic, as well as mechanisms for generating errors without side effects, will improve the reliability of the system. The RAS features give us a way, an opportunity, to tweak the timeouts for various events. Also, having dedicated monitoring, status and control interfaces for each PCIe interface functional layer, like the PHY layer or the MAC layer, will allow SoC designers to flag specific events and errors and improve the overall serviceability and availability of the system.

Currently, most of the protocol errors are captured and interpreted into a usable, readable form by hardware PCIe analyzers. These analyzers are connected between the root complex and the endpoint of the PCIe subsystem and capture the PCIe traffic. But some applications use PCIe chip-to-chip communication, where the communication is achieved through PCB tracks, so there are no external hardware connections. In these cases, connecting a hardware PCIe analyzer turns out to be impossible. Moreover, hardware analyzers are really expensive and hard to maintain. Our framework will, to some extent, replace these hardware analyzers, which will be a lifesaver.

So let us now move on to the proposed framework implementation. This figure shows the software stack of the RC and of the EP. As you can see, the existing structure goes down from the virtual file system, the VFS, in user space, down to the PCIe PHY, via which the PCIe transactions finally take place. In this flow we have added two modules that will help us achieve all that we have spoken about in the previous slides. The PCIe RAS framework will sit alongside the other PCIe device drivers. This driver basically plucks out the data by reading hardware registers and passes this information to user space, where a debugfs structure displays the data in an understandable form. Some RAS features also allow us to tweak hardware parameters for debugging; for example, we can increase the ASPM entry or exit latencies, or hold the LTSSM in a particular state. In this case debugfs accepts input from the user and passes it down to the RAS framework, which finally takes care of applying it in the hardware.

So what is debugfs, and why did we use it? Debugfs, as we all know, is a special file system available in the Linux kernel. It is a simple-to-use, RAM-based file system specially designed for debugging purposes. It exists as a simple way for kernel developers to make information available to user space. Unlike procfs, which is only meant for information about processes, or sysfs, which has a strict one-value-per-file rule, debugfs has no rules at all. Since there is no one-value-per-file rule, you can even dump a full buffer into a debugfs entry; developers can put any information they want there. To compile a Linux kernel with the debugfs facility, the CONFIG_DEBUG_FS option must be set to yes. Once we boot the system, it is typically mounted at /sys/kernel/debug. A lot of interfaces like the clock framework, Ethernet and GPIO use debugfs in some form or the other.
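To make that concrete, here is a minimal, generic sketch of how any kernel driver can expose a value through debugfs. It is not taken from the RAS DES patches; the module name, directory name and the exported counter are purely illustrative.

```c
#include <linux/debugfs.h>
#include <linux/module.h>

/* Hypothetical value a driver might want to expose for debugging. */
static u32 example_event_count;
static struct dentry *example_dir;

static int __init example_debugfs_init(void)
{
	/* Creates /sys/kernel/debug/example_driver/ ... */
	example_dir = debugfs_create_dir("example_driver", NULL);

	/* ... and a read-only file "event_count" that user space can cat. */
	debugfs_create_u32("event_count", 0444, example_dir,
			   &example_event_count);
	return 0;
}

static void __exit example_debugfs_exit(void)
{
	debugfs_remove_recursive(example_dir);
}

module_init(example_debugfs_init);
module_exit(example_debugfs_exit);
MODULE_DESCRIPTION("Minimal debugfs illustration");
MODULE_LICENSE("GPL");
```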
So this is how we have split our debugfs entries into multiple files and organized them in a hierarchical directory structure. The goal was to make it as simple as possible for first-time users to understand. We move down to /sys/kernel/debug; this is where our debugfs entries are mounted. Suppose we have five PCIe controllers in the system. Each of these controllers gets a separate directory in the debugfs folder. Depending on which controller's RAS data we want to read or understand, we can enter the respective directory. Inside each controller directory there are three subfolders, one each for debug, error injection and statistics, as you can see here.

Inside the debug folder there are a number of files, each holding a specific debug value that the user or developer can read to explore the current state of the PCIe system. Suppose we want to use the debug feature to know whether the RC has detected the EP's lanes. For this there is a debug value called lane detection. If we cat this file, we will see the value 0x0 for lane not detected and 0x1 for lane detected. Now suppose there are four lanes in the PCIe connection. We can write to the file to indicate which lane's value we are looking for, because a feature like lane detection works per lane. In this case, the lane we want the information for should be echoed onto this particular file; so echoing 1 to this file means the detection status of lane 1 will be read back when the cat is done.

Similarly, inside error injection there are a number of files, each corresponding to a different error that can be injected. Suppose we want to inject the Tx LCRC error. Here, writing to the file tells the driver how many LCRC errors we want to inject. If we write 5 here, the next 5 PCIe transactions that are initiated will have an LCRC error injected into them. Reading back this file indicates how many errors are left to be injected; so after making two transactions, a cat on this particular file will show the value 3, meaning 3 error injections are left.

Now let's move to the statistics directory. Instead of directly having files to read, we have sub-directories for each type of event that we saw in the previous slide; the counter can be an error count or an entry count, for example. Within each event sub-directory, three files are seen. The first is the counter enable file, which, as the name suggests, is used to enable or disable the counter: writing 0 disables the counter and writing 1 enables it. The next file is the counter value, and this is exactly what we need: reading this file gives the count of the number of times the specific event has occurred. The third file is the lane select. This file only comes into the picture when the event is lane-specific. Take a receiver error, for instance: this is a lane-specific event, and for each of the lanes present in the PCIe subsystem the error count can be different. In that case we need to specify the lane that we want to read the counter for.
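As a rough illustration of how such a hierarchy could be built, here is a hedged sketch using the standard debugfs helpers. The directory and file names (rasdes_debug, rasdes_err_inj, rasdes_event_counters, lane_detection and so on) and the file_operations referenced are placeholders for illustration, not necessarily the names used in the actual patches.

```c
#include <linux/debugfs.h>
#include <linux/device.h>

/* Read/write handlers; representative implementations are sketched in the
 * use-case sections that follow. */
extern const struct file_operations lane_detect_fops;
extern const struct file_operations err_inj_fops;
extern const struct file_operations counter_fops;

static void rasdes_debugfs_init(struct device *dev, void *drvdata)
{
	struct dentry *root, *debug, *err_inj, *stats, *event;

	/* /sys/kernel/debug/<controller>/ -- one directory per controller. */
	root = debugfs_create_dir(dev_name(dev), NULL);

	/* One sub-directory each for Debug, Error injection and Statistics. */
	debug   = debugfs_create_dir("rasdes_debug", root);
	err_inj = debugfs_create_dir("rasdes_err_inj", root);
	stats   = debugfs_create_dir("rasdes_event_counters", root);

	/* Debug: one file per debug item, e.g. the lane detection status. */
	debugfs_create_file("lane_detection", 0644, debug, drvdata,
			    &lane_detect_fops);

	/* Error injection: one file per injectable error. */
	debugfs_create_file("tx_lcrc", 0644, err_inj, drvdata, &err_inj_fops);

	/* Statistics: one sub-directory per event, each with three files. */
	event = debugfs_create_dir("l1_entry", stats);
	debugfs_create_file("counter_enable", 0644, event, drvdata,
			    &counter_fops);
	debugfs_create_file("counter_value", 0444, event, drvdata,
			    &counter_fops);
	debugfs_create_file("lane_select", 0644, event, drvdata,
			    &counter_fops);
}
```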
Now we are going to move on to one use case each from D, E and S, that is, the debug, error injection and statistics features, where I will first show you a glimpse of the implementation and then narrate an experiment that we did with the feature and how it helped us in our daily debug game. Let me clarify that the current implementation is done keeping the DesignWare controller in mind, since the RAS DES features implemented by different vendors are vendor-specific. In the future, these features could become part of the PCIe protocol, like AER is today, and then all PCIe vendors could implement the hooks that the generic framework provides. Currently it is written only for the DesignWare PCIe controller, but it can easily be ported to other vendors as well. Our patches are currently under review for the mainline kernel, and I have provided the link for the same.

Now let's move to the implementation. This is how a basic debugfs read or write function is written. The read function reads data from the hardware registers and copies it from a kernel buffer to the user buffer; as we can see, a simple read-from-buffer helper does this copying of data. The write function copies data from the user buffer and does the respective programming of the hardware registers. So here you can see that we are reading the SD lane status register. It provides information like whether RX is valid, whether the lane is detected, and whether the Tx and Rx lanes are in electrical idle. Other debug registers provide options to disable the transition to Recovery in case of a framing error, to hold the LTSSM in a particular state manually, or to jump to a particular LTSSM state directly by skipping some of the states. They can also dump the internal PM state of the master as well as of the slave interface. All of these debug options are extracted from registers and displayed in different files, as shown in the previous slide.
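Here is a hedged sketch of what such a read/write pair might look like for the lane detection file. The register offset, bit definitions and the rasdes_priv structure are illustrative assumptions, not the actual Synopsys register map or the code under review.

```c
#include <linux/bits.h>
#include <linux/debugfs.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/uaccess.h>

/* Illustrative register layout, not the real RAS DES map. */
#define RASDES_SD_LANE_STATUS	0x1000
#define LANE_DETECT		BIT(0)

struct rasdes_priv {
	void __iomem *base;	/* RAS DES register space */
	u32 lane;		/* lane selected by a previous write */
};

static ssize_t lane_detect_read(struct file *file, char __user *ubuf,
				size_t count, loff_t *ppos)
{
	struct rasdes_priv *priv = file->private_data;
	char buf[16];
	u32 val;
	int len;

	/* Read the lane status register for the currently selected lane and
	 * copy the decoded value from the kernel buffer to the user buffer. */
	val = readl(priv->base + RASDES_SD_LANE_STATUS + priv->lane * 4);
	len = scnprintf(buf, sizeof(buf), "0x%x\n", !!(val & LANE_DETECT));

	return simple_read_from_buffer(ubuf, count, ppos, buf, len);
}

static ssize_t lane_detect_write(struct file *file, const char __user *ubuf,
				 size_t count, loff_t *ppos)
{
	struct rasdes_priv *priv = file->private_data;
	u32 lane;
	int ret;

	/* Writing a lane number selects which lane a later read reports on. */
	ret = kstrtou32_from_user(ubuf, count, 0, &lane);
	if (ret)
		return ret;

	priv->lane = lane;
	return count;
}

const struct file_operations lane_detect_fops = {
	.open	= simple_open,
	.read	= lane_detect_read,
	.write	= lane_detect_write,
};
```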
This is the experiment that we conducted to make use of the debug feature. The problem we were facing was that when we were trying to link up between the RC and the EP, the LTSSM state was stuck in Detect.Quiet, it was not moving forward, and the link up was failing. First, I would like to show you that when you cd to /sys/kernel/debug, you can see the folder for your PCIe platform present there. Inside that, you can see the RASDES debug entries, inside which there are different features like force detection, lane detection and RX valid. Now we echo 1 to the RX valid file, which means I am trying to read the RX valid value for lane 1. When I cat the same file, it shows zero. Similarly, I do that for lane detection and read back zero. These zero values show that neither has the lane been detected nor is RX valid. At this point this is expected, as we have not yet enabled the LTSSM and the state transitions have not begun, so the zero values are expected. Then I ran the link-up script, which initiated the link up between the RC and the EP. But what we received was a "Phy link never came up" message, which indicated that the link up did not happen properly. After this, the first thing we did was read the current LTSSM state and realize that it was still stuck at Detect.Quiet. Now I wanted to check whether the lanes were valid and whether the lane had been detected, so we echoed 1 to the RX valid file, which selects lane 1, and read RX valid back. I found the value to be one, which means RX valid had been detected. RX valid is something that can tell you whether a lane is broken or not: if the value had been zero, it would have indicated physically broken lanes in the PCIe system, but in our case it was one, indicating that there were no broken lanes. Next, I went on to read the lane detection value. Here the lane detection value was still zero, which indicated that something was wrong with the lane detection, and it made us curious to know why this was happening. On further debugging, it was identified that, due to a PHY parameter issue, the lane was not detected in time even though it was present. Each LTSSM state has a timeout, and the PHY wasn't detecting the lane within that particular time frame. So we tried to force lane detection, which is another debug feature: lane detection can be forced from software. When I wrote 0xf, which means force detection for all lanes 0, 1, 2 and 3, and then tried to link up, we could finally see the link come up successfully. So this is how the various debug features, like lane detection, RX valid, whether your RX is idling, forcing LTSSM states and holding LTSSM states, can be useful in debugging a link-up scenario, a width downgrade scenario, a speed downgrade scenario and so on.

Now moving to the error injection implementation. Here we have not created separate functions for each error injection, as the register offsets are in order. When we do a cat on a file and the read function is called, we extract the error number from the file name; depending on which file it is, the error number is extracted and the offset is set accordingly. Then we read the register of that particular error to report the number of errors that are left to be injected. Similarly, in the case of a write, we copy the value from the user buffer and program the respective register to enable error injection and update the count of errors to be injected. This again extracts the error injection number from the file name. There are overall around 35 errors, and all of them use the same read and write functions.
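A hedged sketch of how one shared handler pair could recover the error number from the debugfs file name is shown below. The name table, register offsets and bit fields are illustrative assumptions, and rasdes_priv is the structure from the earlier sketch.

```c
#include <linux/bits.h>
#include <linux/debugfs.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/uaccess.h>

/* Illustrative layout: one register per error, laid out contiguously. */
#define ERR_INJ_BASE		0x2000
#define ERR_INJ_STRIDE		0x4
#define ERR_INJ_COUNT_MASK	GENMASK(7, 0)
#define ERR_INJ_ENABLE		BIT(8)

static const char * const err_inj_names[] = {
	"tx_lcrc", "rx_lcrc", "tx_ecrc", "rx_ecrc", /* ... ~35 entries ... */
};

static int err_inj_number(struct file *file)
{
	const char *name = file->f_path.dentry->d_name.name;
	int i;

	/* One handler pair serves every error-injection file: the error
	 * number is recovered from the debugfs file name. */
	for (i = 0; i < ARRAY_SIZE(err_inj_names); i++)
		if (!strcmp(name, err_inj_names[i]))
			return i;
	return -EINVAL;
}

static ssize_t err_inj_read(struct file *file, char __user *ubuf,
			    size_t count, loff_t *ppos)
{
	struct rasdes_priv *priv = file->private_data;
	int nr = err_inj_number(file);
	char buf[16];
	u32 left;
	int len;

	if (nr < 0)
		return nr;

	/* Report how many errors are still left to be injected. */
	left = readl(priv->base + ERR_INJ_BASE + nr * ERR_INJ_STRIDE) &
	       ERR_INJ_COUNT_MASK;
	len = scnprintf(buf, sizeof(buf), "%u\n", left);
	return simple_read_from_buffer(ubuf, count, ppos, buf, len);
}

static ssize_t err_inj_write(struct file *file, const char __user *ubuf,
			     size_t count, loff_t *ppos)
{
	struct rasdes_priv *priv = file->private_data;
	int nr = err_inj_number(file);
	u32 num;
	int ret;

	if (nr < 0)
		return nr;

	ret = kstrtou32_from_user(ubuf, count, 0, &num);
	if (ret)
		return ret;

	/* Arm the injection: program the count and set the enable bit. */
	writel(ERR_INJ_ENABLE | (num & ERR_INJ_COUNT_MASK),
	       priv->base + ERR_INJ_BASE + nr * ERR_INJ_STRIDE);
	return count;
}

const struct file_operations err_inj_fops = {
	.open  = simple_open,
	.read  = err_inj_read,
	.write = err_inj_write,
};
```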
The error response can be checked by combining the debug and statistics features after injection. For example, when a framing error is injected and it occurs, the PCIe system is supposed to move to the Recovery state. So after injecting the error, we can read the debug registers to note that the current state is Recovery, we can read the framing error pointer, and we can also check whether the count of framing errors has increased. Or, after an LCRC error, the system is supposed to retransmit the TLPs and recover on its own, so we can check the TLP retransmission count and make sure the data was transferred properly.

Now here we can see the experiment that we did to inject errors and test the system response. I am going to take the example of LCRC error injection. Again, we need to go to /sys/kernel/debug, but this time, instead of going into the debug folder, we go into the error injection folder. Inside the error injection folder you can see the huge number of errors that can be injected; for example, here we can see Rx ECRC, Rx LCRC, Tx ECRC and Tx LCRC. Suppose we want to inject an LCRC error. Before injecting any error, I enable the LCRC counter, so that after injection we can check whether the error was actually injected. So here I am just echoing 1 to counter enable to enable the counter, and then I read the counter value to make sure it is zero before injecting any errors. Next, I am going to inject the LCRC Rx error. For that, let us inject just one error, so I write 1. When I read it back, it shows me 1, which means that one error is left to be injected. The counter value is still zero because the injection has not yet been done; as I mentioned before, the injection only happens when a transaction is initiated. So now we initiate a transaction using a custom script that we have. Here, as you can see, the test result came out as OK, which means the test did not fail and the data was transmitted successfully. This is because LCRC is a correctable error. By this, we made sure that the system response to this error is correct. Now, when we read the LCRC counter value, it is 1, indicating that the error injection actually happened and the system was able to detect this error. When we now read the RASDES error injection value, it shows zero, indicating that zero errors are left to be injected and the one error has been injected.

Next is the software implementation for statistics. Here again, we have created a single read and write function for counter enable and disable, and we extract the event number using the file name. The counter value is a read-only file where we read the corresponding hardware register and inform the user of the number of times the event has occurred. Similarly, for lane select, we have write and read functions which copy the required data from the user buffer and set the lane selection in the hardware.
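A hedged sketch of the statistics handlers follows. It reuses the rasdes_priv structure and the simple_open pattern from the earlier sketches; the event-select register, field positions and event names are illustrative assumptions, and here the event is recovered from the name of the per-event directory (the parent of counter_enable / counter_value / lane_select).

```c
#include <linux/bits.h>
#include <linux/debugfs.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/uaccess.h>

/* Illustrative event counter registers: a control register that selects the
 * event (and enables its counter) and a data register with the count. */
#define EVENT_CTRL		0x3000
#define EVENT_CTRL_ENABLE	BIT(0)
#define EVENT_CTRL_SEL_SHIFT	8
#define EVENT_DATA		0x3004

static const char * const event_names[] = {
	"l1_entry", "l0s_entry", "tx_ack_dllp", "ebuf_overflow", /* ... */
};

static int event_number(struct file *file)
{
	/* counter_enable, counter_value and lane_select live in a directory
	 * named after the event, so the parent dentry identifies it. */
	const char *name = file->f_path.dentry->d_parent->d_name.name;
	int i;

	for (i = 0; i < ARRAY_SIZE(event_names); i++)
		if (!strcmp(name, event_names[i]))
			return i;
	return -EINVAL;
}

static ssize_t counter_value_read(struct file *file, char __user *ubuf,
				  size_t count, loff_t *ppos)
{
	struct rasdes_priv *priv = file->private_data;
	int ev = event_number(file);
	char buf[16];
	u32 ctrl;
	int len;

	if (ev < 0)
		return ev;

	/* Select the event (preserving the enable bit), then report how many
	 * times it has occurred. */
	ctrl = readl(priv->base + EVENT_CTRL) & EVENT_CTRL_ENABLE;
	writel(ctrl | (ev << EVENT_CTRL_SEL_SHIFT), priv->base + EVENT_CTRL);
	len = scnprintf(buf, sizeof(buf), "0x%x\n",
			readl(priv->base + EVENT_DATA));
	return simple_read_from_buffer(ubuf, count, ppos, buf, len);
}

static ssize_t counter_enable_write(struct file *file, const char __user *ubuf,
				    size_t count, loff_t *ppos)
{
	struct rasdes_priv *priv = file->private_data;
	int ev = event_number(file);
	bool enable;
	int ret;

	if (ev < 0)
		return ev;

	ret = kstrtobool_from_user(ubuf, count, &enable);
	if (ret)
		return ret;

	/* Writing 1 enables the counter for this event, 0 disables it. */
	writel((ev << EVENT_CTRL_SEL_SHIFT) |
	       (enable ? EVENT_CTRL_ENABLE : 0),
	       priv->base + EVENT_CTRL);
	return count;
}
```

The file_operations wiring is the same pattern as in the earlier sketches.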
As discussed before, the statistics feature can dump data regarding fatal errors, non-fatal errors, correctable errors, uncorrectable errors and so on. However, it differs from AER in that this debugfs data is also available to be read on the EP side; as we know, AER only has support to report errors on the RC side. And of course it has much, much more information compared to what is tapped by AER. Now, anyone who has used a hardware PCIe analyzer knows that it generates a report containing a pictorial representation of how much time the link has spent in each LTSSM state, be it Recovery, Configuration, L0 or L1. In our case, we can get the same information by simply reading a debugfs entry. This is how the framework helps in efficient debugging.

Let's try to understand one of the experiments that we conducted with the statistics feature. We were facing a problem with extra power consumption in one of our test suites. To debug this, we tried to read the L1 entry counter, just to understand how many times the link enters L1, how much time it spends in L1 and so on. First I will explain the directory hierarchy again. We go to /sys/kernel/debug, and this time, instead of debug or error injection, we go to the RASDES counters directory. Inside that there are various statistics folders for the different event counters: you can see completion timeout error, abort error, decode error, L1 entry, the current speed, how many times the speed has changed, how many times the width has changed, unsupported request error and so on. When we go inside any particular event folder, for example L1 entry, we can see the three different files: counter enable, counter value and lane select. On reading the counter enable, it shows disabled, as by default all the counters are disabled. The counter value is zero because nothing has happened yet and the counter is not enabled.

So first we enable the counter. I enable the L1 entry counter, and on reading it back I can see that the counter is enabled. The value on reading is still zero because the link has never entered L1 yet. Now we run a script to enable L1. The LTSSM state reads back as 14, which shows that the link is in L1. Now we cat the counter value, and I saw that the value was 0x6A, even though it should have been one, because we initiated the L1 entry only a single time. On further reads of the counter value after a few seconds, I could see that the L1 entry count kept increasing. It was more than one despite there being no active transactions to cause L1 exit and re-entry; on every read after a delay, the counter was increasing. This led us to discover the ZRX-DC compliance issue in the PHY. A non-ZRX-DC-compliant PHY exits the L1 state every 100 milliseconds. This is why we were facing excessive power consumption: even though the link was in L1, it kept exiting back to L0, which included the exit latency. We do have a MAC register which can flag this and skip these repeated exits and entries, so we can inform the MAC that our PHY is ZRX-DC compliant, and in that case it will skip the repeated entries and exits.

So we tried the same experiment again after configuring the controller for a ZRX-DC-compliant PHY. We wrote 0 to counter enable, which disables the counter and also resets the counter value back to 0. On reading back the counter enable, I could see that the counter was disabled. Next we also read the counter value, and it showed 0 as expected. Then we performed various PCIe workloads which would cause L1 exit and re-entry; the counter remained 0 as the counter was disabled at that point. After a while we enabled the counter and then ran a PCIe load. In this case the exit to L0 and the re-entry to L1 happened once, and hence on reading the counter value I could see the value is 1. We waited for some time and read the counter value again; it was still 1, indicating that multiple exits and re-entries were no longer happening. On another transaction the exit and re-entry were initiated again, and this showed up in the counter value on the next read. So this gives us an idea of how this feature can be used to detect and debug various kinds of issues.

Coming to the conclusion and future scope: the current implementation is a great advantage for critical automotive applications, since automotive solutions mandate reliable and easily repairable systems. Also, the error reports are handled by corrective mechanisms, which will reduce human intervention. The error information projected to the user helps in understanding the nature of the error, thereby increasing the robustness of the system. The future scope of this development involves implementing a mechanism to automatically detect RAS errors and resolve them without taking them up to the user level, so we hope that soon we will be able to detect and recover from RAS errors automatically. The developed framework can also be integrated into the Linux kernel RAS subsystem as a generic implementation, thereby providing the flexibility of using it across various architectures and other PCIe vendors. Thank you. My team members Padmanabhan, Pankaj and I are here in case you need help with any query. Thank you.