Hi all, my name is Shraddha and I am currently working in the system software team at Samsung Semiconductor India R&D. On behalf of my co-author Padman Avin, I am here to present a talk on Debug PCI: making PCI common error debugging easier.

Through this presentation we are going to cover the following points. First, an introduction to the debug features provided by the Synopsys DesignWare controller. Then the motivation behind creating this tool, or why it is important to have a streamlined debugging process. Then a walkthrough of PCI Utils, the open-source user-space tool suite for PCI. Then we will introduce Debug PCI, a command-line tool integrated into this very PCI Utils package, followed by the architecture and where exactly in the software stack our tool sits. We will take a quick glance at the code snippets and the implementation of the tool before we move to some real use cases and understand how this tool can be used to minimize manual effort and increase efficiency.

Peripheral Component Interconnect Express, or PCIe as we generally know it, is a serial expansion bus standard for connecting a computer to one or more peripheral devices. Nowadays PCI has become so common and so extensively used to connect devices like graphics cards, NICs, SSDs and many other high-performance peripherals that it hardly needs an introduction. There is an ever-growing need for faster connectivity in today's digital world, and PCI is keeping up with the demand: the demand for higher bandwidth, faster speeds, more lanes, simultaneous data delivery and shorter channel lengths to fit into smaller SoCs. The evolution of this technology from PCI Gen 5 to now PCI Gen 6 is happening at an unimaginable rate. But with that, the risk of errors is also increasing. External factors like power surges, noise, attenuation and EMI cannot be avoided, and PCI is becoming exposed to higher fault rates.

PCI is now the primary connectivity interface used across all critical applications, be it data servers, automotive, AI or high-performance computing, and as we all know, with great power comes great responsibility. So SoC designers, SoC users, PCI engineers and application developers all need to make sure that the PCI setup they are using is reliable and validated thoroughly for all conditions. Speaking from my experience as a PCI engineer: like any other hardware, PCI is susceptible to link failures all the time, and with the vast number of possible errors it can hit, it becomes really difficult to debug, making monitoring, data collection and fault isolation for PCI-based components challenging. Most IP vendors treat error reporting, error logging and error recovery as a top priority and provide a huge set of hardware registers with an immense amount of data to ease debugging efforts. Yet this aspect is not explored to a great extent, and there is no common software system to support it. In this presentation we have used this very information provided by various IP vendors and converted it into a diagnostic tool which can be easily integrated into any Linux system. We will see the need, the implementation and the result of this tool in the upcoming slides.

Before going into the tool, let us quickly glance at the debug features that PCI controllers provide. There are two types of capabilities in the PCI configuration address space.
One is the standard PCI extended capabilities, which are common across all vendors and all architectures; this is a protocol-specific implementation. The second is the vendor-specific capabilities, which differ from vendor to vendor. Information such as the flagging of correctable, uncorrectable, fatal and non-fatal errors is part of the common configuration register space, in the PCI capability called Advanced Error Reporting, or as we commonly know it, AER. But some information is specific to the individual vendors. Most vendors, like Xilinx or PLDA, also have their own vendor-specific extended capability to support this kind of error management, but we have created this tool for the DesignWare IP and will take that as the reference for the rest of this presentation.

Talking about the DesignWare controller, specific software-accessible registers are present with very useful data about the current internal state of the IP. For example, they provide the state of the link, the state of the equalization phases on each lane, the preset values being used and so on. They also expose the state of the different queues and buffers: whether there is any overflow or underflow, whether anything is pending in the retry buffer, and much more. There are error counters for different types of errors such as ECRC, LCRC, flow control errors, unsupported requests, completer aborts and almost any other error condition you can think of that the PCI controller can capture. Non-error events are counted as well, like how many times the retry buffer was used or how many NAK responses were received; the number of RX and TX transactions made is also available. This kind of information can be used to evaluate whether there is any performance degradation or unusual behavior that can lead to failures in the future.

Now, why is a streamlined debugging procedure necessary? PCI is a complex protocol and gives little or no visibility into problems that may have a long-term impact. Of course, there are existing ways to increase this visibility inside a device, and some commonly used methodologies to troubleshoot issues. A PCI analyzer, as we can see in this picture, is a test solution used to capture and interpret data packets transferred over a PCI interface. It includes hardware components which need to be connected between the two link partners so that it can capture the traffic, and software components to display the data in an understandable format. The protocol analyzer is the most versatile tool to debug PCI issues. But why does it need an alternative? Some of its disadvantages are: first, it is costly; second, it is time consuming; third, in some cases, like a chip-to-chip connection where the PCI traces are embedded in the PCB, it is impossible to connect an analyzer at all; and fourth, with today's working model, where employees do not always visit the office physically and many of us work from home, connecting hardware analyzers can be a difficult process. This is where the debug registers explained in the previous slide come into the picture, and we can aim to narrow down, or sometimes even solve, complex issues using only software features. Software these days does have the capability to capture all the hardware information in detail, so the dependency on these complex hardware tools can be minimized.
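As a concrete illustration of what this software-only access looks like, here is a minimal sketch using libpci that reads the AER uncorrectable and correctable error status registers of a device. It assumes a libpci recent enough to provide pci_find_cap() and PCI_FILL_EXT_CAPS; the offsets within the AER capability follow the PCIe specification, and the example bus/device/function numbers are placeholders.

    /* Minimal sketch: read the AER status registers of bus 01, device 00,
     * function 0 using libpci. Offsets within the AER extended capability
     * follow the PCIe specification; real code should also decode the
     * individual status bits. */
    #include <stdio.h>
    #include <pci/pci.h>

    #define AER_UNCOR_STATUS 0x04   /* Uncorrectable Error Status register */
    #define AER_COR_STATUS   0x10   /* Correctable Error Status register   */

    int main(void)
    {
        struct pci_access *pacc = pci_alloc();
        pci_init(pacc);

        struct pci_dev *dev = pci_get_dev(pacc, 0, 0x01, 0x00, 0);  /* example BDF */
        pci_fill_info(dev, PCI_FILL_EXT_CAPS);       /* populate the extended capability list */

        struct pci_cap *aer = pci_find_cap(dev, PCI_EXT_CAP_ID_AER, PCI_CAP_EXTENDED);
        if (aer) {
            u32 uncor = pci_read_long(dev, aer->addr + AER_UNCOR_STATUS);
            u32 cor   = pci_read_long(dev, aer->addr + AER_COR_STATUS);
            printf("AER uncorrectable status: 0x%08x\n", uncor);
            printf("AER correctable status:   0x%08x\n", cor);
        } else {
            printf("Device does not expose the AER extended capability\n");
        }

        pci_free_dev(dev);
        pci_cleanup(pacc);
        return 0;
    }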
But even with this software-based debugging we hit another problem: a lot of manual effort, and hence a huge chance of human error. Let me explain. In the absence of a software structure to support proper capture and display of the data, we might have to obtain the debug data using other mechanisms, such as JTAG. JTAG, as we all know, was designed to assist with device, board and system testing, diagnosis and fault isolation, and today it is used as a primary means of accessing sub-blocks, making it an essential mechanism for debugging embedded systems. So first we need to obtain the debug data, as shown in this picture, using trace or some other way to dump memory. Then we have to go through the entire effort of checking the user manuals for the register and bit information. For example, if we have a link stability issue, we need to find all the possible fields that contain information related to the link, and because there are so many, it is almost impossible to remember them all. Then we need to check the values in the dump and do some bitwise operations to obtain the required field, because the trace will only show the full register value. And once we have the value we need, we have to try to conclude whether everything looks fine and is as expected. This process, as you can understand, is huge, and it involves a lot of manual effort. Let us see how we can save you, and everyone working on PCI, from this and make PCI debugging a little easier.

So what is PCI Utils? PCI Utils is a package that contains a library for portable access to PCI bus configuration registers and several utilities based on this library. It runs on Linux, FreeBSD, OpenBSD, Windows, Darwin, Cygwin, and there are ongoing efforts to add support for other systems as well. Currently the two most commonly used utilities are lspci and setpci: lspci displays information about all the PCI buses and devices, and setpci allows us to read from and write to PCI device configuration registers. The open-source link for the library is given in the slide, and everybody can have a look.

Let's take a closer look at this lspci dump. How much time do you think it would take to scan through these extremely hard-to-read lines and find whether any error condition was set? lspci gives information about the link speed, the link width, whether there was a completion timeout, whether LTR is enabled and many such other things. But this information is incomplete, apart from being a little difficult to read. For example, the link status here shows that equalization completed in phase one but was incomplete in phase two and phase three. How did I know this? This plus sign shows that equalization for phase one was complete, and these minus signs show that it was not completed for phase two and phase three. But we fail to get any further information: on which lanes did the equalization of phases two and three actually fail? Was there any other error condition set or flagged along with this? There is also a subset of errors shown here in the uncorrectable error status and here in the correctable error status, such as bad TLP, bad DLLP, receiver errors, flow control protocol errors and completer abort errors. These kinds of errors are visible in the lspci output, but again the data is incomplete: if the error is a lane-specific error, we fail to get the information of which lane the error occurred on.
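Whether the data comes from a JTAG dump or from raw config-space reads, the manual step that has to be repeated for every field of interest is the same mask-and-shift exercise. A tiny sketch of that step, with a purely hypothetical register value and field position:

    /* Sketch of the manual decode step: extract a multi-bit field from a raw
     * 32-bit register value taken from a trace/JTAG dump. The register value,
     * bit position and width here are only placeholders for illustration. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t extract_field(uint32_t reg, unsigned int shift, unsigned int width)
    {
        return (reg >> shift) & ((1u << width) - 1u);
    }

    int main(void)
    {
        uint32_t raw   = 0x00A31402;                 /* value copied from a memory dump       */
        uint32_t field = extract_field(raw, 8, 6);   /* e.g. a 6-bit state field at bits 13:8 */
        printf("decoded field value: 0x%x\n", field);
        return 0;
    }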
The lspci output also fails to tell us how many times an error occurred; we just get a plus or a minus to indicate whether the error was flagged, but not the number of occurrences. On top of that, there is no conclusion as to why these errors can occur, nor does the lspci output give us any suggestion on what can be done to debug the issue further.

That brings us to Debug PCI, the solution to all these problems. We have seen the manuals for using PCI analyzers: it is a complicated process with lots of do's and don'ts and long procedures to begin captures and produce dumps. This, in contrast, is a simple command-line tool. The command used for capturing and debugging data is the keyword debugpci, followed by the bus number, device number and function number of the device that we want to debug, and then either a 'c' for capture, meaning begin capturing information, or a 'd' for dump, meaning dump the information and provide the root-cause analysis. So by now you must have understood: this tool will first capture all the debug information, then dump it in a human-readable format, and then, in one or two line statements, give the probable cause of the error and the next steps that can be taken. For example, there is a register in the DesignWare controller, the SD_CONTROL2 register, which indicates whether there was a receiver-detection-related timeout. Suppose this bit gets set during our validation. The tool will give a message like: a receiver detection timeout was seen; if the PHY requires more time for receiver detection, the application software can hold the LTSSM in Detect.Active by setting the HOLD_LTSSM field of the SD_CONTROL2 register. So it tells us what kind of error was seen, why the error was probably flagged, and the further steps the application software can take to debug it.

If we look at the architecture and stack of PCI: in the physical layer, the lowest layer, there is the PHY, which is responsible for the actual data transmission. PHY is the abbreviation for the physical layer, and it takes care of functionalities like serialization of data, scrambling, encoding and decoding. Basically, it is the electrical, mechanical and procedural interface to the transmission medium, and it creates the link between the upper layers and the actual physical medium such as optical fiber or copper cable. Above that is the kernel layer, where the low-level drivers for all IP protocols live. With respect to PCI, this is the PCI core, or PCI subsystem layer, with support for PCI features like MSI, AER, PME and config read/write, which is common functionality across all vendors and all architectures; every vendor implements these functionalities in the same way because they are protocol-specific. Then we also have the platform-specific drivers, like pci-tegra or pci-exynos, which program the controller to perform various controller-specific tasks. The kernel layer also includes the function driver, which binds to the link partner and provides the required functionality. The topmost layer, the user-space layer, is where the pcilib library and the PCI Utils we spoke about sit. Utilities like lspci and setpci are applications written in C, and this is where we have added our new C file, debugpci.c, to enhance the PCI Utils package and implement the required functionality.
The PCI Utils source code is built using the appropriate compilers; in some cases we might need a cross compiler, when the device under test is of a different architecture. The output binary that is generated is then placed in a generic location in the virtual file system, the VFS.

Coming to the code implementation, we have the main function here, which I am going to take you through. First we define a pointer to a struct pci_access, and a pointer to a struct pci_dev called dev. Then we check that the argument count is not less than 3; otherwise the command was not given properly and we return safely. We obtain the pci_access structure first using the API pci_alloc(), which allocates memory for it. Then we initialize a PCI filter; this filter will be used to match the bus, device and function number of the required device. Argument 1 is expected to be the bus, device and function number, so we use it to fill the filter with pci_filter_parse_slot(). Then we initialize the PCI library using pci_init(), passing the struct pci_access pointer, and get the list of all devices connected to the system using pci_scan_bus(), again with the pci_access pointer as the argument. Then we iterate over all the devices one by one and try to find the right one: we initialize the dev pointer to the head of the device list, and after each iteration we move it to the next device in the list as long as the pointer is valid. For each device we use pci_filter_match() to check whether its bus, device and function number match the one given in our command. Then we look at argument 2: if the value is 'c', it means capture, so we call the Debug PCI capture function; if it is 'd', it means dump, so we call the Debug PCI dump function. Finally we close everything and finish with pci_cleanup().
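Put together, the flow just described maps onto libpci roughly as in the sketch below. This is a simplified reconstruction, not the actual source of debugpci.c; the helpers debugpci_capture() and debugpci_dump() are placeholders standing in for the tool's capture and dump routines.

    /* Simplified sketch of the main() flow described above, built on libpci. */
    #include <stdio.h>
    #include <pci/pci.h>

    static void debugpci_capture(struct pci_dev *dev) { (void)dev; /* enable event counters */ }
    static void debugpci_dump(struct pci_dev *dev)    { (void)dev; /* read counters, print report */ }

    int main(int argc, char **argv)
    {
        struct pci_access *pacc;
        struct pci_dev *dev;
        struct pci_filter filter;
        char *err;

        if (argc < 3) {
            fprintf(stderr, "usage: %s <bus>:<dev>.<func> c|d\n", argv[0]);
            return 1;
        }

        pacc = pci_alloc();                       /* allocate the access structure */
        pci_filter_init(pacc, &filter);           /* prepare the b:d.f filter      */
        if ((err = pci_filter_parse_slot(&filter, argv[1]))) {
            fprintf(stderr, "bad device address: %s\n", err);
            return 1;
        }

        pci_init(pacc);                           /* initialize the library        */
        pci_scan_bus(pacc);                       /* enumerate all devices         */

        for (dev = pacc->devices; dev; dev = dev->next) {
            if (!pci_filter_match(&filter, dev))  /* skip everything but our device */
                continue;
            if (argv[2][0] == 'c')
                debugpci_capture(dev);
            else if (argv[2][0] == 'd')
                debugpci_dump(dev);
        }

        pci_cleanup(pacc);
        return 0;
    }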
The next function is the capture function, where we first try to find the RAS DES capability base. As we know, the PCI capabilities form a linked list; we iterate through this list and try to find the RAS DES capability structure. If the capability is not present, we exit gracefully, saying the tool cannot be used because the device does not support the debug features. Next we have a struct called event_counters, with member variables for the group ID, the event ID and the name of the error condition, and we define a static array of this type named events, containing the list of all the error counters that the DesignWare IP provides. We iterate through it one by one in a for loop, set the event ID in the corresponding event ID bits, set the group ID in the corresponding group ID bits, and then set the event enable. If the group ID is zero, the error is a lane-specific error, so we additionally iterate through every possible lane and set the corresponding lane number as well. Once we have this value, which carries the group ID, the event ID, the lane number and the event enable bits, we write it to the event control register offset so that the capture can begin.

Coming to the dump function: it again walks through each and every event counter mentioned in the previous slide, like bad TLP, bad DLLP, deskew error, FC timeout and header error, reads the count of how many times that event occurred and prints this information. This is done by iterating through the same structure: we select the corresponding group and event by setting these bits in the event selection register, and then read back the number of times the error occurred. If it is a lane-specific error, we read the count for each and every lane by also setting the lane field; otherwise it is just a single, feature-based reading. We also have a static structure containing the register details of all the other debug registers: another array called debug, with debug registers like RX valid, RX electrical idle, receiver detection and some other information, which is iterated through, and the required information is displayed either lane-wise or feature-wise. There is a parameter called lane_debug, which we set when a particular debug feature is lane-specific; if it is lane-specific, we print the information lane by lane. Finally we call the root-cause function, which internally has a database of expected values for each of these register fields. The function matches the read value against the expected value, and in case of any discrepancy it goes through a flowchart of possible scenarios, collects further required information and prints a message about the error.
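The shape of that capture/dump loop is roughly as follows. This is only an illustrative sketch: the event table entries, the bit positions inside the event counter control word and the register offsets within the RAS DES capability are placeholders and must be taken from the DesignWare databook, not from this slide.

    /* Illustrative sketch of the capture/dump loop over the vendor event counters.
     * All offsets, bit positions and event IDs below are placeholders. */
    #include <stdio.h>
    #include <pci/pci.h>

    #define EVENT_CTRL_OFF  0x08        /* placeholder: event counter control register */
    #define EVENT_DATA_OFF  0x0C        /* placeholder: event counter data register    */
    #define EVENT_ENABLE    (0x3 << 2)  /* placeholder: "per event on" enable encoding */

    struct event_counter {
        u8 group_id;
        u8 event_id;
        const char *name;
    };

    static const struct event_counter events[] = {
        { 0, 0x01, "Receiver Error" },  /* group 0: lane-specific events (placeholder IDs) */
        { 1, 0x02, "Bad TLP"        },
        { 1, 0x03, "Bad DLLP"       },
    };

    static void capture_events(struct pci_dev *dev, u32 ras_base, int num_lanes)
    {
        for (unsigned int i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
            u32 val = (events[i].group_id << 24) | (events[i].event_id << 16) | EVENT_ENABLE;
            if (events[i].group_id == 0) {
                /* lane-specific event: enable it once per lane */
                for (int lane = 0; lane < num_lanes; lane++)
                    pci_write_long(dev, ras_base + EVENT_CTRL_OFF, val | (lane << 8));
            } else {
                pci_write_long(dev, ras_base + EVENT_CTRL_OFF, val);
            }
        }
    }

    static void dump_events(struct pci_dev *dev, u32 ras_base, int num_lanes)
    {
        for (unsigned int i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
            u32 sel = (events[i].group_id << 24) | (events[i].event_id << 16);
            if (events[i].group_id == 0) {
                for (int lane = 0; lane < num_lanes; lane++) {
                    pci_write_long(dev, ras_base + EVENT_CTRL_OFF, sel | (lane << 8)); /* select counter */
                    u32 count = pci_read_long(dev, ras_base + EVENT_DATA_OFF);
                    printf("%-16s lane %d : %u\n", events[i].name, lane, count);
                }
            } else {
                pci_write_long(dev, ras_base + EVENT_CTRL_OFF, sel);
                printf("%-16s        : %u\n", events[i].name,
                       pci_read_long(dev, ras_base + EVENT_DATA_OFF));
            }
        }
    }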
Now let's move to some real use cases, common PCI issues faced by system engineers, and see how our tool comes into the picture. Receiver detection occurs after reset and is used to determine whether a link partner is present so that the link can come up. Sometimes it so happens that the link is stuck in the Detect.Active state and fails to come up, or the link does come up but in a narrower width than expected. The common debug sequence would be to check the receiver detection status, that is, on which lanes a receiver was detected, and to check whether any timeout error was seen during detection. These things are detected by our tool and presented to the user. If the LTSSM goes to Polling.Compliance and does not move forward, or comes up in a narrower width, it could also mean that some lanes are physically broken. In that case the lanes will be detected, but the RX valid value will show 0 in the SD status lane register. The receiver could also have errors like a buffer overflow or a deskew logic overflow, which need to be checked in such cases. Again, our tool has all this information in one place.

There can sometimes be a speed change issue, where the link goes down or trains to a lower speed. This can be an equalization issue, and we need to check whether each phase of equalization completed successfully on each lane. We need to check the receiver preset hint, check the local figure of merit, and check whether any preset or coefficient request was rejected by the link partner. And if equalization has not completed successfully, we need to check whether there was any direction-change feedback from the PHY that is causing a coefficient rule violation, or any other such information which can help us understand why this speed change issue has occurred. Our tool comes to the rescue again and provides all this data in one place. For example, it could suggest that some PHYs may have longer link evaluation latencies and need more link evaluation iterations to optimize the RX signal quality, so try increasing the upstream and downstream timeout limits by setting bit 5 of the GEN3_EQ_CONTROL register to 1. Or it could suggest that a predefined set of coefficients can be fixed for debugging, and tell us in which register that can be done. So again, our tool provides the root cause in two or three lines and also gives suggestions on what can be done next.

Another interesting reason why our tool is better than lspci or any other existing mechanism is that, while the AER registers will only state that there was a receiver error, our tool, with the help of the DesignWare statistics registers, can tell you whether it was a decode error, a disparity error, an RX elastic buffer overflow error and so on. When receiver errors are seen, the PCI link does not remain stable in L0 and keeps going through recovery. Then there are correctable errors, which are the most common consequence of poor link quality and which lead to link instability. Correctable errors are not fatal and your system will keep working even after they are flagged, but the number of times these errors occur may suggest that the link quality is poor and that we might have to tune some parameters. Uncorrectable errors can in turn be of two types, fatal or non-fatal, depending on the severity. If it is a very severe error it is fatal: the link will suddenly go down and the PCI system will stop working. With non-fatal errors the link does go down, but the system can recover through a reset or some other such mechanism. Then we have ASPM issues. ASPM is the feature which allows PCI to enter low-power states for power saving, such as L0s, L1, L1.1, L1.2 and L2, the different power states that the PCI controller can enter. Sometimes, when there are issues entering or exiting these low-power states, we hit ASPM issues, which also need a certain debugging process to figure out what the problem is, and again our tool provides the facility to do that.

Let me show an example of the output. This is a case where the PCI link is healthy. Doesn't it look like one of our blood reports? The aim was to make PCI debugging as clear as that. What is the usual process we follow? We get a blood test done, and the report shows the different elements like the WBC, RBC or platelet count and their amounts in our blood. Then we take this report to the doctor, who compares it with the expected results, and in case of any issue they diagnose the probable cause and tell us about the further steps. Our tool here is the pathologist and the doctor combined into one: it takes the debug information from the registers and prints the report in a human-readable format, then compares this data against the expected set of values, and in case of any discrepancy it tries to diagnose why the error has occurred.
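One possible way to encode that "expected value plus diagnosis" database is a simple table of rules, as sketched below. The register offsets, field positions, expected values and messages here are illustrative placeholders based on the examples in this talk, not the tool's actual database.

    /* Illustrative sketch of a root-cause rule table: each entry names a debug
     * register field, its expected value, and the message printed when the
     * value read from hardware does not match. All entries are placeholders. */
    #include <stdio.h>
    #include <pci/pci.h>

    struct root_cause_rule {
        const char *field;      /* human-readable field name                   */
        unsigned int offset;    /* register offset within the debug capability */
        unsigned int shift;     /* bit position of the field                   */
        unsigned int width;     /* field width in bits                         */
        u32 expected;           /* value seen on a healthy link                */
        const char *diagnosis;  /* one- or two-line root-cause message         */
    };

    static const struct root_cause_rule rules[] = {
        { "Receiver detection timeout", 0x5C, 0, 1, 0,
          "Receiver detection timed out; if the PHY needs more time, hold the LTSSM in Detect.Active." },
        { "FC protocol error count", 0x60, 0, 8, 0,
          "No DLLP received within a 200 us window; link quality is severely deteriorated." },
    };

    static void root_cause_check(struct pci_dev *dev, u32 base)
    {
        for (unsigned int i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
            u32 reg = pci_read_long(dev, base + rules[i].offset);
            u32 val = (reg >> rules[i].shift) & ((1u << rules[i].width) - 1u);
            if (val != rules[i].expected)
                printf("%s = %u (expected %u)\n  -> %s\n",
                       rules[i].field, val, rules[i].expected, rules[i].diagnosis);
        }
    }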
What is displayed here is just a small snippet; there are many more fields, like the EQ presets and the equalization state of each lane in each phase. It also has the RX electrical idle status. All of this makes it a go-to, all-situations kind of report.

Let's move to an uncorrectable error scenario. Uncorrectable errors, as we discussed, render the link and the related hardware unreliable. This is seen when one of the link components is severely broken, and it forces the link to go back into the Detect state. So, how about trying to find the error here? Isn't it much easier compared to the previous lspci dump? We can clearly see that the FC protocol error has been set, and we also get the count of the number of times it occurred, which here is one. Below it is the root-cause analysis, which tells us why the flow control protocol error occurred: it occurs if no DLLP is received within a 200 microsecond window, and it indicates that the link quality is severely deteriorated.

Next, framing errors. Suppose a framing error occurs. There are debug registers which give detailed information about the framing error, such as whether the framing CRC did not match or an STP token was received when it was not expected, along with about 20 other possible reasons for a framing error, as you can see here on this slide. Our tool would have shared the exact information about why this framing error occurred, and also would have suggested disabling the transition to recovery on framing errors to check whether the link is able to sustain. As you can see, the framing error counter would be set to one, indicating that the framing error occurred once, and we would get a root-cause analysis saying that a framing error indicates poor link quality and that we need to investigate the PHY and the system-level factors affecting the link quality. It also tells us the type of framing error received; in this case, a PHY status error was detected while processing an SKP OS. The suggestion given by the tool is that, for debug purposes, bit 16 of SD_CONTROL2 can be set to disable the transition to recovery due to framing errors.

Next we move to broken lanes. After receiver detection is completed, the LTSSM goes through the Polling and Configuration states and then Recovery before reaching the L0 state at Gen1 rate. But if some of the lanes are broken after receiver detection, the link may not reach L0 in the desired width. In that case the lane detection bits will show 1 for all lanes, but RX valid will be 0 for the lanes that are broken. Here our root-cause analysis points to RX valid: since lanes 2 and 3 show RX not valid, they are probably broken lanes.

Another very important benefit is that this tool can be used in the pre-silicon and emulation phases, where there is no way to connect analyzers. The future scope involves trying to upstream this file into the PCI Utils open-source project so that everyone can benefit from it. Also, in the future we could have a common error capability rather than vendor-specific ones, so that this tool can become a common utility usable across all vendors and all architectures. Now, to conclude: a lot of the time SoC users come to us with PCI link issues, and we have to tell them, please dump such-and-such registers, please tell us the value of this register, and then, based on the results we receive, we ask them to try holding the LTSSM state or increasing a timeout, and so on.
In this long process we often miss mentioning some information, or sometimes they capture the wrong data, which leads to all sorts of confusion. All of that is avoided now: all we need to say is, please share the diagnostic report. Thank you, and I am available in the chat box for any questions that you have.