Greetings everyone. Welcome to this presentation on PCIe performance and tuning parameters. This is Patan Mohan, working as a system design engineer in the Foundry Design Services team of Samsung Semiconductor, along with my colleagues and co-authors Mr. Pankaj and Mr. Dhaathya. I will be presenting all the data, and the co-authors and I will address your queries at the end of the presentation. Let's go to the next slide.

In the next slide, we have the agenda detailing the upcoming contents. The presentation starts with an introduction which specifies the need or motivation for this work, showcasing how PCIe has a much wider bandwidth when compared to other comparable protocols and why the PCIe bandwidth needs to be utilized to the maximum extent possible. Following the motivation slide are the factors or parameters which affect the performance of PCIe-based systems, where each factor has been identified and segregated based on its tuning flexibility. The next topic is an in-depth analysis of some of the important factors which show a significant impact when tuned, such as the PCIe link speed and width, the maximum payload size, the maximum read request size, and the overheads due to ECRC and the encoding schemes. The latencies due to ASPM and some link-level mechanisms such as scaled flow control and the maximum outbound non-posted requests are also considered here. Finally, the presentation concludes with how these factors help improve the performance of a PCIe-incorporated system and how the analysis can be extended. Let's dive into the contents.

As many of you are aware, PCIe is a high-speed interconnect favored by most SoC-based systems for high-speed data transfers between subsystems. The PCIe protocol, which was first introduced as a local bus standard, has become a standard bus specification which improves the performance of high-performance computing and automotive applications. PCIe is now not restricted only to computing applications but is also used in many non-computing applications. The PCI-SIG devises the protocol to support future enhancements by making the IP amenable to speed upgrades and better encoding techniques. Though the PCIe speeds are tremendous when considered standalone, things become different when the IP is incorporated into an SoC. Many factors, including link width, payload size, data path latency, and TX/RX buffer sizes, play a crucial role in determining the performance of the PCIe-incorporated system. These factors need to be considered when designing a PCIe-based system which targets either high-performance or power-saving use cases. Some of the tuning factors are flexible in such a way that they can be tuned even during the system's runtime: if there is a user-space application or software running on top of the system, the software can tweak them during runtime operation and thereby switch to a low-power mode or a high-performance mode. This enhances the system's flexibility, allowing it to switch between power and performance on a need basis. Therefore, the SoC PCIe designer has to understand the trade-off between the cost and the performance of the system, which can be tweaked to align with the system's use case.
The PCIe unit has to consider the accessibility of tuning these factors and provide hardware registers or software hooks so that user-space applications or end users can switch the system into a performance or power-saving mode. This presentation explores the tuning factors available in a PCIe-incorporated system. The presentation utilizes an SoC which has a Synopsys DesignWare PCIe controller. We can have a look into the tuning factors in the upcoming slides. Let's move on to the next slide.

This slide shows the list of factors segregated based on their affinity to the PCIe present in the system. We have segregated them into the protocol level, the OS level, and the IP configuration level. At the protocol level we have the ASPM latencies, which consider how entering and exiting the ASPM states during heavy workloads affects the performance, followed by the ECRC and encoding schemes, which detail the overhead of CRC and encoding techniques on the data being transmitted on a particular PCIe generation bus. At the operating system level we have the operating frequency of the SoC. This operating frequency can cause propagation delay from the application to the PCIe buffers, thereby impacting the performance of the PCIe-based system. The other factor is the ability of the processor to multi-thread, where different processor threads can access the PCIe port in parallel; improving the multi-threading also improves the PCIe performance of a particular system. Next, moving to the PCIe IP configuration level, we have the maximum payload size and the maximum read request size, highlighting the relationship between the PCIe buffer size and performance. The link speed and the link width are also considered, which show the correlation between the number of lanes and the transmission speed on each lane. And finally, the scaled flow control and the maximum outbound non-posted requests are also considered. In the upcoming slides we will discuss some of these parameters and their usability: whether they can be configured at runtime or only at design time, and whether they give a significant boost to the performance or power-saving features of the system. Let's move to the next slide.

This slide gives a glimpse of the test bench which we have used to capture the throughput data by tuning the hardware registers. Our test bench consists of a host processor which has a PCIe DesignWare controller, which for this specific purpose has been configured in root complex mode. On the other side we have an ARM64-based SoC which has a PCIe DesignWare controller configured in endpoint mode. Both the endpoint and the root complex are Gen4 capable and can link up with a maximum link width of x4. The test bench utilizes almost all the lanes available and then performs the tests. We have used custom-tweaked Linux drivers, commonly available user-space utilities, and bare-metal software to access the tuning registers and capture the throughput data. The software test bench plays a major role in capturing the time taken for transferring data between the controllers. The test bench utilizes the embedded PCIe DMA controller to transfer the data. The time interval between the software triggering the DMA and the done notification from the DMA is captured, and the transfer rate is calculated based on the size of the data being transmitted.
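To restate what the test bench computes (my paraphrase of the method just described, not a formula from the slides): if t_trigger is the instant the software starts the embedded DMA and t_done is the instant the done notification is observed, then

    transfer rate = data size / (t_done - t_trigger)

so, for example, a 2 MB transfer completing in 1 millisecond would be reported as roughly 2 GB/s.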
This experiment is repeated for different data sizes ranging from 1 byte up to 2 megabytes. We brought out all the data from the test bench, and graphs have been plotted from the captured data, with multiple tuning parameters being tuned at the different test levels, which we will be looking at in the upcoming slides.

The first factor which we are about to look into is the encoding variant. In order to maintain the DC balance on the transmission lines, PCIe uses encoding schemes, and these encoding schemes vary across the generations. Let's take an example here. For PCIe Gen1 and Gen2, the protocol follows the 8b/10b encoding scheme. This means an additional two bits of overhead are added for every eight bits of PCIe payload, which is roughly 20 percent. Ideally, if we calculate the bandwidth for Gen1-based systems, the actual bandwidth corresponds to only 80 percent of the protocol-specified transfer rate, which roughly translates to 250 megabytes per second on a single lane. It is a similar case for Gen2 also, which is 500 megabytes per second. However, this overhead gets minimized in Gen3 and Gen4 based systems, where 128b/130b encoding is followed. This reduces the throughput overhead from 20 percent to a mere 1.5 percent, thereby doubling the practical bandwidth from Gen2 to Gen3. Unlike the other factors, the encoding overhead is protocol-specified and must be followed for proper working of PCIe; it cannot be tuned for better performance. However, the designer can still decide on the generation based on the use case and the size of the transfers. Let's say the use case involves using the PCIe link for graphics applications, which can effectively use most of the PCIe bandwidth: such applications should opt for higher PCIe speeds, thereby transferring more data in a single transfer. Any other application which involves only a few or smaller transfers can settle for lower generation speeds. Let's go to the next slide.

The next factor which we are about to look into is the link speed and the link width. Usually the PCIe width is specified as the number of lanes on which the PCIe root complex and the endpoint can successfully link up and perform data transfers in parallel. The PCIe speed is the actual rate at which the data is transmitted. As per the protocol, the link speed determines the number of PCIe transfers possible, and it is measured in gigatransfers per second. If we calculate the PCIe bandwidth for a single lane in a single direction, it is as specified in the table here: for Gen1 it is 2.5 GT/s, which is 250 megabytes per second; for Gen2 it is 5 GT/s and 500 megabytes per second; for Gen3 it is 8 GT/s and 1 gigabyte per second; and for Gen4 it is 16 GT/s and 2 gigabytes per second. The maximum achieved in our test bench is Gen4 speed, which is 2 gigabytes per second per lane, though we do have PCIe Gen5 and Gen6, which are two and four times the Gen4 rate. However, in order to calculate the actual bandwidth, the designer has to consider the efficiency of the data being transferred, the number of lanes, and the encoding overhead as well. So the actual bandwidth will be the number of lanes, multiplied by the data rate on each lane, multiplied by the encoding efficiency and the transfer efficiency.
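As a rough worked example of that relationship (my own arithmetic, not a number from the slides), the one-direction effective bandwidth can be estimated as

    effective bandwidth ≈ number of lanes × line rate per lane × encoding efficiency × transfer efficiency

For a Gen4 x4 link, ignoring protocol (TLP/DLLP) overhead: 4 × 16 GT/s × (128/130) ÷ 8 bits ≈ 7.9 GB/s, i.e. roughly 2 GB/s per lane, which lines up with the per-lane figure quoted in the table.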
So we were able to capture the performance impact due to link speed and link width changes in our test bench. We captured the performance data for speeds ranging from Gen1 to Gen4, where the bandwidth has been captured for all the data sizes transferred, and it is plotted. In all the test cases, the MPS and the MRRS, which are the max payload size and the max read request size, have been kept at the maximum values possible, which are 512 bytes for MPS and 4,096 bytes for MRRS in our test bench. To perform the data transfer between the link partners, we have used the PCIe internal DMA, so that it performs block transfers and also provides an efficient way to capture the read and write rates. The effective bandwidth has been captured for data sizes ranging from 1 byte to 2 megabytes transferred on the PCIe bus for all four speeds and multiple link widths, say x2 and x4, and the plots have been made from the captured data. These plots will be analyzed in the upcoming slides.

Before getting into the plots, let's look at the prototype code with which we were able to dynamically switch between the link speeds and link widths without performing a hard reset or a soft reset. The controller which we have used provides features such as speed configuration and directed link speed and link width change, available in the form of hardware registers, and we have used custom-tweaked bare-metal code to configure these values through software (a hedged equivalent using the standard config-space registers is sketched just below). This reference code will be helpful for application designers to switch between the performance and power-saving features based on their need. It also allows the user to configure the link speed and the link width dynamically, so that any configuration-related changes or any broken links can also be fixed with this method.
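The slide's code itself is not reproduced in this transcript, so here is a minimal sketch of the same idea using only the architecturally defined PCIe capability registers and setpci. The BDF is a hypothetical placeholder, and note that link width has no standard config-space knob, so the width changes in the talk went through the controller-specific DesignWare registers instead.

    # Hedged sketch (not the talk's bare-metal code): retarget the link speed and retrain.
    RP=0000:00:01.0    # hypothetical BDF of the downstream (root) port that owns the link

    # Target Link Speed = bits 3:0 of Link Control 2 (CAP_EXP+0x30).
    # 1 = 2.5 GT/s (Gen1), 2 = 5 GT/s, 3 = 8 GT/s, 4 = 16 GT/s (Gen4).
    TARGET=1
    LNKCTL2=$(setpci -s $RP CAP_EXP+0x30.w)
    setpci -s $RP CAP_EXP+0x30.w=$(printf '%04x' $(( (0x$LNKCTL2 & ~0xF) | TARGET )))

    # Retrain Link = bit 5 of Link Control (CAP_EXP+0x10), valid on the downstream port.
    LNKCTL=$(setpci -s $RP CAP_EXP+0x10.w)
    setpci -s $RP CAP_EXP+0x10.w=$(printf '%04x' $(( 0x$LNKCTL | 0x20 )))

    # Verify the negotiated speed/width from Link Status (CAP_EXP+0x12) or lspci -vv.
    setpci -s $RP CAP_EXP+0x12.w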
Let's go to the next slide. Based on the dynamic link speed and width change, we were able to capture the performance data for all capable speeds in the test bench, ranging from Gen1 to Gen4. The MPS and MRRS were kept at the maximum values possible, which are 512 and 4,096 bytes, and we have used the DMA to capture the read and write rates. On the left side you can see the graph plotted for the read transfers, and on the right side we have the data captured for the write transfers. The effective bandwidth has been captured for data sizes ranging from 1 byte to 2 megabytes for all four speeds, and it has been plotted here. From the read and write plots, we can understand that for sizes ranging from 1 byte up to 1 kilobyte, whatever the generation is, the performance is almost similar. So the user or the developer has to understand that even though we are sending the same data on the PCIe bus, we still see similar performance on all the generations. This shows that the user can switch to a lower generation, or opt for a dynamic link downgrade, whenever the data being transmitted is less than 1 KB. For sizes ranging from 1 KB to 1 megabyte, the link speed kicks in and the performance improves in line with the protocol-specified transfer rate. For each speed from Gen1 through Gen4, the performance is directly proportional to the data rate at which it is transmitted, which is seen in the graphs for the read as well as the write case. For sizes greater than 2 MB, the performance saturates, since the PCIe maximum bandwidth is occupied by the large data sizes and it settles at the maximum possible data rate for the generation. So the developer or the designer has to understand these facts and design their application based on the use case: whether to switch to a low-power mode with lower generation speeds, or to run high-performance applications at the highest possible speed. Let's move to the next slide.

Similar to the previous experiment for the link speed, we have also captured data for the link width, where the link width has been kept at x2 and x4 and the link speed has been kept at the maximum value possible, which is Gen4. Similar to the link speed results, we see the same behavior here: for data sizes ranging from 1 byte up to 1 kilobyte, the performance does not scale with the width; it is similar for the x2 and the x4 links. The performance improves as the size increases from 1 kilobyte to 1 megabyte and similarly saturates for sizes greater than 2 megabytes. This is due to the fact that the PCIe maximum bandwidth is being utilized by the data being transmitted, and it is proportional to the corresponding data rate for the particular generation. So if the transfer size is less than 1 KB, the user can switch to a smaller link width in order to save power. This would be very useful for low-power applications. Let's go to the next slide.

The next slide talks about the max payload size and the max read request size. The maximum payload size specifies the largest payload the device can support for a single transfer. This usually depends on the least value supported by all the link partners in the PCIe tree, that is, the weakest link in the tree. The MRRS defines the maximum size which a link partner can request in a single read transfer. Based on the MRRS, the link partner requesting the read will prepare buffers for the incoming response. The smaller the MPS and MRRS, the more time it takes to write the data into the link partner. And specifically, the smaller the MRRS, the more pending read requests there will be on the RC side when we are transferring data sizes greater than the PCIe maximum transmissible unit. This is not specific to the RC side; it actually depends on the link partner which transmits the data, which we can investigate in the upcoming slides.

Similar to the dynamic link speed and link width application, we have used a similar test bench here, where we capture the performance data for different MPS and MRRS values. In order to achieve this, we are not performing any software reset or hardware reset. Instead, we use a user-space application in a Linux environment: the application named setpci changes the MPS and MRRS values of the link partners on either side and triggers the link retrain. After the link retrain, the MPS and MRRS values are changed to the corresponding values which we are trying to program in our test bench.
This script will be helpful for users to understand how switching between the lowest MPS and the highest MPS is possible without doing a software reset or a hardware reset. The user can also switch the MPS and the MRRS based on the link partner with which they are trying to communicate. In that case, the MPS and MRRS will not be limited by the weakest link in the tree; they can be changed dynamically per link, up to each link's maximum capability. The script reference provided in the upcoming slide shows how the device control field is modified to change the MPS and MRRS values. Finally, the performance data has been captured by transmitting different data sizes for different MPS and MRRS values, with the speed kept at Gen4 and the width at x4 in the test bench. We will then look into the analysis in the upcoming slides.

So, similar to the dynamic link speed and link width modification done through the bare-metal and custom driver code, we were able to use a commonly available Linux user-space application called setpci. It dynamically changes the MPS and MRRS for the link partners and triggers the link retraining. If you look at the reference pseudo code provided here, the script uses setpci to change the device control field, and it changes the MPS and MRRS values for the root complex as well as the endpoint device in the same script. So the script changes the root complex device control field, and then from the root complex it configures the endpoint device control field as well, setting the MPS and MRRS values desired for the test we are running. Since we are changing the MPS and MRRS dynamically, this script is run for all the cases, say for MPS changes from 128 up to 512 bytes and for MRRS changes from 128 up to 4,096 bytes. After configuring the device control field, the script triggers a link retraining so that the PCIe state machine gets configured with the new MPS and MRRS values immediately. In the script provided here, the device control field has been specified at offset 78, which corresponds to our test bench and the controller which we have used; this offset will vary for other systems and other PCIe controllers. Based on this, the user can identify the device control field in their system and use a similar script to make the changes accordingly (a hedged re-creation using the standard register names is sketched just below). We can see how this impacts the PCIe performance in the upcoming slides.
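The exact script from the slide is not reproduced here, so the following is a hedged re-creation of the same flow using the architecturally defined Device Control register (CAP_EXP+0x08) rather than the raw offset 78 quoted for this particular controller; the BDFs and the chosen MPS/MRRS codes are placeholders.

    # Hedged sketch (not the authors' script): set MPS/MRRS on both link partners, then retrain.
    RC=0000:00:00.0        # hypothetical root complex / root port BDF
    EP=0000:01:00.0        # hypothetical endpoint BDF

    MPS_CODE=2             # Device Control bits 7:5  : 0=128B, 1=256B, 2=512B
    MRRS_CODE=5            # Device Control bits 14:12: 0=128B ... 5=4096B

    for DEV in $RC $EP; do
        CUR=$(setpci -s $DEV CAP_EXP+0x08.w)
        NEW=$(( (0x$CUR & ~0x70E0) | (MPS_CODE << 5) | (MRRS_CODE << 12) ))
        setpci -s $DEV CAP_EXP+0x08.w=$(printf '%04x' $NEW)
    done

    # The talk mentions triggering a link retrain afterwards; on the downstream (root)
    # port that is bit 5 of Link Control (CAP_EXP+0x10).
    LNKCTL=$(setpci -s $RC CAP_EXP+0x10.w)
    setpci -s $RC CAP_EXP+0x10.w=$(printf '%04x' $(( 0x$LNKCTL | 0x20 )))

    # Verify with: lspci -vv -s $EP | grep -E 'MaxPayload|MaxReadReq'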
So we were able to capture the performance data for different values of MPS and MRRS at Gen4 speed and x4 width in our test bench. For the initial experiment, we kept the MRRS at the maximum value possible, which is 4,096 bytes in our test bench. For the read as well as the write case, we varied the MPS from 128 up to 512 bytes for transfers via the PCIe internal DMA. In both the read and the write case, the plot shows similar performance for data sizes up to 16 kilobytes, which shows that for sizes from 1 byte up to 16 kilobytes, whatever data is transmitted, the performance is one and the same for all the MPS values. After 16 kilobytes, that is, for sizes greater than 16 kilobytes, the performance improves with higher maximum payload size: the higher the maximum payload size, the better the performance, which is evident from all the graphs we have plotted. The performance improves for the maximum payload size of 512 bytes and saturates at the maximum protocol-specified bandwidth. So if the user wants to occupy the complete bandwidth and use the complete PCIe bus for the data transfer throughout the process, they can switch the maximum payload size to the maximum value for the corresponding link and perform the read or write transfers accordingly. Therefore, it is recommended to switch to the maximum possible value for better performance.

A similar experiment was done keeping the MPS at the maximum possible value, which is 512 bytes. The MRRS is now switched from 128 bytes up to 4,096 bytes, and the plot has been captured for the different data sizes transmitted on a Gen4 x4 link. On the DMA read side, the performance improves as the MRRS is increased from 128 to 512, which is evident from the graph: after 1 kilobyte of data, the performance improves with the MRRS value, and the higher the MRRS value, the better the performance. However, we can see from the graph that beyond the 512-byte mark of the MRRS, the performance remains one and the same. This is due to the fact that the maximum payload size is still 512 bytes, which keeps the bandwidth capped beyond 512 bytes. On the DMA write side, however, keeping the MPS fixed and increasing the MRRS to the maximum possible value does not give any performance improvement. This shows that the maximum read request size comes into play only for reads and not for writes, which is evident from the experiment done in our test bench. Let's move to the next one.

The next factor which we are about to look into is the end-to-end CRC (ECRC) overhead. This is an optional feature provided in the PCIe specification, which improves upon the AER feature of PCIe. ECRC makes the whole transaction reliable up to the transaction layer, since the LCRC is recalculated for the packet at every egress port of each intermediate element. Every TLP carrying ECRC has it appended at the end, and it is 4 bytes. This adds an additional overhead to the performance of PCIe. Therefore, if the designer is confident that the environment in which the PCIe link operates is stable, they can skip enabling ECRC. The use of ECRC is debatable, as solutions like a network interface card which utilizes the PCIe interface occupy larger PCIe bandwidth for longer durations, and this might lead to performance problems. But the user should also consider using the DMA when ECRC is enabled, where there could be a significant improvement when the DMA is used with ECRC enabled, because the DMA performs block data transfers and the 4-byte ECRC gets appended per block in this case. We can analyze our test bench, which has the DMA enabled, and see whether enabling or disabling ECRC impacts the performance or not. Let's move to the next slide.
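Before the results on the next slide, here is a hedged sketch of one way ECRC can be toggled from Linux user space, through the AER extended capability's "AER Capabilities and Control" register; this is not from the slides, it assumes a pciutils build that resolves the ECAP_AER name (otherwise the raw extended-capability offset found via lspci -vv must be used), and the BDF is a placeholder.

    # Hedged sketch: ECRC Generation Enable is bit 6 and ECRC Check Enable is bit 8
    # of the register at ECAP_AER+0x18.
    DEV=0000:01:00.0

    CUR=$(setpci -s $DEV ECAP_AER+0x18.l)
    ENABLE=$(printf '%08x' $(( 0x$CUR |  0x140 )))   # set both enable bits
    DISABLE=$(printf '%08x' $(( 0x$CUR & ~0x140 )))  # clear both enable bits

    setpci -s $DEV ECAP_AER+0x18.l=$ENABLE           # or =$DISABLE to turn ECRC off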
So, in order to identify whether the ECRC overhead affects the performance in our test bench where the DMA is enabled, we have captured the time taken to transfer data sizes ranging from 1 byte up to 2 megabytes at Gen4 x4. Similar to the previous test cases, we kept the maximum payload size at the maximum value, which is 512 bytes, the maximum read request size at the maximum value, which is 4,096 bytes, and the DMA has been used to perform the read as well as the write transfers. From the read as well as the write graphs, we can see that ECRC enablement or disablement shows no significant difference in either case, as we have used DMA-based transfers for both read and write, and the ECRC is calculated for the entire block. The ECRC is appended at the end of each block, where the block size is large, equal to the maximum payload size being transmitted. Against 512 bytes, the 4 bytes of ECRC is a small overhead, and disabling it does not provide any notable performance improvement, which is evident from the graphs. The ECRC overhead will be observed significantly in CPU-based transfers, as the burst size of the CPU is small compared to the DMA. For instance, if the burst size of the CPU is 4 bytes, appending the additional 4 bytes of ECRC occupies 50% of the bandwidth we are trying to utilize. So the user can switch to DMA-based transfers if they want to enable ECRC, considering that the PCIe channel on which the data is transmitted is not reliable enough. This will improve both the performance and the reliability of the system, where enabling ECRC provides protection up to the transaction layer.

The next factor in our list is the ASPM exit latency. ASPM, which is part of the PCIe protocol specification, allows the PCIe controller to enter low-power states such as L0s, L1, and L2. Each of these provides its own power-saving mechanism, where the clocks and particular power domains get disabled in a stepwise manner for the different levels L0s, L1, and L2. Though this feature offers more power saving, it also impacts the PCIe bandwidth, as there are latencies when exiting the ASPM states. In order to perform a data transfer on the bus, the ASPM exit has to happen first, and only then will the data transfer go through on the bus successfully. In the earlier use case of a network interface card which always keeps the PCIe bus busy, the performance will decrease due to L1 exit latencies when ASPM is enabled. Usually for the controllers, the default exit latency and entry latency values are made hardware configurable, and in our test bench we can see it is configured to a value of 64 microseconds by default. In our test bench, this is a hardware configuration which can be modified only during the IP design and not at runtime by software.
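The exit-latency value itself is a design-time hardware parameter in this test bench, but whether ASPM is allowed to engage at all is controllable from software through the standard Link Control register; a hedged sketch follows, with a hypothetical BDF and not code from the slides.

    # ASPM Control is bits 1:0 of Link Control (CAP_EXP+0x10):
    # 00 = disabled, 01 = L0s, 10 = L1, 11 = L0s and L1.
    DEV=0000:01:00.0

    CUR=$(setpci -s $DEV CAP_EXP+0x10.w)
    setpci -s $DEV CAP_EXP+0x10.w=$(printf '%04x' $(( 0x$CUR & ~0x3 )))            # disable ASPM
    # setpci -s $DEV CAP_EXP+0x10.w=$(printf '%04x' $(( (0x$CUR & ~0x3) | 0x2 )))  # allow L1 only

On Linux, the global policy can also be steered through /sys/module/pcie_aspm/parameters/policy (e.g. performance vs powersave), which is usually safer than writing Link Control directly.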
Based on the analysis which we are about to see in the upcoming slide, the user can decide whether to keep this parameter as a hardware-configurable parameter or to make it software configurable, where the software can control the entry and exit latencies and design the system accordingly, so that if the system definitely requires ASPM, it can switch to a value lower than 64 microseconds. We can see how ASPM impacts the performance of PCIe in the upcoming slides.

In the test bench, we have kept the maximum payload size at 512 bytes and the maximum read request size at 4,096 bytes. A Gen4 x4 link has been used for reference, where we transmitted data of sizes ranging from 1 byte up to 2 megabytes; for all the transfers the bandwidth has been captured and plotted here. We can see that the performance improves by a considerable margin between the ASPM-enabled and ASPM-disabled cases for data sizes ranging from 1 byte up to 1 megabyte. From the graph, the improvement remains significant even for sizes approaching 1 megabyte, and the two cases show similar performance after 2 megabytes, which proves that once the 2-megabyte size is hit, it does not matter whether ASPM is enabled or not; the bandwidth gets saturated. So if the user is trying to perform data transfers of sizes greater than 2 MB on the PCIe bus continuously, enabling or disabling ASPM will have no impact. If the data size being transmitted is small, then the user can decide whether ASPM should be enabled or disabled from the user-space application, based on the requirements.

So we are getting to the last slide here. The last parameters we are about to look into are the scaled flow control and the maximum outbound non-posted requests. These come into the picture only for PCIe Gen4 speeds and above and are not considered for the speeds below, which are Gen1, Gen2, and Gen3. As we have a limitation of 127 header credits and 2,047 data credits in the previous generations, the PCIe link performance gets affected due to insufficient credits to account for the round-trip time. To address this, the scaled flow control mechanism has been introduced in Gen4, which scales the maximum outstanding header and data credits by a factor of 1, 4, or 16. The DesignWare controller which we have used provides a hardware configuration to enable this feature in the data link feature capability, which is shared between the link partners. The user who is considering enabling or disabling this feature can decide whether this value should be hardware configurable or software configurable. Our test bench had this register configured to the enabled state during the hardware configuration stage itself, so all the experiments done until now have had scaled flow control enabled.

The next factor, the maximum outbound non-posted requests, specifies the maximum number of simultaneous outbound PCIe non-posted requests in total across all functions. Based on the non-posted requests, the SoC incorporating the PCIe can size the completion lookup table used to identify the completion time. It also helps in configuring the completion header queue RAM when configuring completions in store-and-forward or cut-through modes. Similar to the scaled flow control, this is also a hardware configuration in our test bench.
The user or the end application which intends to use this parameter as a dynamically configurable value can decide, during the hardware configuration phase itself, whether to enable or optimize this value, or whether to provide control hooks to the software so that the software can enable or disable it at runtime based on the usage requirements.

We have reached the conclusion of this presentation. We have explained some of the useful tunable registers such as the max payload size, the max read request size, the ECRC overhead, the ASPM latency, and the link speed and the link width. These can be tweaked in any way possible to achieve the desired result on a particular system which uses PCIe for communication. We can also see that tuning these registers shows measurable improvement and gives an in-depth understanding of hindrances such as finding the optimal packet size, the buffer size, the exit latencies, and so on. Since the experiment was done on an ARM64-based platform, the future scope of this research can be extended to other architectures such as x86, which will identify whether the CPU pipeline causes bottlenecks and whether the multi-threading or the operating frequency of the SoC is causing a bottleneck for PCIe or not. Thank you all for attending the presentation.