So, hi again. My name is Mansi Naware, and I work on Intel's i915 graphics kernel driver team. I've been on that team for about two and a half years, and today I'm going to talk about my very first project on the team, which was DisplayPort compliance. From the user's point of view, it meant fixing the black screens with DisplayPort, which maybe happened just now. I don't know. It shouldn't have, because I have the latest kernel with the fixes. So, has anybody else experienced this before with DisplayPort or Mini DisplayPort, just a black screen? Okay, good. So, I guess you can relate to the problem I'm talking about here. When I joined the team, I think almost 97 percent of the freedesktop bugs that we had on the display side were because of black screens or hot plug issues: you connect multiple monitors and there's just no display. So, the goal of my project was to make the DisplayPort driver DP compliant and upstream the solution. So, let's start with the basics. What happens when you connect a DP cable? You have the DP source on one end, which is your PC, and the DP monitor on the other end. When you connect them with the DisplayPort cable, the first thing that happens is the hot plug detect signal that the sink device sends to the source device. It's just an interrupt signal saying, hey, there's a new connection. After that, the source is going to initiate DisplayPort Configuration Data (DPCD) register reads and writes, through which it starts reading the capabilities supported by the monitor. It will learn the resolutions supported and the link parameters supported by the sink device. Once it has negotiated these parameters between the source and the sink device, then it's ready to encode the data and start sending it on the actual cable.
The very first thing that happens when you connect the DisplayPort cable is this negotiation sequence between the source and the sink device, which is called DisplayPort link training. So, yeah, the first signal is the hot plug detect signal, then the source will start the link training. First, it has to go through the clock recovery sequence, where it sends known training data patterns to lock the clocking information on the DisplayPort receiver end. Then it's going to send another known training pattern sequence, but this time it sends it with a known skew between the lanes. This data is used on the other end to find the inter-lane alignment and get the mapping of how the data symbols are sent across the lanes on the DisplayPort cable. After these two sequences succeed, the source and the sink device lock in a particular link rate and lane count, and then they can start sending the data at that specific link rate and lane count. We say that the link is ready, and that's when it will start sending the data and you see the display on the monitor. So, what was our plan for testing DisplayPort compliance? The plan was to use a third-party device, the DPR-120, which is certified by VESA, to run the exhaustive compliance test suite, which makes sure that the driver is compliant with the specification. It basically acts as a reference sink device. You connect the device under test, the laptop, to the DisplayPort input of the DPR-120, and the DPR-120 is going to say, okay, display this specific data pattern at this resolution. The laptop starts sending the data, the DPR-120 taps the information on the DisplayPort cable, and it compares that against the CRC values of the reference data patterns, and if it matches, then yes, it passes the test. So, a lot of these tests try to verify that link training is going to pass in different scenarios.
For example, if the first phase, clock recovery, fails, then the DPR-120 is actually going to induce that failure and see if the device can recover from it and still send the data after the number of retries specified by the DisplayPort spec. So, the goal was to run all these tests and make sure that the driver was actually passing all those link training tests and was able to recover from the failures. So, going back to the basics: what was the existing state of atomic kernel mode setting? What does it do when you connect the DisplayPort cable? The user space forms the list of parameters, the properties, and sends that to the kernel. The first step the kernel does is form the state of the device; it forms the state for the different DRM mode objects depending on the requested mode. This is the atomic check phase, and in this phase, it's going to try and validate the mode that is being requested. So, in this phase it will, let's say, get the 4K mode, and it's going to see whether that mode can be supported by the hardware, by the available clock, by the link parameters of the DisplayPort cable. If it can actually support that resolution, then it will go to the commit phase. In this phase, it's going to take all the data and write it to the hardware registers. Since this is the phase where it does all the writes, the hardware update, this is where the link training is going to happen, because it has to actually send the data symbols. That's when you're going to start seeing the display on the monitor. So, I thought, okay, the atomic mode setting is validating the state in the atomic check phase and is then going to write everything to the hardware. Because it's already validated, everything should just work fine. So, I thought, okay, yes, the problem is all solved, but did we actually fix the black screens at this point?
So, yeah, I ran the test suite, and sure enough, when the DPR-120 tried to induce link training failures, the driver was not able to recover from them, and it was just a black screen. So, let's see what the problem was. Just a simple scenario again: you have the sink device connected to the source device. It sends the hot plug detect signal; let's say it requests a mode at 60 hertz. In the first phase, the kernel is going to set up the CRTC, set up the pipe for the specific configuration. It will start with the optimum link rate and lane count at this point, see if that supports the requested mode, and configure the pipe for that specific link rate and lane count, but it hasn't actually sent any symbols on the cable. So, it hasn't validated whether it's going to be able to send those symbols on the cable, whether it's going to work or not. That's going to happen in the commit phase, where it actually sends those training pattern symbols and goes through clock recovery and channel equalization. What happens if at that point the link training fails? That's when you see a black screen. The existing state of the driver at that time: it would just get a black screen, and the dmesg logs would just say, okay, error, link training failed. But there was literally no information going back to the user space, and the kernel wasn't doing anything to recover from this mode failure, and that's when we were getting those black screens. So, what did we do? How was this fixed in the driver, to actually be able to recover from such failures, which happen only at the last stage of the atomic kernel mode setting? So, we were here: it requested a specific mode, and let's say first it tried to link train at the maximum link rate and lane count, 5.4 gigabits per second and four lanes, and that failed. Once that failed, we introduced a new property for the DRM connector, which would indicate the status of that specific link.
So, at that point, the kernel knows that link training failed at 5.4 gigabits per second, four lanes. It's going to immediately fall back to the next lower link rate, which is HBR, 2.7 gigabits per second, at four lanes. So, first it will fall back on the link rate and try again, and it keeps doing that until the link training succeeds. As it falls back, it's going to set the link status property to bad and then send a uevent, a notification to the user space, saying that, okay, something went wrong in the configuration and that specific requested mode did not work. At that point, user space retries the mode set. It retries the mode set at the same resolution first, but it might happen that, because we fell back to lower link rate and lane count values, the kernel prunes that mode. So, at that point, the user space has to get the connector information again, get the new modes, and then try the mode set at the next available, next lower resolution. This time, the kernel gets the mode set request and goes through the same phases, but now it's training at the lower values, and the link training is hopefully going to succeed, and we get a successful mode set. This actually happens for all the combinations: it goes through the link rates all the way down to RBR, 1.62 gigabits per second. If that doesn't succeed, then it starts reducing the lane count, and it goes all the way down to one lane. So, it basically tries all the combinations and keeps reducing the resolution until it can show something on the screen, which, even if it's a really small resolution, is better than a black screen. So, why was this needed? The basic loophole in atomic kernel mode setting was the assumption that failure was not an option: it was assumed that the atomic commit phase would always pass and the mode was always guaranteed at that point.
Atomic check does guarantee the requested mode, yes, but at that point it's only checking against the GPU parameters, not the actual physical cable. With link training, we are dealing with the actual hardware, so link training can still fail, and the atomic commit can still fail. We do need to handle this case. This link training failure is also asynchronous: the link might be working, up and transmitting symbols, but it can fail at some point while it's up and running, displaying something. So, you need a way to asynchronously send a notification to the user space any time the kernel detects that the link is not working correctly. It also helps because atomic allows non-blocking commits. What that means is it's going to do the mode set and return control to the user space, but we don't know whether the atomic commit has completed. The user space still doesn't know if it has successfully been able to display something, whether the atomic commit phase was successful. In that case, too, we need some way to asynchronously notify the user space. So, this is how we basically tested the entire stack and made sure that the newly introduced property and this way of handling atomic failures was actually validated through the entire stack, and the whole stack was able to recover from the mode set failure. We used the DPR-120 device to induce failures in the link training. First it sends the long pulse and requests a specific mode. The kernel at that point is going to validate the mode and link train at a certain rate. Once that fails, it's going to fall back, set the link status property to bad, and send the uevent. We made changes to both xf86-video-intel and the modesetting driver to actually handle this newly introduced property.
So, it keeps looking at the link status property, and as soon as it sees that it's bad, it's going to request a new mode set at the same resolution. But like I said, that resolution might actually get pruned, because now the link parameters have changed and it might not support that same resolution. So, it sends the XRandR event up to the desktop environment, and Martin actually wrote this autorandr app to constantly listen to those RandR events, so every time it gets that event, it's going to reprobe the connector, get the mode information, and then redo the mode set at the next available resolution. So, this was the way we were able to test it across the stack. All the changes are upstreamed for i915 as well as the X server, but I still need to connect with the GNOME and KDE folks to make these changes in the desktop environments, because otherwise the whole stack is not compliant at this time. I also wrote a tool; it's upstreamed in Intel GPU Tools, well, IGT GPU Tools now. It's a DP compliance tool for fully automating this compliance testing. It does require having the DPR-120 connected to the DUT, but after that it just handles the test requests from the DPR-120, runs the entire suite, and keeps a log of which tests passed and which failed. So, that was a good way for us to do our pre-merge and post-merge testing. But one of the future steps for us is to move this whole testing infrastructure to open-source hardware, Google's Chameleon board. With that, the idea is we will be able to test a lot more corner cases, with different types of external displays. So, we will replace the DPR-120 with the Google Chameleon board.
We would have to move the entire compliance test suite onto the Chameleon board, which is going to be a big effort, because it has to be certified by VESA to say that, okay, this is a reliable test suite you can run to claim the driver is DisplayPort compliant. Then eventually, the goal is, once we are able to move everything to the Chameleon boards, we can have that as part of our CI system, so that for every laptop with a DisplayPort we have this connected and we always run the compliance suite. So, any patch that gets submitted will run against the suite, and we'll make sure that it's not breaking the link training. So, that's all I had, and this was definitely not a one-person project. I wouldn't have been able to finish this without help from the community, from reviews from everyone. So, yeah, thanks to everyone for the reviews and suggestions. Thank you. Any questions? [Question: which Linux version has the fixes?] Oh, it's in 4.12, and in the X server, yeah. [Question about Wayland support.] So, it is not there on Wayland; that's one of the goals, to actually scale the solution to Wayland and even to Android's SurfaceFlinger. Until then, the property is going to be there, but if the user space does not look at the property and does not handle it, then it's still going to basically fail. So, yeah, I'm trying to reach out to the other user space communities and help out to scale this feature. [Question about the Chameleon board plans.] So far, we have just ordered the Google Chameleon boards. We have the budget for that, yes, but no resources are allocated to it yet, so it's going to take a while, but we will work on that in 2018. Again, I would need help from the community. I know Lyude Paul from Red Hat has done a lot of work testing DP MST and other corner cases with hot plugs using the Chameleon boards.
So, I'm going to reach out to her, yeah. [Question: can anybody buy the board?] Anybody can order it. And replacing the DPR-120 with Chameleon boards is way cheaper, because the DPR-120 licensing fees are ridiculous, $5,000 or something. So, we definitely can't have more than two DPR-120s, and with the DPR-120 there's no way we can integrate it with CI. So, we would need a solution on the Chameleon boards. [Question about DisplayPort versions.] DP 1.2 and 1.3; we haven't tested DP 1.4 yet. The DPR-120 vendors have added, or are adding, new tests for DP 1.4 support, but we haven't run that compliance suite yet. Yes, for the newer platforms too. Okay. Thank you.