My name is Manasi Navare, and I work at Intel as a graphics kernel developer. So before I start my talk today, how many of you have worked with the graphics kernel before? Awesome. And how many of you know what KMS means? Wow, that's really awesome. That's not how I was when I started working on this project about a year ago. I had no idea what DisplayPort compliance meant, how it fit into the graphics stack, or what atomic or KMS meant. I was just totally thrown into the deep end on this project. But today, I have implemented a solution for DisplayPort compliance, and I have it upstreamed in the graphics kernel driver. So I'm going to talk about my journey through this project and how I took baby steps to achieve this. Has this ever happened to anyone before: it's a Monday morning, you go to the office, hook up your Linux machine to a DisplayPort monitor, and there's no display? Has anybody seen that? Always, right? Or you're trying to watch a movie, you connect your laptop to the projector, and all you see is a flickering, ugly screen. It's bad. It's really frustrating. So my goal was to fix all those problems with DisplayPort by making the Intel graphics driver DisplayPort compliant and upstreaming the solution so that all of you can use it. There's a famous saying, right? A journey of a thousand miles begins with a single step. Being a total newbie to graphics development, and open source development for that matter, I knew this was going to be a long journey for me. So I broke this task down into several subtasks and just started taking baby steps. The first question I had in my mind was: what is DisplayPort compliance, and why is it important for Intel? Let's look at an example. Say the first user has an Intel laptop connected to a monitor. The second user has an Intel desktop connected to a monitor, and both these connections are made over a DisplayPort cable. So what do you expect?
If both these Intel devices are running the same graphics kernel driver, they have the same DisplayPort driver, so they should behave the same. They should give the same end-user experience, right? This interoperability is possible if the DisplayPort driver is actually implemented according to the spec. The VESA DisplayPort spec is the standard that defines this digital interface for DisplayPort connections. And there is a detailed test procedure which you can run to make sure that the driver complies with the spec. That is called DisplayPort compliance. It's extremely important for our customers that we follow the spec and have a DisplayPort-compliant driver, so that they can deliver that user experience to their customers. So let's see what happens when you connect a DisplayPort cable. The device which is sending the data is, in DP terms, called the DP source. The device on the other end, which could be a monitor or a projector receiving the data, is the DP sink. What happens when you connect the DP cable between the source and the sink device? The very first signal that the sink device sends is a hot plug detect signal. It's just an interrupt request signal going to the source device, so the source knows, hey, something new is connected. What does it do? It initiates DPCD reads on the AUX channel. DPCD is the DisplayPort Configuration Data register space in the sink device, and the source tries to read this information. Through this, it reads the receiver capabilities, like the link information, and the EDID data. Once it has this information, it does a calibration routine to negotiate the receiver capabilities between the source and the sink device. Once that's done, it's ready to start sending data on the main link. So let's dive a little deeper and understand the most important concept in establishing this DisplayPort connection: link training.
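The talk doesn't show code, but the DPCD read it describes can be sketched as a small decoder. This is a minimal, hypothetical example: the register offsets (0x000 revision, 0x001 max link rate in units of 0.27 Gbps, 0x002 max lane count in the low five bits) follow the DisplayPort spec, while the byte values and the `decode_dpcd` helper are invented for illustration; in the real driver the buffer is filled over the AUX channel.

```python
# Sketch: decoding the first few DPCD receiver-capability bytes that the
# source reads over the AUX channel after hot plug.  The example bytes
# are made up; in the kernel this buffer comes from an AUX-channel read.
def decode_dpcd(dpcd):
    """Return (dpcd_rev, max_link_rate_gbps, max_lane_count)."""
    rev = (dpcd[0] >> 4, dpcd[0] & 0xF)   # DPCD 0x000: revision nibbles
    link_rate = dpcd[1] * 0.27            # DPCD 0x001: units of 0.27 Gbps
    lane_count = dpcd[2] & 0x1F           # DPCD 0x002: low 5 bits
    return rev, link_rate, lane_count

# Example sink: DPCD rev 1.2, 5.4 Gbps max link rate (0x14), 4 lanes.
rev, rate, lanes = decode_dpcd(bytes([0x12, 0x14, 0x84]))
print(f"DPCD {rev[0]}.{rev[1]}, max {rate:.2f} Gbps per lane, {lanes} lanes")
```

These three values are exactly what the source needs before it can begin the calibration described next.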
So when we say DisplayPort link, what exactly are we talking about? The actual physical link can have multiple lanes; it can support up to four lanes. And the channel quality, the physical-layer capability, determines how many bits per second you can transmit per lane. That's called the link rate. The calibration routine I just talked about is the handshaking protocol that happens between the source and the sink device to configure these link parameters, and that's called link training. Let's see what happens in link training. First, you connect the DisplayPort cable and you get the hot plug detect signal. That's when the source device reads the maximum lane count and link rate supported by the sink device. Then it starts the link training, because it still has to figure out what parameters are going to work over the actual cable. The first phase is clock recovery. In this phase, the source starts sending a known training pattern sequence onto the main link, and on the receiver end these symbols are used to extract the clock information and determine whether the link can actually work at that specific link rate. That's clock recovery. Next is channel equalization. This is the phase where the receiver end tries to understand how the data is actually mapped onto the several lanes. Once these two phases are successful, we say that the link is ready. What that means is that the link is now set to carry data at a specific lane count and a specific link rate, and that's when link training is successful. This is a very important part of establishing any connection. If link training fails, you basically will not have a stable link, and that's when you see those black screens or flickering displays. So how was I going to test this compliance? I used a piece of equipment called the DPR-120. It's manufactured by Unigraf, and it's certified by VESA to run these compliance tests.
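The two training phases just described can be modeled as a tiny simulation. Everything here is a stand-in: `cable_supports` fakes the physical handshake, and the real phases transmit training patterns over the main link and poll DPCD status registers rather than calling Python functions.

```python
# Minimal sketch of the two link-training phases, with a made-up
# predicate standing in for the physical channel quality.
def cable_supports(rate_gbps):
    return rate_gbps <= 2.7      # pretend this cable maxes out at 2.7 Gbps

def clock_recovery(rate_gbps):
    # Phase 1: receiver locks its clock onto the known training pattern.
    return cable_supports(rate_gbps)

def channel_equalization(rate_gbps):
    # Phase 2: receiver learns how symbols are mapped across the lanes.
    return cable_supports(rate_gbps)

def link_train(rate_gbps):
    # The link is only "ready" when both phases succeed at this rate.
    return clock_recovery(rate_gbps) and channel_equalization(rate_gbps)

print("train at 5.4 Gbps:", "ok" if link_train(5.4) else "failed")
print("train at 2.7 Gbps:", "ok" if link_train(2.7) else "failed")
```

The point of the model: training at the sink's advertised maximum can fail on a marginal cable even though a lower rate would work, which is exactly the failure the rest of the talk is about.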
This device is hooked up to the device under test, which could be your laptop running the graphics driver, and its output DP cable is connected to the DP monitor. So it basically acts as a reference device, and it can run the compliance test suite. For each test, it requests a specific data or video pattern from the device under test and passes the output through to the DP monitor. And it basically sits there and taps the values on the AUX channel. So it's monitoring all the transactions on the incoming DisplayPort cable and on the outgoing DisplayPort cable, and it compares that data to the reference values. If it matches, we say that the device is compliant. That was my test setup for checking whether our driver was actually DisplayPort compliant or not. But I really needed to figure out how this mapped to our graphics stack, and which areas of the graphics stack I would need to modify, or fix for that matter, to get this compliance thing going. First there is the hardware, the Intel integrated graphics device. That's the hardware that does the display and rendering acceleration. The component right on top of this is the Linux kernel. This is where the Intel graphics driver sits, the i915 driver, which knows everything about our hardware, and it configures the hardware according to the commands it gets from user space. On top of that is DRM, the Direct Rendering Manager. It's also part of the Linux kernel; it implements the part of the kernel that is common across the different hardware-specific drivers. And it exposes APIs to user space, so user space can send information all the way down to the hardware to request a specific mode or a specific display that it wants to render on the screen. All these components have to interact with each other in a way that the whole stack actually complies with the DisplayPort spec.
And so that when you connect the cable, you get that perfect frame. The next thing I wanted to learn was what kernel mode setting, KMS, really is. I guess most of you know, but I really stumbled on this concept. It took me a long time to understand what it really meant when people said, OK, you connect the DisplayPort cable, that's when the mode set happens. So what is kernel mode setting? Let's start with the data that is in memory. The RGB pixel data is in the framebuffer. The first thing that happens is that the framebuffer gets scanned out, and that is done by the CRTC. It's the piece of hardware that takes this RGB pixel data and generates the bitstream according to the video timings. Then it sends this data to the encoder. The encoder takes this bitstream and generates the signals appropriate for the type of connector that is attached. So it's the piece of hardware that decides whether it needs to generate DVI signals or HDMI signals or DisplayPort signals. Then it goes through the connector. That's the actual physical connector to which you can connect a monitor, or, in the case of embedded displays, it can be an internal eDP panel. And then finally you light up the display and you see the picture on the screen. So this process of setting up the clocks, scanning out the buffers, initializing the pieces of hardware like the CRTC and the encoder, and finally lighting up the display is called kernel mode setting. So why is it really atomic? What is atomic KMS? I call it the two-step. This was another concept that took me a while to understand, to join the dots and see how it mapped to kernel mode setting. So again, going back to what happens when you connect the DisplayPort cable between the source and the sink device: that's when user space creates a list of parameters, or a list of properties, that it wants to change on the hardware.
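The scanout path described above (framebuffer → CRTC → encoder → connector) can be sketched as a toy pipeline. This is purely illustrative: real hardware streams pixels at the dot clock, and the function names here are invented, not DRM API.

```python
# Toy model of the KMS scanout path: framebuffer -> CRTC (raster-order
# scanout) -> encoder (signal format for the connector type).
def crtc_scanout(framebuffer):
    # CRTC: read the framebuffer line by line into a bitstream.
    return [px for row in framebuffer for px in row]

def encode(bitstream, connector_type):
    # Encoder: wrap the bitstream in the signal format the connector needs
    # (DVI, HDMI, or DisplayPort in the talk's example).
    return {"signal": connector_type, "stream": bitstream}

fb = [[0xFF0000, 0x00FF00],    # a 2x2 "framebuffer" of 0xRRGGBB pixels
      [0x0000FF, 0xFFFFFF]]
frame = encode(crtc_scanout(fb), "DisplayPort")
print(frame["signal"], len(frame["stream"]), "pixels")
```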
And then it sends it to the kernel through a single ioctl call, DRM_IOCTL_MODE_ATOMIC, where user space passes all this information to the kernel. The kernel here is responsible for a two-step process; it implements two operations. The first step is the atomic check phase, where it builds the state of the device. It forms the state structures for the different DRM mode objects, for the plane, the CRTC, and the connector. This is where it actually validates the mode requested by user space. So if user space is requesting a specific mode, say a 4K mode, this is where the kernel checks whether it is capable of rendering that mode within the hardware limitations, or whether it's too big for the screen. Once this phase succeeds, it goes to the next phase, which is the atomic commit. This is where it actually writes all the data to the hardware, in one step. And the expectation is that this step will always succeed, because we have already checked, or validated, the mode against the hardware limitations in the previous step. So at this point, I had understood what DisplayPort compliance meant, and had gotten some idea of how the graphics stack works and what atomic and kernel mode setting are. So I thought, yes, I've got the ball rolling. I thought I had figured everything out, and it was all going to be a piece of cake after this. But wait, did I? I got very excited. I was like, OK, I'm going to see if it passes compliance; then my job is done, if everything works. So I hooked up my laptop running the graphics driver to the DPR-120, connected it to the DisplayPort monitor, and ran the compliance test suite. But guess what? Murphy's law is true. The compliance tests failed. So that meant I had to do more investigation and more coding to figure out what was going wrong and fix it.
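The check/commit split can be sketched as follows. The hardware limits and the state dictionary are invented for illustration; the real interface is the single DRM_IOCTL_MODE_ATOMIC ioctl operating on DRM state structures, not these Python stand-ins.

```python
# Sketch of atomic KMS's two-step: validate the requested state against
# (hypothetical) hardware limits first, then "write" it, so that the
# commit step itself has no reason to fail.
MAX_H, MAX_V, MAX_CLOCK_KHZ = 3840, 2160, 533000   # made-up pipe limits

def atomic_check(state):
    # Step 1: can the hardware actually render this mode?
    return (state["hdisplay"] <= MAX_H and
            state["vdisplay"] <= MAX_V and
            state["clock_khz"] <= MAX_CLOCK_KHZ)

def atomic_commit(state):
    # Step 2: write everything to the hardware.  Expected never to fail,
    # because the state was already validated in atomic_check().
    return f"committed {state['hdisplay']}x{state['vdisplay']}"

mode_2k = {"hdisplay": 2560, "vdisplay": 1440, "clock_khz": 241500}
if atomic_check(mode_2k):
    print(atomic_commit(mode_2k))
else:
    print("check rejected the mode")
```

Note the assumption baked into the model: nothing in `atomic_commit` can fail. The next part of the talk is about the one real-world step, link training, that breaks this assumption.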
So I did some more investigation, looked into the code, and found the problem. Let's look at what exactly the problem was, going back to the pieces that come into the picture when you connect the DisplayPort cable. There's a sink device, which is your monitor, to which you connect your DisplayPort cable. It sends the hot plug detect signal to the kernel. User space then requests a specific mode to be displayed on the monitor; let's say it's requesting a 2K mode. Now the kernel goes through the atomic kernel mode setting that we talked about. It goes through the first step, where it validates this mode: that's the check phase. Then it goes to the commit phase, where it actually writes all the data to the hardware. And this is the step where it has to do link training, because in link training you have to actually write to the hardware; you have to send the training symbols over the link. So it cannot happen in the first step; it happens in the commit phase. And that's the first time you actually find out whether a link trained at, say, 5.4 gigabits per second will work for the requested mode. But it can very well go wrong, and it can fail, because we have never tested the physical capability of the cable before. And what if we get a link failure? That's when we get a black screen. And there is no way we can send this information back to user space, because the commit phase is never expected to fail. That was the problem. We get a big fat error message in dmesg, if you look at the kernel log, saying, hey, link training has failed. But then what? It's a black screen, and it's a dead end right there. And you have this really unhappy user staring at a black screen. So somebody had to fix this problem. This is where I came into the picture. And I was like, OK, I have the knowledge, I have the community to help me.
So I'm going to go and fix this. Let's see what the solution was. Sorry, I got to the next slide first. OK, so the first half of this diagram is the problem, and the second half is the solution that was implemented. So what happens? User space requests a specific mode, and the kernel validates it and does the link training in the commit phase. Let's say the link training actually fails at the requested link rate, which is 5.4 gigabits per second. So what does the kernel do? At this point, we introduced a new property for the connector: the link status. As soon as the kernel knows that link training has failed, it changes this link-status connector property and sets it to bad. Then it sends a hotplug uevent back to user space, so user space knows that something is wrong, something has changed in the hardware, and it has to do something about it. That's when user space requests another mode set; it essentially retries the mode set. And this time, it requests the mode at a lower resolution, because the 2K mode did not work; we invalidate that mode at this point, and user space requests a lower resolution. When the kernel gets this request, it again goes through check and commit, until it gets to the point where it has to retrain the link. And at this point, because it retains the information about the link rate at which the previous link training failed, it tries to retrain the link at a lower link rate. And this actually works for the physical cable, right? The link training passes, we have a good link status, and finally we have a successful mode set. And we have a happy user, right? No more black screens. It is a lower resolution, but it's better than a black screen. So that is how we fixed this issue. And we have seen that now, when you actually hot plug the device, when you connect a DisplayPort cable, you do not get a black screen.
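The retry flow just described can be simulated end to end. Everything here is a stand-in, not the real DRM API: the "kernel" trains during commit, marks the link status bad on failure and remembers the failed rate, and the "userspace" loop plays the role of the compositor reacting to the hotplug uevent by redoing the mode set.

```python
# Self-contained simulation of the link-status fallback: fail at the
# sink's maximum rate, report "Bad" to userspace, retrain lower on the
# retried mode set.
RATES_GBPS = [5.4, 2.7, 1.62]

def train(rate_gbps):
    return rate_gbps <= 2.7        # pretend the cable only handles 2.7 Gbps

class Kernel:
    def __init__(self):
        self.rate_index = 0        # retained across mode sets
        self.link_status = "Good"

    def commit(self):
        rate = RATES_GBPS[self.rate_index]
        if train(rate):
            self.link_status = "Good"
            return f"mode set done, link trained at {rate} Gbps"
        self.link_status = "Bad"   # plus a hotplug uevent to userspace
        self.rate_index += 1       # fall back on the next attempt
        return "link training failed"

kernel = Kernel()
events = [kernel.commit()]         # first mode set fails at 5.4 Gbps
while kernel.link_status == "Bad":  # userspace: got uevent, retry mode set
    events.append(kernel.commit())
print(events)
```

The key design point the model captures is that the fallback state (`rate_index`) lives in the kernel across mode sets, while the decision to retry lives in userspace.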
You do not get random error messages in dmesg and just a black screen or a flickering display. Here is an outline of what was added. We introduced a new DRM property called the link-status property; it's attached to the connector. Then there is a helper function to set the link-status property, which is used by the i915 driver. And there's a helper function to get this property, so that user space can read the link status information. All these helper functions were added in the DRM layer. So if you are working on this on different hardware, you can very well use these helpers and make use of this new link-status property. If you want more information, this code is available in the upstream i915 driver, so you can go look it up and play with it, and maybe fix it if something is broken. So why do we really need this asynchronous reporting through a property? Why can't we just fail the atomic commit and send a uevent back to user space? Well, atomic check guarantees the requested mode, so it's expected that atomic commit will never fail. But link training is an exception, because we have to do link training in the atomic commit, and it can fail because it totally depends on the actual physical cable. The link might also fail after a successful mode set. What that means is, OK, you've connected the cable, the first time you get it all going and it's all happy, but something goes wrong in the cable later and it fails. So you need a way for the kernel to send a notification back to user space that something has changed with the hardware. Also, atomic allows non-blocking commits, which means you can return control to user space without actually finishing the mode set. So what if the link training fails during or after the mode set? There's no way user space would know without this link-status property. So this was the solution that was implemented.
And I thought, OK, I'm done implementing the solution. But the biggest hurdle now was getting it upstreamed. And that was the good and bad part. It was fun, because there were developers all around the world looking at my code and giving me feedback at every stage. But it, of course, came with some upstreaming challenges. It took me a really long time to convince the community about this design: that it was going to fix the problem, that it was actually going to pass compliance, and that this was what was needed to fix the broken link training handling. So I came up with some rules that helped me a lot in getting these patches upstream, and I'm sure that if you're new to open source development, these rules will help you too. I call them Linus's rules. The first rule: no regressions. That means no GPU hangs, no black screens. If you submit a patch, it should not break the code, and it should not result in black screens or regressions. That also means the review cycles can get really aggressive. This is how a typical review cycle looks. Up there, you're a developer working on your code. The first thing you do is submit patches. You submit them to the public mailing list; it's the Intel Graphics mailing list, but it's a public list, with developers from outside Intel looking at it and giving you constant feedback. So you get lots of review comments and you submit new patch revisions. This keeps going for a long time until you finally get a Reviewed-by. Yes, and that's the day when you go party, right? You get a Reviewed-by and your patch gets merged into the drm-intel tree. That's the subsystem tree where it gets merged first, and there's a lot more testing that happens there. From there, it gets pulled into the DRM tree.
That happens on a weekly basis: a patch can go into drm-fixes if it's a fix for a known issue, or into drm-next, which means it's a feature for the next release. This is the tree that Linus is watching. Linus pulls the patches from this DRM tree on a regular basis and announces his release candidates, the intermediate releases where the fixes and patches for the next version are collected. It goes through the cycle of release candidates for a long time until it's stable, and then it finally becomes part of the next Linux release. So it's a long process from when you submit the patch to when it hits the next Linux release; it can take a really long time. For the solution I talked about, it took me probably a year to get those patches upstream. Next rule: never blame user space; it's always the kernel's fault. It's kind of sarcastic, but it's also true, because the kernel is the component sitting closest to the hardware. So if the hardware doesn't behave as expected, the kernel developer is the one to blame. And this can actually get you into a chicken-and-egg situation, and I was stuck in one for a very long time. Because if your solution is implemented in the kernel, but it also impacts user space and requires changes in user space, then what do you merge first? You can't really merge the kernel patches until you have tested that user space works with that solution and that you're not breaking something in user space. But you can't merge the user space changes because the kernel patches have not yet landed. So this is really complicated, and it's very frustrating. So what did I do? My solution of implementing the link-status connector property did have impacts on user space, because I wanted the user space driver to change and make use of this property so that it could request the new mode set and the link could be retrained.
And finally, we could get a successful mode set. So first I submitted my kernel patches. I got reviews from a lot of peers, and finally I got a Reviewed-by. I got a lot of Acks, from people at Intel and from people outside of Intel, which was good. So the design was approved, and it was a sigh of relief that it wasn't breaking anything, because a lot of people had tested it. Then I started following up with the user space community. I made sure that they made the changes, tested this new property with user space, confirmed it was working as expected, and submitted their patches. Then we waited a long time for reviews from the user space community, finally got consensus from the user space side, and then the patches were ready to be merged in the kernel. So it took a really long time to implement this solution, because it was a huge ramp-up, and then it took a very long time to actually get those patches upstreamed. So that was the end of my journey. And I literally felt like this: accomplished, on top of the world, when my patches finally got merged after a year of following up with the maintainers, pinging them on IRC, and bugging a lot of people to review my patches. But there were a lot of things I learned along the way. It was definitely a steep learning curve. If you start working on an open source project, especially the graphics kernel, it takes a very long time to gain expertise on a certain part of the driver. And when you submit patches for that part of the driver, you have to make sure they're not breaking something else in the driver. You can't really go off and keep playing with all the other parts of the driver yourself; you have to make use of the community. You have to leverage the knowledge they have about other parts of the driver and get feedback from them to improve your code. Submit patches. Don't be afraid of submitting patches and opening your work up to a larger audience. I made that mistake.
I kept working on my code. It was all in my internal tree, and I had the compliance tests passing there. And I used to go to my manager and say, hey, it's all working, it's awesome, I'm done with this project, what can I work on next? And he'd say, OK, is it upstream? No, I hadn't even submitted it to the mailing list. And that was really the beginning of the journey. The day I submitted it to the mailing list, that was the beginning, because after that I went through probably 15 or 20 revisions of the patches. People started nitpicking on the smaller things, which is good, because the code has to fit with the overall design of the driver. But that's your first step outside of your comfort zone, so don't be afraid of taking it. Feedback is always constructive, so don't take it as criticism, and I would say, don't take it personally. Because in open source development, you're submitting your patches and they're getting reviewed by people you have never met before. You don't even know what company they work for. You can't go to their manager and say, hey, he's giving me this nasty feedback, I don't want to work with him. You can't do that. And yeah, I got a review saying: this sucks, it's going to break link training, it's very fragile, don't touch that part of the driver. So yeah, it was frustrating. But the more I kept looking at the review comments, the more I came to see that they were for my benefit. So just follow up on the review comments, ask questions: why do you think this is going to break the code? What do you expect? How can I change it? And take the feedback constructively. And yes, you will see the finish line, finally. Your patches will get merged. You just have to follow up, keep improving your code, and keep pinging maintainers and developers on IRC. And don't give up. Your patches will get merged. Thank you.
These are just some links where you can find the code in the upstream i915 driver. There's lots of documentation on 01.org; if you have questions about KMS, about these properties, or about how the atomic infrastructure works, you can go look at the documentation. And that's my email address, so shoot me an email, or review my code; that would be the biggest help and contribution. Thank you. Sorry? Yeah, yeah, sure, of course. So this was a new feature, right? And the thing I learned is that when you're implementing a new feature in the open source community, writing a design document never works, because nobody has time to go look at a design document. I think that's the closed-source way of looking at things. So if I want to implement a new feature, what I have done is: I have the design for my new feature in my head, and I implement it in code, even if it's rough. Maybe I haven't even tested it. And then I submit it as a patch, but I mark it RFC, request for comments. When you submit it as a patch, that's when developers are actually going to look at it. That's when they're going to think, OK, it's actually being implemented, it's interesting, let me look at it, let me review it. And you don't have to worry about whether it's going to break the code, because you've submitted it as RFC, so it's not going to get merged and it's not going to break something. You just get people to look at your design and get their feedback. That was the first step for me: just getting people to look at it and confirming that I was on the right track. It wouldn't.
And that's the thing, because the problem there is: if you fail at a specific rate, you cannot just go and try a different rate right there, because you haven't set up your clocks or CRTCs for that rate. That is the reason you have to do the entire mode set again: the setting up of clocks and CRTCs happens in the previous phase. So if you just fail in commit, you can't go and redo link training without going through the previous step. And that's the reason you have to send it back to user space and say, hey, something failed, now redo a mode set. That was the inherent problem there. OK, so when you were trying the link rate in the previous iteration, you found the minimum link rate that is required for that resolution, right? So if you try a lower resolution at the same link rate — well, you know that rate failed, so you have to go to a lower link rate. And you need the lower resolution because otherwise you know the link is not going to support it anyway. I think I'll take one more. I think so, yeah. So right now, this is the only exception in atomic commit, because all the other errors get caught in the previous stage, where we are just validating things; a lot of failures get caught there and we send the notification back to user space. But if you find something else where you're touching the hardware for the first time in the commit and there's a possibility it will fail, then absolutely, you have to take this approach, because you can't just go and fix it there, or you're breaking the entire display pipeline. Yeah. I think that's the only way to go about it: create a property and then send that information back to user space, because there are already helpers that can get a property.
So if we add a property for the connector or the CRTC, then yes, we'll be able to implement it similarly. You had a question? The link training — I read the DisplayPort spec. The DisplayPort spec describes what is supposed to happen: the first thing is hot plug detect, then the link is supposed to be trained at a specific rate, and it actually walks you through the whole algorithm. And that's when I realized, hey, the driver did not actually implement what's stated in the spec, because we were not following the algorithm. We were not lowering the link rates; we were just trying once, and that was it: failed. You don't have to pay. You just subscribe and you get the spec. Yeah, it's open. 4.10, 4.10, yep. Yeah, it took a long time. I think when I started working on it, it was last year, January, and yeah, the code was really old; atomic was not even implemented. So there were stages: I first tried this with legacy user space drivers, but because the whole trend is moving towards atomic, I then made this an atomic property so that it can be used by the atomic user space drivers, and tested it with the atomic driver. And yeah, it's still not merged in user space yet. We have tested with the SNA driver and the modesetting driver. You don't really have to have the change in user space; it's not going to break anything. But if you want to improve things, if you want to fix the black screens, then you need user space to make use of this property. It has to call get-property, read this property's value, and actually redo a mode set. It's on the mailing list. It's on the mailing list, and I'm there every day; I am on IRC asking, you know, is it going to be reviewed? So it has an Ack from the maintainer, and we're just waiting for somebody to sit down, actually review it line by line, and give me a Reviewed-by. Oh yes. Yeah, so eDP: the difference is — well, the similarity is that it's the same process.
You get the hot plug detect signal and we read the DPCD registers; there's a different set of registers for eDP. But for eDP we only link train once, because it's an internal connector. There is no way it's going to break later, and it's not part of the mode set. So we train only once, and if something happens with the internal connector, it sends out a short pulse and we retrain at that time. If you look at the spec, it doesn't really ask you to retrain at lower link rates and go through the same algorithm for eDP. As for eDP compliance, we haven't really implemented any tests for it. We'd have to follow up with Unigraf to see whether they have an eDP compliance test suite for the DPR-120. But I can check on that and get back to you if you want. Video underruns won't be because of link training, but they definitely mean that you are validating the state and configuring something in the atomic check and then probably misconfiguring that pipe. The CRTC that we talked about, we call it a pipe; it takes the framebuffer and sends it to the encoder. So if you configure the pipe wrongly, feeding it less data than it expects, or feeding the data slower while the pipe is running too fast, then you're going to see underruns. So it's probably something wrong in the configuration part, in the atomic check phase. That's something you'd have to look into in your driver, to see whether you're configuring it correctly for the requested data rate and so on. Any more questions? OK, thank you.