Hi everyone, I wish I could have been here with you today, but unfortunately I couldn't. My name is Maxime Ripard, and I'm one of the drm-misc maintainers, drm-misc being a git tree in Linux that takes care of all those small display drivers that you usually see in small devices or embedded systems, or drivers that no one really cares about, or at least invests a lot in. I'm going to talk a bit about the display stack itself, a few challenges I perceive in it, and hopefully how we can get rid of them in order to get a more reliable display stack. So, testing KMS. The first thing we need to do is define why we need to test KMS in the first place. KMS stands for Kernel Mode Setting; it's basically the modern display API in Linux, and it was originally started for desktop-grade GPUs like Intel's, something like 15 years ago. For the last decade, there's been a massive effort to make KMS development much easier and to consolidate a lot of the framework itself, in order to make the introduction of new drivers easier. Thanks to that, we now have more than 60 KMS drivers in-tree, we usually get one or two new drivers per release, and that's still expanding. We also usually have between 1,500 and 2,000 patches per release, so the development effort is fairly big. Both the effort to make KMS easier and the huge amount of development have made the former display API, fbdev, pretty much dead these days: there's no new development happening at all, and if it's maintained, it's only for bug fixes. There's no longer any use case addressed by fbdev that isn't addressed by KMS. The only thing that keeps fbdev a bit alive is its UAPI, and even that is going away, with user-space projects moving to KMS more and more every year.
And so KMS kind of became the standard, with basically every new driver and display-related feature targeting KMS and nothing else. That effort is massive, but on the maintenance front it's fairly easy, because part of the effort to make drivers easier has been to create helpers, helper functions, in order to share as much common code as possible and keep only the device-specific details in the drivers. That means that nowadays, for a fairly simple device, you would usually get a driver that is below 500 lines of code, which is fairly small for something that important and with as many features as a display driver. Basically every driver is using helpers, so it's actually fairly easy to maintain, because pretty much everyone is using the same code, and the only things that aren't shared have a very well-defined set of semantics, so it's not something where you have huge expectations. The thing is, there are a lot of features in KMS, and a lot of use cases implemented by the code itself, so the amount of features is really huge, especially when you add the features for the display outputs themselves, not only the API. It's easy to overlook or underestimate some of them when you develop or maintain a driver. It's also fairly easy to misunderstand some of the requirements, like what the locking requirements are, or the side effects that implementing one of the features may have. Worse, it's fairly easy to be unable to test them. For example, let's assume we have a shiny new feature that is only used by, say, Android. If your board doesn't run Android, then it will be fairly difficult to test how your driver behaves when that feature is in use, because no one really expects you to port the entire Android display stack to your new platform, but then you don't have anything to test it with.
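As a rough analogy of the helper pattern described above (this is not actual kernel code, just a sketch with invented names): the framework provides shared logic with well-defined semantics, and a driver only fills in a few device-specific hooks.

```rust
// Rough analogy of the KMS helper pattern, not actual kernel code:
// the framework provides shared logic, and each driver implements
// only the small device-specific hooks. All names here are invented.
use std::collections::HashMap;

trait DisplayHelper {
    // Device-specific hooks the driver must provide.
    fn write_reg(&mut self, reg: u32, val: u32);
    fn max_clock_khz(&self) -> u32;

    // Shared helper logic, identical for every driver.
    fn set_mode(&mut self, clock_khz: u32) -> Result<(), &'static str> {
        if clock_khz > self.max_clock_khz() {
            return Err("mode clock too high for this device");
        }
        // Hypothetical register layout, purely illustrative.
        self.write_reg(0x00, clock_khz);
        self.write_reg(0x04, 1); // enable
        Ok(())
    }
}

struct SimpleDriver {
    regs: HashMap<u32, u32>,
}

impl DisplayHelper for SimpleDriver {
    fn write_reg(&mut self, reg: u32, val: u32) {
        self.regs.insert(reg, val);
    }
    fn max_clock_khz(&self) -> u32 {
        148_500 // 1080p60 pixel clock, in kHz
    }
}

fn main() {
    let mut drv = SimpleDriver { regs: HashMap::new() };
    assert!(drv.set_mode(148_500).is_ok());
    assert!(drv.set_mode(594_000).is_err()); // 4K60 rejected by shared check
    println!("helper sketch ok");
}
```

The point of the pattern is that the validation and sequencing live in one shared place, which is why real drivers can stay under a few hundred lines.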
And it's even worse, as you can see on the graph. The x-axis has the drivers currently supported in KMS as of 5.14, and the y-axis has the number of contributors with more than one patch per release on average since the introduction of the driver in Linux, with a dashed line at 5. You can see that aside from the desktop-grade drivers, so amdgpu, i915, Nouveau and Radeon, and a bunch of embedded-grade drivers like Exynos and MSM, which are for Samsung and Qualcomm, most drivers usually have one or two contributors submitting a patch every release, and that's a fairly low bar. Given that you don't have a lot of contributors on those drivers, or at least not very active ones, it's going to be very difficult for them to keep up with the rate of KMS development, but also to build enough knowledge to know basically everything I've been discussing before. So you can't really expect each driver to be maintained by someone who knows the framework in depth. Then you have the other issue, which is that the hardware is usually barely accessible. The controllers themselves might not be easy to access: they might be proprietary, they might have been produced in very small quantities, or not produced any more at all, so it's going to be difficult to get access to a controller driven by one of those drivers. And even if you do, you don't really have a guarantee that the platform you got exposes all the hardware features that would allow you to test everything. For example, if you wanted to test the HDMI support of a driver and you can only get your hands on a platform that has that controller but doesn't have an HDMI output, then it doesn't really help. So we kind of end up in a situation where the people with hardware usually don't have the time, and thus the knowledge, to test everything in depth, and then you have a few people with a lot of in-depth
knowledge, but who don't really have the hardware, or the time, to make sure that all those drivers are running properly. So, for example, you can end up in a situation where you fix one thing and then uncover another bug because of one wrong expectation, or just a misunderstanding. And that's kind of what happened here; it's an example that happened to me when working on the Raspberry Pi 4, and it's still ongoing. It initially started with me adding support for an HDMI feature called scrambling, which is basically a change of the transmission mode to reach higher transmission speeds on the HDMI cable. It's done by enabling a scrambler, and it requires the cooperation of the display: you notify the display that the transmission mode has been changed, so it should adapt to the new one. It's needed for modes at or above 4K at 60Hz, so basically every mode that is hyped these days. The thing is, when you disconnect the display, it will obviously lose its scrambling status, so when you reconnect it, the expectation is that something notifies the display again that it should change its transmission mode. For that to work, you obviously need a way to detect whether your screen has been connected or disconnected, and you basically have two ways to do this in KMS. The first one is polling: the core will poll the driver for the display status, every 10 seconds I think. The second one is based on interrupts, but not all platforms have hotplug interrupts, and that was the case here: not all Raspberry Pis are capable of using interrupts to detect a change of the connection status on the HDMI controllers, so initially the driver was relying on polling, to have a path consistent
between all the Raspberry Pis. Now, with the Pi 4, we have the option to use an interrupt, so the second change here was to use interrupts to detect hotplug status changes on HDMI, and thus be able to react without overlooking any connection or disconnection. Otherwise, if you disconnect and reconnect a display very fast while using polling, chances are, since the polling period is 10 seconds, that both would happen within those 10 seconds and the driver wouldn't be able to tell. An interesting thing happened after it was merged: it turns out that some applications, when they are notified that the display has been disconnected, will disable the HDMI output, which makes sense, but will also do some HDMI CEC accesses, which in our case made the CPU stall completely, because the HDMI controller was completely shut off, and on the Raspberry Pi SoC, when the CPU tries to access a register of a controller that is powered down, the CPU just stalls and the whole bus freezes. So yeah, that made it crash. There was not really any expectation documented anywhere that this access was actually valid, but it was, so we fixed it, and I added a bunch of documentation where it was needed. There were a few more instances of a similar issue, and that's fine, right?
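To make the race concrete, here's a small simulation (with hypothetical timestamps, using the 10-second poll period mentioned above): an unplug/replug that happens entirely between two poll ticks is invisible to a polling driver, while an interrupt-driven driver sees both edges.

```rust
// Simulate hotplug detection: polling every 10s vs. reacting to events.
// Timings are hypothetical; the 10s poll period matches the one the
// DRM core uses, as mentioned in the talk.

#[derive(Clone, Copy, PartialEq, Debug)]
enum Status { Connected, Disconnected }

// Connection status over time, as (timestamp in ms, new status) events.
fn status_at(events: &[(u64, Status)], t_ms: u64) -> Status {
    events.iter().rev().find(|(ts, _)| *ts <= t_ms)
        .map(|(_, s)| *s)
        .unwrap_or(Status::Connected)
}

// A polling observer only sees the status at each poll tick, so it
// counts only the transitions visible at those ticks.
fn polled_transitions(events: &[(u64, Status)], period_ms: u64, end_ms: u64) -> usize {
    let mut seen = status_at(events, 0);
    let mut transitions = 0;
    let mut t = period_ms;
    while t <= end_ms {
        let now = status_at(events, t);
        if now != seen {
            transitions += 1;
            seen = now;
        }
        t += period_ms;
    }
    transitions
}

fn main() {
    // Display unplugged at t=12s, replugged at t=14s.
    let events = [(12_000, Status::Disconnected), (14_000, Status::Connected)];

    // With interrupts we see both edges...
    let irq_transitions = events.len();
    // ...but a 10s poll (ticks at 10s, 20s, 30s) reads "connected" every time.
    let poll_transitions = polled_transitions(&events, 10_000, 30_000);

    assert_eq!(irq_transitions, 2);
    assert_eq!(poll_transitions, 0);
    println!("interrupts saw {irq_transitions} edges, polling saw {poll_transitions}");
}
```

This is exactly why the missed replug matters for scrambling: if neither edge is observed, nothing ever re-notifies the display.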
Except that we got another bug, which is that when running in 4K, if you put your TV in standby and then turn it on again, you would get a TV that is completely black. That was happening because when the TV is turned on, it will usually toggle the hotplug line to basically fake the fact that it has been disconnected and reconnected. The kernel, with interrupts, was able to detect it, but since it was exactly the same display as before, we didn't change the resolution or anything, so everyone was fine with the current configuration; nothing had changed. And we were only enabling scrambling on the display when setting a new mode, which no one did. So basically the Raspberry Pi was sending data with the scrambler on, and the display was expecting the scrambler to be off. So again, this was one of the framework expectations that wasn't really documented anywhere. The way it works when you have an interrupt in KMS is that you notify the core, and the core will call into a hook in your driver called detect, which allows the core to get the state of the display. So in this case, in detect, if we see that we have a mode currently running that no one changed, that requires scrambling, and we just went through a reconnection, then we just send the scrambling configuration again, and it's fixed. Except it wasn't, at least not entirely: this was working with some applications, like the framebuffer console, but it wasn't working with Kodi, for example. The reason was that the way we were notifying the core in the interrupt handler, that we had an interrupt on the hotplug line, was basically just sending an event to user space that the display had been hotplugged, but it was never actually calling the detect hook I was mentioning, because we misunderstood the helper we were supposed to be using. So it was working with applications that were able to handle those notifications from the kernel, because the
first thing they do when they get notified that something changed is to ask the kernel what exactly changed, and in that case detect would have been called. But Kodi doesn't have support for those notifications, so Kodi just assumed that everything was going the way it used to, and it wasn't working. To fix this one, we had to change to a helper that did call detect, and in detect, since the previous change, we had a way to notify the display that the scrambler had to be on. So it was working, until last week, when apparently this last change created a deadlock: when you're waking up a TV using a CEC command, the kernel just deadlocks. I'm not really sure why at this point, and I haven't fixed it yet, but it just shows the huge number of side effects, interactions and misunderstandings that can happen out of not reading the documentation well enough, or documentation missing entirely, things that create a bug while you're just fixing another one, which is a nightmare. These issues are actually fairly standard; I mean, having an HDMI output where you run HDMI CEC and a 4K mode is fairly common these days, and it's really the kind of thing that should be reported by some kind of CI, because you don't want to rely on people noticing; you want to catch these bugs as soon as possible. And we basically have two options for CI. Freedesktop is currently trying to move to GitLab; it did so for Mesa, and they are discussing it for the kernel, but it's only discussion at this point, so it could become a thing, but it's not an option at the moment. Then you have KernelCI, which works really well: there are a lot of boot tests, and more advanced tests, for performance regressions or networking for example, but unfortunately there aren't any tests for the display itself running in KernelCI, so that's not really an option either. And more importantly,
even if we have a CI infrastructure, which tools should we run in it? There is a test suite called IGT GPU Tools that was initially started by Intel and then made generic so that everyone can use it. It's really extensive: it covers both display and rendering, so OpenGL, for both generic and vendor-specific features. There is now a policy in KMS to make sure that every new feature or API is merged with an IGT test, which is great; it only makes it more extensive. There are around 2,000 tests, it's well maintained, there are tests and documentation for pretty much everything, and it serves as documentation itself, so it's really nice. Except that it has a few issues. The first one is that it was started as something for desktops, and it shows, especially when it comes to how you install it on a system. It requires a fairly big number of dependencies, including big ones like Cairo or Pixman, so if you want to compile it, it pretty much requires a real distribution. You have to use something like Debian or Fedora; you can't use a smaller-footprint distribution or hook it into a build system like Yocto or Buildroot without a huge amount of work. You can't really cross-compile it either, because that would require cross-compiling a huge number of dependencies, or having them in your toolchain, which is usually not the case. There's been a push recently to provide Docker images, which kind of helps: it reduces the friction when you need to set up a system to compile IGT, but it doesn't really reduce the footprint of IGT itself. If you build IGT with the minimal Debian Dockerfile, the smallest you can get, it still takes around 700 megabytes, which is fairly high for an embedded system. So if we look at the typical embedded devices, some of them supported by
KMS drivers merged at this moment, and take the lowest common denominator, we have one CPU, which is pretty good, something like 64 megabytes of RAM, maybe no GPU, and something like 128 megabytes of flash storage. So it's fairly small, and you can't really write to it easily, so you can't compile on it; you might not even have a network. And chances are you don't have HDMI or DisplayPort, but rather a panel interface, things like MIPI DPI, MIPI DSI, or LVDS. So there's kind of a culture clash here, and you can't really expect IGT to run on those devices. Another issue is that, like I said, there are something like 2,000 tests. Fortunately for us, IGT supports what are called test lists, which are basically lists, published inside the IGT source code, of the tests that are supposed to be run on a given platform. There are only a few users of those test lists, basically Intel and the Raspberry Pi, and the Raspberry Pi test list is really minimal: it basically just tests the vendor-specific properties and interfaces, and that's it. So it's fairly hard, when you get into IGT and don't really know how it works, to know what you want to run on a given platform, even more so since you can't just run all the tests, as some are really slow, so it would take a long time and wouldn't really work for CI. So you don't have a test list that captures what's expected of every driver, as in: you submit a new driver, and all the tests on that hypothetical test list must pass for your driver to be merged. Then you have the features provided by IGT itself. It mostly tests the user-space API and the driver behavior, so it will try, from user space, to change a mode or get notified of something, for example, and make sure that the outcome is the expected one, but it doesn't really help when the driver itself doesn't know that something is
wrong. For example, let's take the scrambling issue we were mentioning: when the scrambler is off on the display side, the HDMI controller and its driver don't have any way to tell that the output is completely blank on the display. For all the driver knows, everything is working great; it sends its frames and everything is fine. In this case, IGT wouldn't be able to detect it, and that case is actually fairly common: most panel interfaces, for example, are one-way interfaces, so you just send data and you don't get anything back. As long as you are doing an okay job on the driver side, the hardware will be fine, even though you completely misconfigured the timings, for example, and the display doesn't show anything. So yeah, IGT doesn't really test the output; it tests everything from user space. There are two mechanisms that help mitigate that. The first one is VKMS, which is a KMS driver that is virtual, as its name implies. IGT can use VKMS to test the core in depth and make sure that, for a given commit, it has the proper output, since it can read the output back. That's really nice to test the core itself, but you are not able to test the drivers this way. Then there is writeback, a feature that some controllers have, which basically allows taking the output before it's sent to the interface itself and putting it into a buffer somewhere in memory, so you can get the output back as well. Except that that output is usually taken before the interface controller's driver runs; in the HDMI case, it would be taken before the HDMI encoder. So if your driver is misconfiguring something, writeback would look fine even though the display would not be. It definitely helps, but it's really more a test of whether the driver behaves properly and reports errors, not of whether the output is actually sane. There's a device that can help here as well, which is called the
Chamelium. It's been made by Google for Chrome OS, and it's basically a big board that connects to the other end of some interface and is able to capture frames, compute their CRCs, and send them over the network if you ask for them, so it's great to test the output. There are huge downsides, though. The first one is that it's very expensive and difficult to source: only Google produces them, so you need to get them from Google, and it's fairly expensive to set up a CI farm using them. It only tests a limited number of inputs, only HDMI, DisplayPort and VGA, which is already great, but it doesn't allow testing panel interfaces, for example. It requires a network connection, and like we said, some embedded SoCs don't even have a network controller. And it's fairly difficult to extend: it's running on an FPGA, so extending it means writing HDL, and the tests are fairly minimal, due to that difficulty to extend, I assume. So it's not ideal. We basically have huge blind spots in IGT: it's fairly difficult to set up for part-time developers, it's impossible to deploy on some platforms or devices, and we don't really have a convenient way to tell that the hardware doesn't output anything, or at least doesn't output something that is going to be displayed, for example with the scrambling issue I was telling you about, or with unidirectional interfaces like RGB or LVDS panels. So if we could do something in an ideal world, or at least my ideal world, we would need a tool that can be deployed easily on any platform supported by KMS, which is actually a fairly standard set of platforms, x86, ARM and MIPS I think, with a size that is reasonable for those platforms, so something in the 10-megabyte range, something that would fit into a system with less than 100 megabytes. Something that can run without network, that can be easily cross-compiled, and so can be put fairly easily on any platform, that doesn't have any ramp-up time and doesn't need any internal
knowledge of the tool to be able to do its job, and that optionally can test the driver's output with cheap hardware. So I came up during the last year with at least one possible solution to that problem, and this is the plan I have in mind. We still need IGT, more than ever: it's still the full test suite, it's definitely where all the tests should be, and if you want a full test of everything, it's there; we should definitely keep it. But in addition to IGT, we need a smaller tool which will be, like I said, easy to deploy, and that all the KMS drivers are expected to pass. The V4L2 and CEC frameworks have similar tools, called v4l2-compliance and cec-compliance, and the expectation there is that whenever you submit a new V4L2 or CEC driver, the first thing the maintainers will ask is that all the v4l2-compliance and cec-compliance tests pass; this is something we should aim for. The tool should also be able to test the output using relatively cheap hardware, so something like $100, without a network, and test multiple interfaces, including internal ones, things like LVDS, DSI and those kinds of interfaces. So I worked on a solution that is basically three components: the first one is a tool that runs on the device under test; the second is an optional board to capture the device under test's output; and the third is a tool that runs on that board and captures and processes all the frames output by the device under test. The tool running on the device under test, in order to be easy to deploy and still fairly compact, is actually a Rust application, which has some great benefits. The first one is that it's statically compiled with all its dependencies; the only dependency it has is on the C library, and it can accommodate any major C library, so you can, for
example, compile it against musl if you want to. Rust has a number of other benefits here: it doesn't require any runtime either, so it's basically just a binary; you run it and it's done. I had to write some libraries to be able to test KMS from a Rust application using pure Rust, but that works now. It's an application that displays a frame on a KMS output using atomic mode setting, so the newest version of the KMS API, and at the moment the application just displays something and performs a lot of commits: it takes a PNG file, scales it up, and processes it in a way we're going to see later. It takes 4 megabytes, which is very compact and definitely in the range we want to be; and that's with the default build parameters, so if we really want to scale the size down, we can. That application would also run all the local tests on the device, if need be. Then we have a board to go with it, a prototype board, which is built around the observation that bridges to MIPI-CSI exist for pretty much any interface, and that MIPI-CSI capture is fairly common across embedded boards these days. So the basic idea is that we would have some base board with the SoC, which would run the capture application and the system to process all the frames, and then, depending on the output you want to test, you would use a different bridge to MIPI-CSI, for example HDMI to MIPI-CSI, which is what I prototyped with. You would just swap those bridges and the rest would stay the same. It's fairly easy to develop, since you just need an additional board, and you can get those boards made for cheap by a service like OSH Park. Those bridges are available for pretty much any interface, although for some reason I couldn't find any for MIPI-DSI, which is supposed to be fairly similar to MIPI-CSI, but who knows. And those bridges can be bought even in
small quantities, so it's something that should be fairly easy to set up. The prototype is basically based on a Raspberry Pi, both Pi 3s and Pi 4s, and a Toshiba HDMI to MIPI-CSI bridge. Finally, on the capture side, it runs, once again, a Rust V4L2 application, and it will run a configurable test scenario. During that scenario, it will parse the test scenario, set up the bridge, and set up the EDID for the mode we want to test on the capture interface. Once the EDID has been set up, the expectation is that the device under test will start sending data; that data will be captured through V4L2 and validated by the capture application, which will eventually report whether or not the frames are valid. A frame is considered valid based on a few criteria: every frame sent by the device under test contains a header within the pixels, and that header contains both a counter and a hash. The frame validation, at the moment, consists of making sure that the frames are in order, so the counter is never decreasing, and that the hash is correct. One of the major issues that could have arisen is that hashing a frame would take too long, but it turns out that the Rust V4L2 application takes around 5 milliseconds to process a 1080p frame, and at 60Hz we have a budget of around 16 milliseconds, so we have a lot of headroom. So this approach works; however, there are a few limitations. The first one is that the EDID setup will trigger an HPD pulse of around 100 milliseconds, so if we rely on the polling hotplug detection that we saw in KMS before, we won't be able to detect it and we won't be able to switch resolutions, which is bad; it only works with devices that can rely on interrupt-based hotplug detection. The second one is that the validation is based on a hash, which is fairly fragile, and we won't be able to test some features; for example, we won't be able to test color-space
conversions, because we expect the frame to be unmodified between the time we emit it in the application and the time we receive it on the capture side. We also don't really have a way to send parameters from the capture board to the device under test. We could, for example, imagine setting up a special vendor EDID that would contain some parameters, like "run the display for 10 seconds, then disable it, wait for 20 seconds, and start the display again"; that way we would expect the device under test to parse the EDID, but we could also imagine other solutions. The final limitation is that both MIPI-CSI bridges and boards that can capture 4K over MIPI-CSI are really rare at the moment, so for now we will only be able to test 1080p at most, which is already great, but it's a bit of an issue as well. Ideally we should also add other features, like being able to integrate into CI environments, add more local tests on the device under test, validate a few other things, like that the audio and CEC support are sane, and that the infoframes, for example on HDMI, which are out-of-band information sent by the emitter, are correct for what we are supposed to have, and finally support other interfaces. So thank you for listening, I hope it was nice. I'm really interested in getting any feedback as well: if you have been waiting for something similar to happen, or if you have different use cases we would need to consider, just let me know. I'm here on chat, so let's start the discussion. Thank you.
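To make the validation criteria described above concrete, here's a minimal sketch of the capture-side check: each frame embeds a counter and a hash of its payload in a header, and the validator checks that the hash matches and that counters never go backwards. The header layout and the choice of FNV-1a as the hash are assumptions for this sketch; the actual tool may differ.

```rust
// Sketch of the capture-side frame validation described in the talk:
// each frame embeds a counter and a hash of its payload; the capture
// application checks the hash and that counters are never decreasing.
// The 12-byte header layout and FNV-1a hash are assumptions.

fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

// Build a frame: 4-byte counter + 8-byte hash, followed by the pixels.
fn make_frame(counter: u32, payload: &[u8]) -> Vec<u8> {
    let mut f = Vec::with_capacity(12 + payload.len());
    f.extend_from_slice(&counter.to_le_bytes());
    f.extend_from_slice(&fnv1a(payload).to_le_bytes());
    f.extend_from_slice(payload);
    f
}

// Validate a captured frame against the last counter we saw.
fn validate(frame: &[u8], last_counter: Option<u32>) -> Result<u32, &'static str> {
    if frame.len() < 12 {
        return Err("frame too short");
    }
    let counter = u32::from_le_bytes(frame[0..4].try_into().unwrap());
    let hash = u64::from_le_bytes(frame[4..12].try_into().unwrap());
    if fnv1a(&frame[12..]) != hash {
        return Err("corrupted payload");
    }
    if let Some(last) = last_counter {
        if counter <= last {
            return Err("frame out of order");
        }
    }
    Ok(counter)
}

fn main() {
    let f1 = make_frame(1, b"frame one pixels");
    let f2 = make_frame(2, b"frame two pixels");

    let c1 = validate(&f1, None).unwrap();
    let c2 = validate(&f2, Some(c1)).unwrap();
    assert_eq!((c1, c2), (1, 2));

    // Replaying an old frame is caught by the counter check...
    assert!(validate(&f1, Some(c2)).is_err());
    // ...and bit flips in the payload are caught by the hash.
    let mut bad = f2.clone();
    let n = bad.len();
    bad[n - 1] ^= 0xff;
    assert!(validate(&bad, Some(c1)).is_err());
    println!("validation ok");
}
```

This also shows why the scheme is fragile against color-space conversions, as noted above: any legitimate transformation of the pixels would change the hash just as a corruption would.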