Hello everyone, I am André Almeida. I work for the open-source consultancy Igalia, and I'm here today to talk about GPU resets. I'm a kernel developer working on the Steam Deck, and the Steam Deck is the source of this work. So this might have happened to you: you're playing your game on Linux and something wrong is sent to the device. When you're playing something on Linux there are many layers of translation and drivers, so it's very easy for something to go wrong in the stack. Then the GPU hangs and it's over for you: you just have a black screen and you need to reboot your machine. But in some cases, if you're lucky enough, the system will recover by itself.

This happens because modern GPUs are really complex, very, very complex. If you look at the numbers for the latest AMD GPU, there are a lot of things happening in parallel, a lot of compute units, a lot of transistors, and shaders are Turing-complete. That means you can't guarantee that shader code will terminate, that it won't loop forever, and so on. So yes, GPUs are very complex nowadays. For instance, this diagram is just the display unit of an AMD GPU, and you can see there are a lot of components just for the display part. If you have an infinite loop in a normal CPU program, that's not so bad, because we have abstractions on the processor and the memory, and you can just kill the app without killing your entire system. But if this happens on the GPU, that infinite loop can hang the display engine and you won't be able to update the display again. So you really need to get in there, kill the process, and reset the GPU. Now I'm going to show how we detect GPU resets, from the hardware until it reaches the application.
So basically, in DRM, the Linux kernel drivers submit jobs to the device and then later check whether the memory fences have been reached or the timeout has expired. If the kernel driver finds that a job is taking too long to run, it might be stuck. You can't even be sure of that, but you assume it is stuck and then take some action. So the driver does a GPU reset, and we have a lot of different reset types on the GPU: softer ones that only reset the engine that is stuck, all the way up to resetting the full device, and the more complete resets are more destructive. You can lose all the VRAM of the GPU, everything that all the applications had built so far. So this is how we do it: the driver checks what's going on and tries the least destructive reset available.

Okay, so now that the kernel knows a reset happened, we need to tell the user-space mode driver that something went wrong, but DRM has no API for that. If you look at the kernel side, you see that each vendor implements something different: Intel, AMD, and Freedreno each have different ioctls for this. But this is not really hardware-specific, because if you read the code of all those operations, they do things that are very similar, just in different ways, and they are not very complete. For instance, amdgpu has a version 2, but we might need a version 3; in the end, I would like to have a standard for this. Okay, so now we've told Mesa that something bad is going on, and Mesa needs to report that to the application. The graphics APIs have ways to tell the application. For instance, in Vulkan we have the device-lost error, and this error is kind of generic: it's not just about GPU resets, it can also mean, for instance, hot-unplugging.
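The detection logic described above can be sketched in a few lines. This is a minimal illustration, not real driver code: all the names (`check_job`, `reset_type`, the fields of `struct job`) are invented for this sketch. A job whose fence is still unsignaled after the driver's timeout is assumed stuck, and the driver prefers the least destructive reset that is available.

```c
#include <stdbool.h>

/* Hypothetical reset levels, from least to most destructive. */
enum reset_type { RESET_NONE, RESET_ENGINE, RESET_FULL };

struct job {
    bool fence_signaled;   /* has the GPU reached this job's fence? */
    unsigned submit_ms;    /* when the job was submitted */
};

/* Decide whether a job looks hung and, if so, which reset to try.
 * You can never be *sure* the job is stuck; the timeout is a heuristic. */
enum reset_type check_job(const struct job *j, unsigned now_ms,
                          unsigned timeout_ms, bool engine_reset_available)
{
    if (j->fence_signaled || now_ms - j->submit_ms < timeout_ms)
        return RESET_NONE;            /* still making progress */
    /* Job looks stuck: prefer a soft per-engine reset; only fall back
     * to a full (VRAM-destroying) device reset when that is all we have. */
    return engine_reset_available ? RESET_ENGINE : RESET_FULL;
}
```

Real drivers layer this on top of the DRM scheduler's timeout handling, but the escalation idea, soft reset first, full reset as a last resort, is the same.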
So if the application gets the device-lost error, it should assume there is no device to rely on, so it needs to recreate all its contexts. For OpenGL we have the robustness extension, and on the Mesa side, if Mesa sees that the app hasn't enabled robustness, it has to kill the app on a reset: since the app can't check whether the GPU was lost, Mesa must assume it can't recreate the context. Once the application gets the message that the GPU was reset or lost, it can recreate the context, and the user-space graphics stack will work again.

As I said before, we don't have a standard API for DRM to tell Mesa that something bad happened on the GPU, and one thing I'm trying to propose is an ioctl for that, for instance a DRM "get reset state" operation, something like that. That way, if you want to implement a new GPU driver in the Linux kernel, you just need to implement that, and on the Mesa side we can probably put this in common code, like in Gallium. Then every time a GPU resets, the kernel knows what to say to Mesa and Mesa knows what to expect from the kernel, which makes life easier for both Mesa and kernel developers. I'm also working on documentation in the kernel to explain what DRM drivers should do when a reset happens and what the Mesa drivers should do as well, because as you can see, for now it's ad hoc, everybody does it in a different way, and it would be way better to have this standardized. As I said, each vendor reacts differently to resets. Since I work on the Steam Deck, my focus was on amdgpu, and it was very unreliable: for some resets, even the most basic one would crash the whole stack and leave just a black, unresponsive screen.
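The application-side policy described above can be sketched like this. The enum values and function names here are invented stand-ins; a real OpenGL app would query `glGetGraphicsResetStatusARB` from GL_ARB_robustness, and a real Vulkan app would react to `VK_ERROR_DEVICE_LOST`. The point is the decision, not the API: once a reset is reported, nothing from the old context can be trusted, and an app that cannot even observe the reset has to be killed.

```c
#include <stdbool.h>

/* Invented status codes standing in for the driver's reset report. */
enum gpu_status { GPU_OK, GPU_RESET, GPU_DEVICE_LOST };

enum app_action { APP_CONTINUE, APP_RECREATE_CONTEXT, APP_EXIT };

/* What should the user-space stack do for a given reset status? */
enum app_action handle_status(enum gpu_status s, bool supports_robustness)
{
    if (s == GPU_OK)
        return APP_CONTINUE;
    /* Without robustness the app has no way to see the reset, so the
     * driver's only safe option is to terminate it. */
    if (!supports_robustness)
        return APP_EXIT;
    /* Device lost or reset: rebuild the context from scratch, then
     * resume; no resource from the old context may be reused. */
    return APP_RECREATE_CONTEXT;
}
```

In a game loop, this check would run after each frame's submission, and `APP_RECREATE_CONTEXT` would trigger re-creation of the device, queues, and all GPU resources.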
If you're lucky, you can access the machine remotely and figure out what happened. Pierre-Eric from AMD and I fixed that for KDE: RadeonSI, the Mesa driver for OpenGL, wasn't following the OpenGL robustness spec, and the fix was very small, something like five lines of code in the end. So it's clear that we need more testing for robustness, both in Mesa and in the kernel. One interesting thing is that other operating systems, like Windows, are more reliable at this, but that's because they have a lot of control over the graphics stack: they have one compositor, and they have APIs for this on both the user-mode and kernel sides. So they have much more control and it's more reliable. But if we standardize, we can be as reliable as they are.

Another thing we need to fix about GPU resets on Linux is not only telling user space that a GPU reset has happened, but also telling user space what triggered that reset in the first place. Right now, if your game crashes, you go to GitLab and say "hey, my game crashed" and attach lines and lines and lines of logs, and the developers have a hard time figuring out, from all those lines, what caused the hang in the first place. You have a lot of information, but no context about what the GPU was running when it crashed. This is something I'm also working on to try to make better. GPU hangs have two main sources on AMD: hardware settings, because if you change the voltage or the frequency of the GPU in a bad way, it can hang the GPU, and of course application errors, like infinite loops. Right now there's no way to distinguish one from the other, and you can see that this is really bad, because if you submit a bug report for a game, it might not be the game itself that crashed it, but something else that decided to change the frequency of the GPU and hung it.

About this reporting of GPU hangs: ideally, it would be very nice to have some way to deploy it with no overhead, so we could ship it to all Steam Deck users and, automatically, every time a game crashes, get the context of what crashed it and send that to the developers so they can figure it out. One thing I worked on was a new amdgpu info operation to capture data about what hung the GPU: I was capturing the address of the buffer that was running on the GPU at that time. Of course this callback needs to be platform-specific, because I was reading hardware registers, but it's not very reliable: during a GPU hang you don't know what the firmware will do, and we don't have a lot of control on that side. So the challenge is how to get the right information in a reliable way, how to put the GPU into a debug mode. The debug mode we have right now in Mesa sets a lot of fences, a lot of barriers, a lot of extra information, which on one hand adds overhead to the game, and on the other hand, given that it serializes the buffers in a different way, can even make the bug we're trying to find disappear, because the application now runs differently from how it ran outside the debug mode. The debug mode we have right now can't be deployed to all users, and it's not really reliable.

So basically, the roadmap to better GPU resets on the Linux stack is: a standardization of how DRM reports GPU hangs to user space; a standardization of how user-mode drivers deal with that, and of what the compositor should do after a reset; and better hang logs, to somehow show which buffer caused the hang, because as I said before, the GPU firmware can have total control of the device memory, so we can't be sure of what's going on there and we can't rely on the registers. There is also an API called devcoredump in the Linux kernel that you can use to dump something when a module crashes, but it's not widely used right now. If every GPU vendor started using it, we would have a standard way to produce a text file to send to distros and developers. And of course we need tests, a lot of tests, IGT tests, KUnit tests, all sorts of tests, to check from bottom to top that GPU reset is working. Of course, a GPU reset is not so easy to test, because the GPU can hang, so it's not easy to get this into a CI. So yeah, that was it from my side. Here are some links; if you download the presentation, you can see the work we have been doing on this, both in the Linux kernel and in Mesa, to make it more reliable. This is my email address, and we are hiring at Igalia. That's it from my side. Do we have questions? The microphone is over there.

Q: I heard you saying the GPU hung while playing video games, so I assume it's graphical computation. But GPUs are not just used for games anymore; AI training, for instance, uses GPUs heavily. Does your solution apply to those scenarios?

A: Sure. So the question was whether this work is just about games or also about other, compute uses of the GPU.

Q: Right, because I saw there was some DRM stuff in the context. I don't know if that's specific to graphics or more general.

A: No, it's not really specific to graphics: any kind of GPU work can hang the GPU, and the API I'm proposing is generic; it just tells user space that something bad has happened to the GPU. As I said, I'm more familiar with the gaming use case, and I don't know how common it is for AI or compute workloads to hang the GPU, but it can happen as well. So yes, this DRM interface is generic; you just need user space to use it, check it, and take some action when it gets the message that the GPU was reset. And given that Vulkan is also used for compute, I think the device-lost error code makes sense in that scenario as well. So yes, I think it would be very similar for both games and compute workloads.

Q: All right, thank you. Are you suggesting that, today, you are able to detect it and then recover, or are you just saying there's a problem and you're trying to find a solution for it?

A: We can detect it, and right now we can recover from the GPU hang in most cases. You just need to tell user space that something wrong has happened, and the applications recreate all their contexts and start working again. That part is working well right now; I just want to have a standard API for it. The part that is not working right now is telling user space what hung the GPU: this was the buffer in use, this was the offset, this was the instruction the GPU was running when it hung. That's the challenge I'm attacking right now.

Q: So in the best case, when you recover, you're still not going to be able to hide it from the user; they're still going to notice something happening on the screen?

A: Yes. Right now, if your GPU hangs, the screen will freeze, and then a timeout will happen. This depends on your setup, but it's usually less than 30 seconds of everything being frozen, and then, boom, everything recreates and you get back whatever you were using. But the hung application probably won't be there anymore; it will have been killed.

Q: Cool, thank you.

Okay, I think that's it. Thank you very much.