Hello and welcome to my talk addressing Wayland robustness. My name is David Edmundson. I'm employed by Blue Systems. I've been working on Plasma for over 10 years, and over the last few years I've been trying to get more involved in the Wayland ecosystem, not just through Plasma but through our entire stack elsewhere.
So, Wayland. Wayland is one of the biggest transitions we've faced as the Linux desktop. It's a lot of changes in lots of different areas, and it's important that we do a good job. It's not important that we do it quickly; it's important that we do it very well. So what's the current state of robustness, and what do I mean by robustness? By robustness I mean being able to handle errors.
So I'm going to cut across to a video of what we see currently. Here's my laptop, running a compositor. I've got a text editor going, I've spent ages writing a document, and we're going to mimic what would happen if the compositor were to crash. Pretty underwhelming. Not only has my text editor been closed, it's been killed, because we've switched back to the login manager, so all my content has been lost.
As you can see, that was somewhat underwhelming. We don't see the same thing if PulseAudio crashes, and whenever we're doing any application development inside Plasma we try to build in a sense of robustness in handling these errors: if PowerDevil crashes, Plasma Shell just reconnects if and when it comes back.
So now we're going to see what happens on my laptop with a heavily patched stack throughout. Let's reproduce that scenario, but on my patched Plasma session. I have KDevelop open, which is an absolutely massive monolithic QWidget application, and I've got my text editor with some unsaved changes to mimic what a programmer is doing all day.
I've got System Settings open, which is in Qt Quick with multiple surfaces involved, and I've opened this to mimic what our users apparently spend all day adjusting. And I've got SuperTuxKart open to mimic what project managers spend all day doing. So now let's reproduce that same crash that we had before. We're going to pretend that KWin had an accidental crash and, as you can see, everything has restored, or is about to restore. Come on, Plasma. And everything has restored exactly as it was before, with absolutely no data loss whatsoever. My unsaved changes are still here, and you can still see my characters spinning around in SuperTuxKart; I could be playing a level and not even lose first place.
So what did we just see? We saw that when KWin was induced quite aggressively into a crash, everything was able to resume exactly where it left off. If I'd left the audio running, you would have heard that SuperTuxKart never skipped a single beat in its little jaunty jingle. We saw that all my unsaved changes in KDevelop were exactly where I left them, even down to the cursor position, because from a client point of view nothing really happened. There were some small internal reconnections, but that's it. And it's worth mentioning that I didn't touch any of the code in KDevelop or SuperTuxKart or Konsole or any of the client applications. We only touched toolkits and shared library code.
So I want to step back to what problems this is all going to solve, because that's going to help explain some of the design decisions later. So the obvious one: crash handling. And I do like the fact that this person's first response to nearly dying was to upload a picture to Wikimedia under a shared license. That's very committed. Crash handling is the most obvious, most user-facing part that we're going to see fairly early on. And I do want to stress this isn't a KWin problem. I'm not going to claim any other compositors are better or worse, only that they all crash.
On this slide you can see some screenshots from Ubuntu's automatic crash reporting website. There's KWin on the left, and Mutter, and Sway. You can't read the text here, but each row represents a completely different, unique backtrace. So everything crashes, and this goes on for pages and pages. We've just got the first screen in here, because that's the reality of software, particularly for something as difficult as a compositor.
So how did X11 solve it? There are two very important parts. Firstly, it never really did. I'm of an era where I remember X11 crashing quite regularly, to the point that it even shipped by default with a shortcut to restart it because it locked up relatively often. Often enough that it was decided a shortcut was warranted. Ultimately this has gone away, but only because it's been feature frozen for 10 years. And Wayland is not in a position where we can be feature frozen. Partly because the specs are still being added to, to reach feature parity. Partly because a lot of people are seeing the fact that the stack's open for changes again, and there are a lot of pent-up ideas and experimentation that people want to push. So we're not going to be able to be feature frozen for a while.
Also, X11 did quite a good job of delegating responsibility to other processes, mostly because it had a somewhat lacking security model. What we're seeing is that, because we're trying to introduce a security model, the compositor process is adopting more and more work, more and more responsibilities. It's touching so many different libraries, so many different ways of doing things to reach the same results. I don't think this is necessarily a bad thing, but it is resulting in quite large applications to make the functional compositor that we need to run a desktop.
Also, inside KWin, and I'm sure it's true of other compositors, we have end-user styling, we have end-user scripts, we have themes and all kinds of plugins and interaction, and with graphical drawing, especially on OpenGL, you get problems. We can never just hope that it all goes away.
But it's not just crashes. I do want to stress that. That might have been one of the motivations, but it's certainly not it alone. The developer experience is also quite frustrating. My personal KWin never crashes, because if it did, I would have been able to reproduce it and I would have been able to fix it. So, generally speaking, people ask me about some bug that's been introduced in master, and I won't be able to reproduce it, and we get into a slightly awkward state. And if I've got an uptime of several months, I'm not dogfooding what we're releasing. It becomes very problematic. The whole testing and developer experience is quite messy, because I can run things in a nested session, I can run things against autotests, but nothing is quite the same experience as actually having it on your desktop as something you're working against.
Another quite interesting feature is the idea of compositor handoffs. What do I mean by that? Being able to start an application on Waypipe, which is like SSH X forwarding to a remote server, and then move it to your local session, or vice versa. Or having a multi-head setup where you have different compositors on different screens, and being able to move apps between them. This would get us to an incredibly improved state compared to what we see on X11. It's an opportunity to do something better.
Another interesting feature this unlocks (I'm not going to claim it solves it completely, but it unlocks it) is checkpoint/restore in userspace. Catchy name, catchy logo. So, CRIU, what is it? It allows us to take a process, or a bundle of processes, and just freeze it to disk.
So it's like closing a laptop lid on a per-application basis. If we end up memory constrained, or if you want to send something to a developer to debug, or just move things between machines, potentially we've got a lot of opportunity here. The website makes it sound amazing until you read this slide, where it says, on the CRIU website and in quite big writing, that it doesn't work for X11 because of the very nature of X11. And for all the reasons it doesn't work on X11, it won't work with Wayland as Wayland is currently written. With my changes, we unlock this problem, and then hopefully we might see this on the desktop, or potentially on Plasma Mobile, where memory is even more constrained. It could be very exciting.
So how have we done this? The first step was to have a way of keeping a session alive when KWin crashes. If we go through a naive approach, KWin crashes, and Plasma Shell and everything else crashes at the same time. If all of the crash handlers kick in at once, everything becomes a very racy experience. Plasma Shell could try to start before KWin has recreated the wayland-0 socket that it's trying to connect to, and then we're in a very sticky situation. We also introduce a potential security hole: if KWin goes down, some other rogue compositor could come in and create a socket with the same name, and then all the other clients would try to reconnect, but to this other rogue process, and we don't want that.
So we take the Wayland socket on the file system (it's not really on disk, but it is on the file system), the Wayland display socket file, and we keep it alive. It's owned by a helper process, which starts up before KWin. It creates the socket on disk, sets a bunch of environment variables, and then spawns KWin with access to the file descriptor. If KWin crashes, the socket remains. It's only cleaned up when everything exits gracefully.
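As a rough illustration of the helper arrangement described above, here is a minimal Python sketch. It is not the actual Plasma helper: the socket name `wayland-demo`, the function names, and the single-process layout are assumptions made for the example, and the "compositor" is simulated by simply accepting on the listening socket the helper owns. It shows that a client can connect before anything is accepting, and that the socket file outlives a dropped connection.

```python
import os
import socket
import tempfile


def create_listening_socket(path):
    """Helper's job: create and own the listening socket for the session."""
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)   # creates the socket file on the file system
    listener.listen(8)    # pending clients queue up in the kernel
    return listener


def run_demo():
    events = []
    with tempfile.TemporaryDirectory() as runtime_dir:
        path = os.path.join(runtime_dir, "wayland-demo")  # hypothetical name
        listener = create_listening_socket(path)

        # A client connects before any compositor is accepting.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        client.sendall(b"hello")

        # The first "compositor" accepts on the descriptor it was handed.
        conn, _ = listener.accept()
        events.append(conn.recv(16))

        # Simulated crash: the per-client connection goes away...
        conn.close()
        events.append(client.recv(16))  # b"" means the peer has gone
        client.close()
        # ...but the helper still owns the listener, so the file remains.
        events.append(os.path.exists(path))

        # A restarted "compositor" accepts the reconnecting client.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        client.sendall(b"again")
        conn, _ = listener.accept()
        events.append(conn.recv(16))

        conn.close()
        client.close()
        listener.close()
        os.unlink(path)  # graceful exit: the helper removes the socket file
    return events
```

In a real session the helper would hand the listening descriptor to a separate compositor process rather than accept on it itself; the sketch only demonstrates why keeping the listener in a longer-lived owner makes restarts race-free.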
If a client tries to connect before KWin has started, the client will just pause during that initial socket connection, waiting for KWin to accept the connection and create a new file descriptor between them. This is completely race-free, and potentially even improves startup speed, because now we're able to start KWin and start Plasma Shell at the same time, and there's no race between them. Plasma Shell can do all of its linking, can do all of its heavy loading, and only block at the point where it actually needs to block. So there's a lot of potential here. It's very similar to systemd socket activation, if you're familiar with that, except we did it ourselves locally for various reasons.
Everything remains secure. You can't just replace this file from another process; you'd have to be able to meddle with the helper. Even the lock screen is secure. If KWin crashes while the screen is locked, then when KWin restores it checks the current state in logind, which has a Boolean cached for whether this session is locked or not, and the first frame KWin renders after this is the lock screen again.
And importantly for clients, we can distinguish KWin crashing from KWin closing. If the connection goes down but the file remains on the file system, then we know KWin has crashed, and we can reconnect. If the connection goes down and the file doesn't exist, we know KWin has exited, and we should basically just exit.
All of that part has been merged. It landed in Plasma 5.21, maybe even earlier, and you're probably running it now. And the experience, I think, is better than what we had before. If you log in and get a crash, you will now see your screen flooded with DrKonqi dialogs saying there's been an issue, my client has closed. But the desktop comes back, your panel comes back, and you can then just relaunch everything. Good, but nowhere near good enough. So the next step, a slightly higher step: making the clients survive.
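The crash-versus-exit rule described above is small enough to sketch directly. This is an illustrative Python fragment, not code from any toolkit; the function name and return values are made up for the example.

```python
import os


def action_after_disconnect(socket_path):
    """Decide what a client should do once its compositor connection drops.

    If the socket file is still on the file system, the helper is still
    alive, so the compositor is assumed to have crashed and to be coming
    back. If the file is gone, the session ended deliberately.
    """
    if os.path.exists(socket_path):
        return "reconnect"
    return "exit"
```

A client would call something like this from its disconnect handler, passing the path it originally connected to.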
So I need to explain how Wayland works, in an incredibly oversimplified way. Two processes just scream data at each other, and both sides keep a cache of what the other side thinks the state is. There's no call and response; it's always just streams of information. This is why we can't just keep the socket alive. After we reconnect, we have to resend the data. Otherwise, if the client just keeps the connection alive and then says, please attach buffer number six to surface number three, the compositor will respond with, what on earth is buffer six? And then it will just kill the client, which wouldn't really be what we want. So we need to resend everything.
The other important thing to know about Wayland is that everything is asynchronous, because they're just screaming data at each other. The compositor always has the final say. If a client says, I would like to grab this, at some point the compositor can just turn around and say, no, go away. And because of this, clients already have to have code to react to this, which makes doing reconnection a lot easier. We can just pretend that the compositor stopped everything, and just re-request everything.
And most importantly, all memory allocations are in the client. If I want to send a buffer, a picture of my contents, to the compositor, I create a space in shared memory. I create a file descriptor pointing to that space in shared memory. And as a client, I send that file descriptor to the compositor. The compositor is not creating any of the space. If the compositor goes down, the client can still read everything that it had before. And I think this is one of the big reasons why this wasn't possible in X11. If you tried to introduce this in X11, you would have so many cases where you're trying to perform a round trip, or you're accessing some data you've asked X11 to hold, some structures it has, and it just wouldn't work. But here, the client has it.
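The "what on earth is buffer six?" problem and its fix can be shown with a toy model. The sketch below is plain Python with a made-up `FakeCompositor`; it does not speak the real Wayland wire protocol, and all class and method names are hypothetical. The point it illustrates is that the client owns both the pixel data and a record of what it has told the compositor, so it can replay everything to a fresh instance.

```python
class ProtocolError(Exception):
    """Raised by the stand-in compositor for requests naming unknown objects."""


class FakeCompositor:
    """Stand-in for a compositor; a new instance starts with no client state."""

    def __init__(self):
        self.buffers = {}   # buffer id -> contents supplied by the client
        self.surfaces = {}  # surface id -> attached buffer id (or None)

    def create_buffer(self, buffer_id, contents):
        self.buffers[buffer_id] = contents

    def create_surface(self, surface_id):
        self.surfaces[surface_id] = None

    def attach(self, surface_id, buffer_id):
        if surface_id not in self.surfaces or buffer_id not in self.buffers:
            raise ProtocolError(
                f"unknown object in attach({surface_id}, {buffer_id})")
        self.surfaces[surface_id] = buffer_id


class Client:
    """Keeps its own copy of everything it has told the compositor."""

    def __init__(self):
        self.buffers = {6: b"window contents"}  # client-owned pixel data
        self.attachments = {3: 6}               # surface id -> buffer id

    def resend_everything(self, compositor):
        # Recreate objects first, then replay the requests that refer to them.
        for buffer_id, contents in self.buffers.items():
            compositor.create_buffer(buffer_id, contents)
        for surface_id, buffer_id in self.attachments.items():
            compositor.create_surface(surface_id)
            compositor.attach(surface_id, buffer_id)
```

Sending `attach(3, 6)` to a freshly started `FakeCompositor` raises `ProtocolError`, whereas calling `resend_everything` first leaves the new instance in the same state as the old one.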
The client has all of the memory allocations and it has all of the data that it wants to send. It sent it once; it can send it again. I mean, we're at Akademy, so I'll talk about Qt first. You have a QWindow object. It has a window title inside the QWindow object. In fact, we create all of this inside QWindow before we create a QPlatformWindow. All of this is cached in the client. We have all of this information, so all we need to do is send it again. So I patched Qt and I made it do this. When we detect the connection has gone, we send everything again and everything comes back.
So let's look in a bit more detail at what was actually needed. From a Qt point of view, we have to handle it as though every screen, every monitor, has been disconnected and reconnected. We have to pretend that every input device has been disconnected and reconnected. But we had code to do all of this anyway, because screens changing at runtime is something that happens, and input devices changing at runtime is something that happens. So we had all of this code; we just had to trigger it.
We have to recreate the window buffers, so your content. And you have to do this from a client point of view at runtime anyway: every time you resize a window, you have to create a new buffer of a new size and you have to be able to draw all of the contents in. And you are always able to resize a window. So we have all of these code paths available; we just need to trigger them.
And lastly, we need to reset the shell, so what the compositor thinks a window is. Qt actually has code for this already, simply because the Qt API allows you to change between a popup and a top-level at runtime, so we had to have code to tear down a window and recreate it. So we had all of that code existing.
Now, there was a lot of other glue, a lot of smaller things that I haven't mentioned. But in general, I was just accessing code that already existed and trying to trigger it. There's one exception: OpenGL.
And this was one of the bigger challenges. The way this works is that when we use an OpenGL library like Mesa, we pass in a pointer to underlying Wayland library objects, so the wl_display object, really low-level code. We pass in these structures, and OpenGL is making low-level Wayland calls using this library. So we need OpenGL, we need Mesa, to stay in sync. So I patched OpenGL, I patched Mesa, and fixed everything.
It wasn't quite as bad as I initially imagined. We were able to keep the client connection to the render device, to the graphics card, completely untouched. The client at some point gets authenticated; it gets told, you're allowed to upload textures, you're allowed to do drawing. And once that's happened once, we don't need to do it again, even if your compositor goes down and comes back up. So your GL context remains intact, and all the textures and vertex buffers you've uploaded are all still there. The only thing we need to do is reset all your EGL surfaces, which we ask the client to do, and it also has to reset a load of internal Wayland objects, a couple of factories and globals. This was quite invasive. We had to change libwayland to do that. We had to introduce a new signal to say: I've been reconnected, keep using the existing pointer I gave you earlier, but you're going to need to make some adjustments.
So, having done Qt, I wanted to try a couple of other toolkits, because trying a few other toolkits helps shape what I did in libwayland and what I did in Mesa, just to make sure the changes there are versatile and work for things other than just Qt. Obviously I'm at Akademy, so I'm going to say everybody should be using Qt because it's amazing, but other toolkits do exist.
So SDL is what SuperTuxKart is written in, which you saw at the end of the video. I patched it. And the changes were relatively small. It was around 75 new lines to do everything. And once we'd done that, we didn't have to change any of the SDL-using clients at all.
Just 75 lines inside the SDL code itself, and that's relatively manageable. Although the changes were quite difficult. I'll be honest, this took me absolutely ages, embarrassingly long for what amounted to 75 lines, but it's quite manageable.
Xwayland. I patched Xwayland. There's a theme coming. Xwayland is a full X server that then has a Wayland connection for passing the surfaces and buffers and passing input events to your clients. But the X server itself is a cache of client state, so all we have to do is send all of that information again. The Xwayland changes themselves were relatively straightforward. They're quite small, just resending everything. And again, I was mostly reusing existing code to handle resending a surface or a window, resetting input devices and such. But what was a challenge here is that KWin was also making potentially blocking calls into X11. KWin can handle Xwayland crashing. So we've got a bit of a chicken-and-egg situation. All of the boilerplate around launching and managing it required a bit of work, but we were able to fix it. The deployment is going to be a bit hard, because we had to make not just some changes to Xwayland, but also a lot of changes to how it interacts with your compositor. But the important part is it works.
Firefox. Now, this is quite interesting. Obviously, there's been a theme so far. So, the next slide: I didn't patch Firefox, but I was able to create something that worked quite successfully. I changed the .desktop file to be "while true, run Firefox". Obviously, because I'm a professional, I checked the exit code and did something slightly better. But effectively, this is what I was running. And I did this because Firefox already has amazing crash handling support. It can handle restoring everything. You have to press a button in the dialog and you get all of your windows back. So it's got all of this code available, and Firefox does some very quirky things with Xwayland.
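The relaunch-wrapper idea can be sketched in a few lines. The talk does not say what the exit-code check actually was, so the rule used here (stop on exit status 0, relaunch otherwise, with a cap on restarts) is an assumption, as are the function name and the `max_restarts` parameter.

```python
import subprocess


def run_until_clean_exit(command, max_restarts=5):
    """Re-run command until it exits with status 0 or restarts run out.

    Returns the list of exit codes observed, in order.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # assumed to mean a deliberate quit: do not relaunch
    return exit_codes
```

For example, `run_until_clean_exit(["firefox"])` would keep relaunching the browser after abnormal exits and rely on its own session restore to bring the windows back. The restart cap is there so a client that can never start does not loop forever.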
So it potentially would be quite hard for them to gain this reset support. I've included this slide to show we don't necessarily need a one-size-fits-all strategy. For Qt, it makes sense to go all out, handle everything and just reset everything. For something like wl-paste, a small command-line tool to paste the clipboard contents, the best thing is probably just to exit. We don't need to have one size fits all. The strategy with the sockets allows for a couple of different techniques. And I think a slightly better version of this might be a way forward for Firefox. Maybe not. Maybe they should add full reset support.
Plasma Shell. Plasma is kind of unique when it comes to Wayland, because not only does it have the common Wayland protocols that every other application uses, it has a load of really bespoke code for getting information about other windows and checking its own window positions. There are a lot more protocols and a lot more Wayland code. I could add reset support here, or I can just let Plasma Shell quit and bring it back using the crash handler. And we can still save everything first. There isn't any risk of data loss, because you don't really interact with Plasma Shell very much. So potentially that's a way forward for Plasma Shell as well.
But what's the worst-case scenario? What if I have an unsupported client? I haven't mentioned Chromium yet. A big monolithic application, or some unsupported case. What's the absolute worst thing that could happen if it does some quirky code? Maybe if it does a round trip, making a blocking call, which I said is one of the cases where we could fail? Well, the worst case is the client closes, which is what happens now. Nothing we introduce makes the situation worse. Clients can opt into this and we can get a brilliant user experience. If clients don't opt into this, nothing gets worse than the current state. Obviously we should still try to make sure KWin doesn't crash. That's always a goal.
But we have a backup. The fallback is that nothing gets worse.
So to wrap this up: it definitely works. This is more of a proof of concept. It isn't quite at a deployable level, but things work really quite well. And I want to stress that no changes were made to your client applications. The KDevelop you saw earlier was absolutely stock. Not even any recompilation. We just changed the toolkits. Job done. Just dropping in new libraries. The changes are complicated. They are invasive and touch difficult paths. But ultimately, they're not too big. I mean, that SDL patch is manageable. You could review it in an hour. Hopefully, this is something we can do moving forward.
So what's next? What's left to do? Well, we need to start upstreaming all of our changes, which I've been putting off because it's scary, and partly because I wanted to build up this repertoire of toolkits that have been ported, just so I can make the case that these library changes definitely work, and that they work in this variety of situations. I mentioned I've patched Mesa in the OpenGL paths. I didn't patch the Vulkan paths. In theory, it should be exactly the same, with a very similar idea. We just need to actually make those changes.
There are a few paths in Plasma integration that we need to follow up on. Effectively, anything that does low-level Wayland code, we have to treat as though the compositor took away a global and the global has come back. A certain interface has been removed and come back. In theory, clients should be handling all of this already. In practice, they're not. One thing I hope to do in Qt 6 is hook up some very generic signals into the QtWayland extension that I mentioned yesterday, so that it is treated exactly the same whether the compositor removes an extension or a crash has just happened. And then you have an opportunity to reconnect with the same code path inside your client. So we need to follow up inside Plasma integration in a few places.
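The idea of treating a reconnect exactly like a global being removed and then re-announced can be sketched as follows. This is a toy Python model, not the QtWayland API: `ExtensionClient`, `Registry`, and their method names are hypothetical, chosen only to show that client code needs a single pair of handlers.

```python
class ExtensionClient:
    """Hypothetical client-side wrapper around one compositor global."""

    def __init__(self, interface):
        self.interface = interface
        self.active = False
        self.log = []

    def on_global_added(self):
        self.active = True
        self.log.append("bind")     # (re)bind and resend cached state here

    def on_global_removed(self):
        self.active = False
        self.log.append("release")  # drop objects tied to the old global


class Registry:
    """Dispatches global add/remove events to registered wrappers."""

    def __init__(self):
        self.clients = {}

    def register(self, client):
        self.clients[client.interface] = client

    def global_added(self, interface):
        if interface in self.clients:
            self.clients[interface].on_global_added()

    def global_removed(self, interface):
        if interface in self.clients:
            self.clients[interface].on_global_removed()

    def compositor_reconnected(self):
        """Treat a reconnect as every global vanishing and reappearing."""
        for interface in list(self.clients):
            self.global_removed(interface)
            self.global_added(interface)
```

With this shape, code that already copes with a compositor withdrawing an interface at runtime gets reconnection support without a separate code path.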
And obviously, there are other toolkits out there: GTK, and Wine is quite a big toolkit that we want to have native support for. But hopefully they'll get jealous of what we're doing inside the Qt space and implement these changes. I have been speaking to one of the GTK developers, who seems on board with this as an approach. And there are a lot of edge cases, potentially. I mean, we'll find these through extensive testing, but I'm sure edge cases will come up. I've been running the Qt restarting on my laptop since around February, and things are generally working just fine. But there have been some nuances. At one point KScreen didn't work properly; that's fixed now. But I'm sure we'll find others as we continue.
When will I see this? So I've mentioned there are some Qt 6 changes, and implicitly that automatically makes it a while. So 18 months from now. But potentially, it means that when we start moving everyone to Wayland by default, we will have all of this in place. So it's going to be a while, especially as there are underlying changes that need to happen throughout the stack. There's new API, which has to land before these toolkits can make use of it, which is always a frustrating chicken-and-egg problem. But we'll see what happens.
Can I run this today? Well, as I've mentioned before, there are a lot of patches needed, from libwayland to Mesa to all of these toolkits. But it is doable. I'm not going to read out URLs, but if you look for my relevant blog post with the same title, you will see a set of links to all of my patches. So, any questions?
And here we are. Yep, I've had a quick change of clothes since then. Yep, here to answer questions. Oh, Luigi's gone. Oh my goodness, I've killed Luigi. Fortunately, Mario and Luigi have many lives. Luigi, I can't hear you. While I wait for that, Kai sent me a message and asked me to just prove that Qt Creator works. So, Qt Creator, I'm going to do a live demo. kwin_wayland --replace. Oh, will it work? It does. Qt Creator.
Oh, I'll read some of the questions myself. Kai says: you mentioned custom Wayland protocols in Plasma Shell. Will this affect any client that uses custom Wayland code? Yes. If you use custom Wayland code, there are changes that are going to need to happen. I'm hoping we can abstract this, so your path for "a global has been removed" and "things have been reset" is the same. And then from your client point of view, you'll just have to implement one thing.
Nico asked: how does GTK behave? Well, right now it will just close like it's doing now, which, as I said, is no worse. I think ultimately we are going to have to convince people: we've done this demo, we want you guys to be involved, and then introduce this into GTK.
Luigi, did you want to say something? Uh, yes. I guess you can hear me now. Hopefully. It's a talk about restarting and not restarting, so I guess it's appropriate. Okay, so you answered. I answered your top two, Kai and Nico. Okay. So we have another from Nate: are there merge requests for these changes that we can follow? I guess you have not started yet.
So I have branches that are pushed, if you can find them: there's one of Wayland, one of Mesa, in personal forks. I haven't turned these branches into merge requests, partly because I'm just trying to do cleanup at the moment. QtWayland, you'll find my fork on Invent somewhere, but it's full of debug noise. So there's just a little bit of cleanup to do to get it to a less embarrassing state. I'd hoped to do all of that before Akademy. That was clearly the plan when the call for papers came about, but that's not how things work in reality.
Oh, no problem. There will be time. And it's good to hear that the other toolkits are at least partially on board with this, because otherwise it will be kind of not complete. But it's really good. There is another question: what would be a potential timeline for this work to be merged? Well, it's connected to the previous one. Yeah.
So, especially at this end, we've got a Qt 6 reliance that puts a minimum timeframe on it. It's going to be at least a long time. But I think we can still have this happen before Plasma moves to Wayland by default. I think doing a timeline based on that semantic level, of before we move the default for our users, seems doable. That's good.
There is another question from Kai. I think that's, yeah, okay. Do you reckon we can make sure Wine's relatively fresh new Wayland support effort can already cater for this? Sure, why not? I mean, relatively speaking, they're only going to be using a small subset of the Wayland protocols. It will be similar to SDL. Did I say SDL was 100 lines? I think I said it was 75 lines. It's going to be similar for this. It might be more. I'm not going to claim how many lines of code something I haven't looked at is going to be, but it should be manageable, and it should be the exact same approach. They might have the same workload.
I like how you both are in matching t-shirts. Oh, it's the Akademy t-shirt. Explain it. Yeah. I have a question for you, David, as well. Yes. I have been advertising our pub quiz and asking people to send photos all this event. Have you done it at least once yourself? You'll see which desk of mine. There'll be me in the way. Okay. We are having this pub quiz together. We're the hosts of that. Maybe you can use this occasion to advertise it at least once. Yes, there will be a pub quiz first day at some point in your time zone. Look it up. I couldn't ask for a better advertisement. Well, you could. You wouldn't have got one. That was literally me trying. Okay. Some helpful person just put a link in it. Oh, it's you. Some helpful person put a link in the chat. Okay.
Are we done with questions? Luigi, did you find any other? There are no other questions, it seems. Yep. So I guess we are done with this. Thanks again, David. And we can restart in three minutes, I think. Three and a half, something like that. Good job. Nice. Yeah.
And in those three minutes we'll be hearing from Bjorn about how we can solve the personal data problem. See you in a few minutes.