All right. So let's continue our journey in the Linux channel. This time, Wolfram will discuss object lifetime issues in various subsystems. So please welcome Wolfram.
Thank you. I'm so happy that you are here. And despite the kind of boring title of the talk, I think it's going to be quite entertaining and fun, hopefully. Before we dive into the topic, I really need to set some basics straight, because if there's only half an understanding at some level, you easily get confused; it is a complex topic. But first, okay, this is the rough outline: basics, problem, results, conclusion. Nothing special about it.
Who in this audience does not know about reference counting? Cool. Awesome crowd. We can skip this. I have lots of slides, so I'm happy about every slide I can skip. The Linux kernel also has a way of reference counting. The reference counting itself is a struct kref. There's some documentation about it. This is mostly tied into a struct kobject. There's also documentation about that, if you really want to know about it. But most importantly for this talk, this struct kobject is embedded into a struct device. And this is the key struct for the Linux driver model, which is especially important for the research I was doing here. So it's all about struct device: creating devices, and those devices are reference counted and should only be gone when the reference count is zero. To modify the reference count, we have get_device() and put_device(). Yeah, that's about it.
Then, who does not know the difference between physical and logical devices? Yay. Good. So we have physical and logical devices. I want to expand on that a little. On the physical side, here's the platform device, which is tied to some hardware. It has memory and an IRQ. Here's the user space side. It's a struct cdev, so it's a little bit different, but it exports functionality to user space.
In most cases here, we have an intermediate device, which is also a logical device. So there's a struct device involved. In the I2C world, where I mainly work, it's called struct i2c_adapter, which basically abstracts away the hardware differences. So if someone from user space or from the kernel says, I just want an I2C transfer, it talks to this adapter, and how this works in reality is done by the platform device. So this is one layer of abstraction, also usable in the kernel, but the important part is: it's another struct device. It's a logical device, and it has its own lifetime. It is reference counted. So much for the basics.
The problem. Let's start. One annoying thing about this is that the problems are rare. That comes from the fact that a regular work cycle usually looks like this, so you don't see them as often as other problems. When you boot, the kernel finds a platform device, and the platform driver decides, hey, I'm an I2C driver, give me an I2C adapter so I can expose my functionality. The kernel happily does that. So now we have a struct i2c_adapter, already with a reference count of one, because the code that wants to expose the functionality of course holds a reference. Later, user space comes along and says, hey, give me your functionality, I want to talk to that device. So it connects to that, and we have a reference count of two. So far so good. Then user space is done. Either the device is removed or the device is just closed; it doesn't matter. It no longer holds a reference, so the reference count goes to one. And then, when we unbind the driver using sysfs or reboot the system, the platform device says, okay, I'm going away, please delete this adapter. It calls i2c_del_adapter(). The ref count goes to zero, so nobody references it anymore and it can go now. And then, later, this platform device can also go.
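The regular cycle walked through above (count goes 1, 2, 1, 0, then release) can be sketched as a small userspace model in C. This is only an illustration under assumptions: the `rc_` names are hypothetical stand-ins for struct device, get_device() and put_device(), and nothing here is kernel code.

```c
#include <assert.h>

/* Userspace model only: the rc_ names are hypothetical, not kernel API. */
struct rc_device {
    int refcount;                           /* stands in for the kref in the kobject */
    void (*release)(struct rc_device *dev); /* runs when the count drops to zero */
};

int rc_released;                            /* set by the release callback */

static void rc_adapter_release(struct rc_device *dev)
{
    (void)dev;
    rc_released = 1;
}

static void rc_init(struct rc_device *dev, void (*release)(struct rc_device *))
{
    dev->refcount = 1;                      /* the creator holds the first reference */
    dev->release = release;
}

static struct rc_device *rc_get(struct rc_device *dev)
{
    dev->refcount++;
    return dev;
}

static void rc_put(struct rc_device *dev)
{
    assert(dev->refcount > 0);
    if (--dev->refcount == 0)
        dev->release(dev);
}

/* Replays the good case and returns the final reference count. */
int rc_lifecycle_demo(void)
{
    struct rc_device adapter;

    rc_released = 0;
    rc_init(&adapter, rc_adapter_release);  /* probe creates the adapter: 1 */
    rc_get(&adapter);                       /* user space opens the device: 2 */
    rc_put(&adapter);                       /* user space closes it: 1 */
    rc_put(&adapter);                       /* unbind deletes the adapter: 0 */
    return adapter.refcount;
}
```

Calling `rc_lifecycle_demo()` from a `main()` replays the sequence; the release callback fires only on the last put, which is the property every later example in the talk depends on.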
So no problems with that. But if you remember the previous slide, when there was a reference count of two, there are two candidates to go away first. So it could also happen that the platform device goes away while the character device still has a reference to this intermediate device. This alone is a whole set of problems, but it's not the main thing I want to talk about here. You can imagine that if you're not carefully checking the pointers you might have into that structure, you get NULL pointer dereferences. And even if you check for that, you have to be really careful not to run into races between the check and the platform device going away. A whole set of problems.
But the previous talks by Laurent and Bartosz were focusing on managed devices. Laurent said devm_kzalloc(), this function which allocates memory, is harmful. Bartosz said, no, it's not. That sounds like a contradiction. I don't think they're that far apart. But anyhow, this gave me the idea: okay, I want to research that. So what is the problem if you use managed devices, especially if you allocate with devm_kzalloc()?
As in I2C, there are subsystems where the physical device needs some private structure to operate. This is pretty common. You need to store some state machine or whatever, so there's a private structure here. But I2C kind of expects you to have this i2c_adapter, this intermediate device, inside the private structure. Where's my pointer? There it is. But now we're mixing things. If we use devm_kzalloc(), we attach the lifetime of this private object to the platform device. But we have the intermediate device with a separate struct device, which has its own lifetime. We're mixing them. This is dangerous. This is really dangerous. And when we register this adapter, it goes live.
So we open up the possibility that user space or other users from the kernel may come in and increase the reference count. What is then really bad: let's assume the example from before, there's this intermediate device, it still has a reference from user space, and the platform device goes away. Now, if we use devm_kzalloc(), at the end of the remove of this platform device it will free the complete private structure. And this intermediate device is in it. So we actually killed a device which has a ref count of one, because user space was still connected to it. That's super bad. Because, hey, that's one of the promises of an operating system: if there's a ref-counted object, it's not going away. So subsystems that do this need some kind of protection to not let this device go away although the platform device is going away.
And now here comes devres, these managed devices. If you don't use it and use kzalloc() and kfree() directly, there actually is a way to do it right if you're super careful. There are still a lot of ways to shoot yourself in the foot, but at least there's a way to get it right. Namely, and we will go into detail later, if you use this release function which every device has, if you populate that, then you can get out of this mess. But with devm_kzalloc(), it will always be wrong. Because even if you populate this release callback, which only runs once all references are gone, devm_kzalloc() will always free the memory once the platform device is gone. It doesn't matter if there's a ref count here or not. So devm_kzalloc() in that context is always wrong. That is why I would say devres is not causing the problem. You can have the problem without devres as well, but it makes it easier to fall into the trap. That is the problem.
The kind of good thing is, because it is always wrong, we can detect it; we can scan for it. I don't need to scan for exceptions. If I see that pattern, it's always wrong.
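To make the difference concrete, here is a minimal userspace sketch in C of the two freeing strategies when user space still holds a reference during unbind. It is a model under assumptions, not kernel code: the `dv_` names are hypothetical, `managed != 0` imitates memory that is freed automatically after remove (the devm_kzalloc() case), and `managed == 0` imitates kzalloc() plus a populated release callback. The model records the reference count seen at the moment the memory is freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct dv_device {
    int refcount;
    void (*release)(struct dv_device *dev);
};

/* Driver-private data with the intermediate device embedded in it. */
struct dv_priv {
    int state;
    struct dv_device adapter;
};

int dv_refs_at_free = -1;   /* reference count observed when memory was freed */

static void dv_free_priv(struct dv_priv *priv)
{
    dv_refs_at_free = priv->adapter.refcount;
    free(priv);
}

static void dv_release(struct dv_device *dev)
{
    /* Hand-written container_of(): step back from the member to the wrapper. */
    struct dv_priv *priv =
        (struct dv_priv *)((char *)dev - offsetof(struct dv_priv, adapter));

    dv_free_priv(priv);
}

static void dv_put(struct dv_device *dev)
{
    assert(dev->refcount > 0);
    if (--dev->refcount == 0 && dev->release)
        dev->release(dev);
}

/* Unbind the driver while one user-space reference is still open. */
int dv_unbind_with_open_user(int managed)
{
    struct dv_priv *priv = calloc(1, sizeof(*priv));
    struct dv_device *dev;

    if (!priv)
        abort();
    dev = &priv->adapter;
    dev->refcount = 1;                       /* driver's own reference */
    dev->release = managed ? NULL : dv_release;
    dv_refs_at_free = -1;

    dev->refcount++;                         /* user space opens: 2 */
    dv_put(dev);                             /* remove() drops the driver's ref: 1 */
    if (managed) {
        dv_free_priv(priv);                  /* automatic cleanup frees at count 1 */
        return dv_refs_at_free;              /* the user's pointer now dangles */
    }
    dv_put(dev);                             /* user closes later: 0, release frees */
    return dv_refs_at_free;
}
```

In this model `dv_unbind_with_open_user(1)` reports that the memory was freed while the count was still 1, and `dv_unbind_with_open_user(0)` reports 0. That is the sense in which the managed variant is wrong regardless of what the release callback does.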
And since I love Coccinelle, I wanted to know: okay, who else in the kernel does it like this? So the maybe slightly academically formulated question is: which subsystems use structs with embedded kobjects and allow these structs to be allocated with devm_kzalloc()? To make it more real, think about I2C. I2C has such a struct, namely struct i2c_adapter; it has a struct device in it, and it wants drivers to use devm_kzalloc(). Or rather, it doesn't want that, but it allows it. So sadly, as a maintainer, I have to say I know I2C is one of the candidates. Who else is there?
So I ran one Coccinelle script to scan all include files and find structs which include a struct device directly or indirectly, because if you do just one iteration, you won't find all of them. It can take many iterations until you finally get to the struct device. I used three, because my poor computer was busy enough with that. And I found around 630 structs. This is quite a lot, I think. Then, with these 600-odd structs, I searched: okay, where in the kernel are those structs allocated with devm_kzalloc()? Because this is a potentially dangerous situation. So I generated 630 Coccinelle files, which I then ran, and I got some output. And then I wanted to know how these subsystems protect against this too-early freeing of the embedded struct device. That is basically my research question. So far so good? Or have I left somebody behind already?
Okay, so a first result. The good news is I got fewer hits than I anticipated. So I can actually show you all of the hits I got in this talk. Hey, that's good. But the bad news is that with all the hits, although they have some kind of protection, I still have issues with them. More or less, they all need fixing.
Okay, I'm brave again. I will do live demos. What could go wrong? Let's see. Can you read that? Okay, so you're all waiting for something to go wrong, right? Okay. Here we have the serial console. And here we have telnet. Both point to the same board.
It's a Renesas board, of course. Thank you for giving me such nice hardware. A good candidate is UART. It was not in my list, so it wasn't reported by my research, but I still want to show you what the end result should look like. I can't show you that with the bad results, because they're bad. So here on the right side is the user space side. I will open a device exported by the UART layer. On the other side, I will now unbind the device while someone in user space has a reference to it. So go to this, what is it? TTY class? I always mix that up. TTY, which device was it? I need to check this. It's E6C5. Okay: device, driver, echo E6C5, unbind. Okay. I have some debug options enabled here. You see objects got released. That's fine. The user space program exited normally. It might have exited with an error code, but this is still graceful. We didn't have any delay when returning from unbind. This is what we want. We unbind, we get back immediately, and user space deals with the problem. Okay, that's the good case. That's where we want to end up.
Ah, yeah, MTD. MTD also exposes devices to user space, and I will just open one of them. So I just assigned this file to a file descriptor. It's basically opened. It does nothing. It's just open, right? But it increases the reference count. Now let's unbind this one. This is a class again, I think. MTD, the read-only one, device, driver. I'm now at the SPI NOR driver and I unbind this exposed SPI. It has nothing to do with SPI; we're on the MTD layer level. So it's SPI NOR. Yeah, thank you. So let's see how MTD... wow. Holy shit. It's still crashing. So lots of things are going wrong there. And I think the machine is even in such an unstable state that I can't reboot now. No, it's okay. Luckily we have a reset button somewhere.
Okay, that was MTD. So struct spi_nor embeds struct mtd_info, which embeds struct device. That's why I said you have to do an iterative process. Okay.
As you saw, it crashes while an MTD read-only device is open. I couldn't find any protection mechanism in the MTD core code. It might be that there is one and it's not perfect, but I did not find one. But I found this comment from March 2009: REVISIT: once MTD uses the driver model better, whoever allocates the mtd_info will probably want to use the release() hook. I think this is probably right. Actually, I think there's a better option than using the release hook, but we will leave that for later. But you see, okay, MTD has been seriously broken since forever.
I2C. i2c-0, I think. Yes. So I open i2c-0. It's a character device. Okay. You want to see a crash, right? You can always go this way or that way, and I always use the path I remember best. Okay. There it is. You're making bets already. Okay. It doesn't crash. This is good. So it has some kind of protection. But as you can see, the unbind does not come back. Bartosz mentioned in his talk that it was a deadlock, but it's not a deadlock. A deadlock is a situation you cannot get out of. This is an uninterruptible wait. But if you have an embedded device where you have just one terminal or shell or whatever, then it can mean that you're stuck. Luckily we have this telnet client here, so I can drop this reference from user space. And then you see things happening. Devices get removed and the system is in a consistent state. So it doesn't crash. That is good. It blocks, which is not perfect.
So: it embeds a struct device, and it blocks uninterruptibly. i2c_del_adapter() waits until all references are gone, using a struct completion. I will explain this in more detail on the next slide. But we also have a great comment: FIXME: This is old code and should ideally be replaced by an alternative which results in decoupling the lifetime of the struct device from the adapter, like SPI does. Yes. That is as true today as it was in 2015.
And the problem has been known for much longer. It's just that I put the comment there in 2015 so I won't forget. And because I always forget the details, I made this talk, so in the future I can look up what is actually going wrong.
So why does it not crash? This device is being unbound. It should go away, right? So the remove callback calls i2c_del_adapter(), and there's this wait_for_completion(). It waits because there's still a reference to this intermediate device. When user space closes the file, or otherwise gives up this reference, and the ref count goes to zero, then the release callback of the device is called. The I2C core populated that before. So when this release callback is called, we know the ref count is zero, and then we call complete() and say, hey, it's done. Finally, the waiting is over. And once the completion has been signaled, the waiting ends and everything can be freed safely. This is why I2C has to wait, but it doesn't crash. Ideally, it would return right away. And if user space still talked to the i2c-dev device, it would say, sorry, all my functionality is gone, and keep saying that until user space says, ah, come on, I give up.
There are some other subsystems which have the same approach to protection. One is I3C. The master controller embeds a struct device. I couldn't really test it because I don't have I3C hardware. Regarding their own master controller thing, there might be a protection. I haven't fully understood it, and I did not go into detail, because sadly for them, they have I2C backwards compatibility. So they embed a struct i2c_adapter and use i2c_del_adapter() to remove it. So we kind of caught I3C and said, hey, you're with us now, and you need to wait until I2C is fixed before you can fix your stuff. Sorry.
I found another subsystem doing that. It's called NTB. Does anyone know that? Okay, a few. Cool.
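The completion-based protection described for I2C above can be modeled in a few lines of userspace C. This is a sketch under assumptions: the `cp_` names are hypothetical, and where the kernel blocks in wait_for_completion(), the model only reports whether the wait would be over, so that it stays single-threaded and deterministic.

```c
#include <assert.h>

struct cp_completion {
    int done;
};

struct cp_adapter {
    int refcount;
    struct cp_completion dev_released;  /* signaled by the release callback */
};

/* Release callback installed by the core: only signals, frees nothing. */
static void cp_release(struct cp_adapter *adap)
{
    adap->dev_released.done = 1;        /* stands in for complete() */
}

static void cp_put(struct cp_adapter *adap)
{
    assert(adap->refcount > 0);
    if (--adap->refcount == 0)
        cp_release(adap);
}

/* First half of deleting the adapter: drop the core's own reference. */
static void cp_del_adapter_begin(struct cp_adapter *adap)
{
    cp_put(adap);
}

/* Second half: the kernel would block here; the model only asks. */
static int cp_del_adapter_may_finish(const struct cp_adapter *adap)
{
    return adap->dev_released.done;
}

/* Returns 1 if unbind is held up while open and proceeds after close. */
int cp_demo(void)
{
    struct cp_adapter adap = { .refcount = 1 };
    int blocked_while_open, finished_after_close;

    adap.refcount++;                    /* user space opens: 2 */
    cp_del_adapter_begin(&adap);        /* unbind starts: 1 */
    blocked_while_open = !cp_del_adapter_may_finish(&adap);
    cp_put(&adap);                      /* user space closes: 0, completion fires */
    finished_after_close = cp_del_adapter_may_finish(&adap);
    return blocked_while_open && finished_after_close;
}
```

The model shows why this scheme does not crash (memory is only freed after the completion) and also why unbind cannot return until the last user lets go.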
Non-transparent bridge or something. It's probably not too interesting for embedded devices. PCI Express can do magic things with it. Just by code review: my Coccinelle scripts told me there might be such a subsystem, I reviewed it, and I found they also have a completion. Which I think is worth fixing.
Then the surprising result: virtio. Okay, let's check. What could go wrong? Ah, I don't need it. Now you got it. Now I'm lost. Where's my driver? What a pity. I don't know how to fix it right now, which is sad for you, because it would have crashed, even without me opening an additional reference from user space. How come? Oh, no, let's go back here again. Because I have a debug option turned on.
struct virtio_device embeds a struct device, and by design, their subsystem forces all its users to provide a release callback for the virtio device they created. I think it's not very elegant to push that responsibility to drivers, but it does work. If you allocate the struct device, you make sure that the freeing function is connected to this actual device, and then you don't touch it yourself anymore. That basically works. But one driver, the MMIO driver, got it wrong and still used devres with it. And if you enable this debug option, which nobody knows about unless they're working on object lifetimes, then it fails. Why?
The good case, how all the other drivers work: you allocate the virtio device, which is reference counted, with kzalloc(), not managed. Then, for that virtio device, you populate the release callback with the accompanying freeing function. Only when the reference count is zero do we free that memory, not before. Then you do something with the device, and then on unbind, this device is, and the terminology of the driver core is not good here, deleted, which means it's taken away from the driver model. You cannot connect to it anymore.
You cannot get a reference anymore, but the memory of the struct device is still there, because in the driver model we call it release when you want to free the memory. So the device is deleted. You cannot connect to it anymore. And because, let's assume, we're the only user, the reference count goes to zero immediately and we free the memory. And then we can also free all the other resources connected to this platform device. Which one comes first is actually not so important. It could be the other way around, because this intermediate device takes care of itself. We could delete this one first, wait until all references are gone, and then delete the other one. Understandable? Not so much. Where's the problem? Can you say it? No, pity. Shall I try again? Okay.
The problem with devres is that it totally breaks these assumptions. We allocate memory with devm_kzalloc() and populate the release function not with kfree() but with devm_kfree(), because we used devm for allocating, which sounds like a good match. We do something with the device, we unbind like before, and the reference count goes to zero. And now comes the difference. This is where devres is really harmful. Once the platform device is gone, because the memory was allocated with devm_kzalloc(), it will free the device regardless. There is a release callback populated, but devres will still free the memory before that. And that's why we got the splat. Because what does this debug option do? It delays the release. If this all happens instantly, like we go to zero here and right away this gets removed and that gets removed, you will not notice anything, because you're lucky with the ordering.
But if this debug option waits a certain time, like a few hundred milliseconds, until the memory is released, then you get the devres problem, because devres will remove this, including that. And after the delay, the driver core will try to call the release function, and, oh, the device is gone. There is no release function anymore. Boom. Not good.
So, like I said, there are subsystems which say, I request you to provide a release function. You can argue that technically it works, but I really don't like this approach. If you look at the slide, there were multiple tries to get the lifetime things right. There was the initial submission, and then the driver core found out there's no release function. That's forbidden, so it emits a warning. So what did the first person do? Well, he put in an empty release function so the warning goes away. Then someone else noticed, no, it shouldn't be empty. So he added this devm_kfree() just to match it, but still without understanding what the actual problem was. And while trying that, he got all the devm management wrong, so it needed another fixup. So at least the devm part was consistent again, but it was still wrong, until a few minutes ago. I sent a patch.
And this is where I think we have consensus. Bartosz formulated this in his talk. And if you listen to the complete talk Laurent gave at Plumbers, people there were also saying this: it is not a good idea to put the responsibility for this lifetime management on driver authors. I think most people agree that the layers which introduce a struct device should take care of it and not say, here, you do the work. Experience says it goes wrong very often. So I could fix this subsystem. It is okay now, but I still think the approach in general needs fixing.
There's another, quite new subsystem I'm not super familiar with. It's a bit scary, actually: the auxiliary bus. It also expects you to populate the release function.
Unlike others, they're pretty explicit about it. So let me show you the documentation. They really have separate documentation here about the lifespan and how they expect the management of this lifespan to be done. It's quite long. Must be good, right? So they're at least explicit, and you can read about it. But still, just from review, I think one driver got it wrong and is using devres at the wrong point. I can't test it because it's very architecture specific. I think it's Qualcomm, I'm not sure. I'm not blaming them; I just don't have that hardware, so I can't test it. But it seems that the other drivers do it right.
But what does right mean? This is an example of how this auxiliary bus is used. I don't expect you to understand it in detail. I just want to give you an impression of how this looks. First, we request the bus itself. It seems you can use managed devices for that. Then we want to have certain devices on that bus. We can't use devres for those because they have separate lifetimes, so we use kzalloc(). So we have a mixture of the two. This is already... And then the auxiliary device does not have all the fields this driver needs, so it puts a wrapper around it, with the aux device itself and the additional information it wants to store. It needs a unique number, so we need this IDA to get a unique number. And then it starts filling in the stuff, like, I mean, aux bus, aux device, aux dev dot name. This is not exactly short. This is not exactly readable. And I think it's kind of fragile. So you need, where is my pointer, all this code for one device. And since they want to have a second one, they do it all over again for the other device. I think this is not maintainable code. It's not easy to read, and it's easy to get something wrong in there. And I mean, Greg Kroah-Hartman likes this subsystem, I have been told, but about that API, I'd like to meet him soon. I would suggest something like this.
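The slide itself is not part of this transcript. As a rough illustration of the pattern this part of the talk describes, where the subsystem core allocates the device together with the driver's private data and owns its release, a minimal userspace sketch in C could look like this. Everything here is an assumption made for illustration: the `al_` names are hypothetical and are not the SPI, netdev or auxiliary bus API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical controller object owned by the subsystem core. */
struct al_ctrl {
    int refcount;
    int registered;   /* can new users still connect? */
    void *priv;       /* driver data, allocated in the same block */
};

int al_live_objects;  /* controllers whose memory still exists */

/* Core allocates the controller plus priv_size bytes of driver data. */
static struct al_ctrl *al_alloc(size_t priv_size)
{
    struct al_ctrl *c = calloc(1, sizeof(*c) + priv_size);

    if (!c)
        return NULL;
    c->refcount = 1;
    c->priv = c + 1;  /* private area starts right behind the controller */
    al_live_objects++;
    return c;
}

static struct al_ctrl *al_get(struct al_ctrl *c)
{
    c->refcount++;
    return c;
}

static void al_put(struct al_ctrl *c)
{
    assert(c->refcount > 0);
    if (--c->refcount == 0) {
        al_live_objects--;
        free(c);      /* the core frees, never the driver */
    }
}

static void al_register(struct al_ctrl *c)
{
    c->registered = 1;
}

/* Unbind path: hide the controller, drop the driver's reference, return. */
static void al_unregister(struct al_ctrl *c)
{
    c->registered = 0;
    al_put(c);
}

struct al_my_priv {   /* hypothetical driver-private data */
    int state;
};

/* Returns 1 if unbind returned at once and memory outlived it until close. */
int al_demo(void)
{
    struct al_ctrl *c = al_alloc(sizeof(struct al_my_priv));
    struct al_my_priv *p;
    int alive_after_unbind, hidden_after_unbind;

    if (!c)
        abort();
    p = c->priv;
    p->state = 42;                            /* driver fills its own fields */
    al_register(c);                           /* hand the object to the core */
    al_get(c);                                /* user space opens */
    al_unregister(c);                         /* unbind does not wait */
    alive_after_unbind = (al_live_objects == 1);
    hidden_after_unbind = !c->registered;     /* still valid: the user holds a ref */
    al_put(c);                                /* user space closes: memory freed */
    return alive_after_unbind && hidden_after_unbind && al_live_objects == 0;
}
```

The design point of the sketch is that the driver never calls a free function on the controller. It only allocates through the core and hands the object back, so the lifetime of the embedded device is decided in one place.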
And this is really the pattern I like in general. It is probably also the pattern the I2C subsystem will evolve into. You have your private struct here and fill it with stuff. And then you allocate the device you want to use later. Well, I need more. You allocate the device, and the core will take care of all the bookkeeping for this struct device, which is needed by the subsystem, not by the driver. Then you get a device back and you can fill in the fields which are needed. If you still need to do something special, you can have an optional release callback, but if you don't do really nasty things, most drivers should be able to get by without that. And once you're done with that, you give this object back to the subsystem and it will do the rest, especially when unbinding. This is what SPI does. This is what netdev does. This is a pretty common structure in the kernel, and I think we should apply it everywhere we can. So far it has worked. It has other benefits, but I can't go into them right now because I'm already short of time.
To my surprise, USB gadget was also reported, because there was a mixture of release function and devm. I was surprised because the USB people are usually the ones who get it totally right, because they know magically disappearing devices best, I think. And indeed, I could not trigger a problem, but I still need to investigate this. Either I was not triggering the right problem, or they're actually safe and I haven't understood why. So USB gadget is still kind of an open target. Depending on my investigation, I might have bad news for you in the future.
My conclusion. The first summary is not so surprising: we have different kinds of life cycle problems in the kernel. I was just presenting this complicated devres-and-embedded-device thing. There are even more. Some of them are really long lasting, if you think about MTD with the comment from 2009.
For the ones I was researching here, devm_kzalloc() cannot be blamed. I mean, you can make the same mistakes without it. But as I said before, it makes it easier to fall into the trap, and it enabled me to make this initial start on the research, because devm_kzalloc() is always wrong in this situation. But with such research, if you answer one question, you get at least three new ones.
Further research activities. I think it's possible to have a Coccinelle file that also finds the manual ones, where you use kzalloc() and call kfree() not in the release function but in the remove function, at the wrong place. I think this can be encoded in Coccinelle as well, and then we will likely get more hits. And as I said, to limit the search space, I was only scanning for struct device. For the complete picture, I should scan for all struct kobjects. Then I will get the character devices, the cdevs and all that. And there's more. When I found this one driver which tried to solve the situation with an empty release function just to silence the driver core, I thought we can check for that as well. There might be corner cases where this is okay, but in general it indicates a problem and somebody not understanding what's happening there. So we should check those as well.
Potential solutions. I already mentioned pushing the responsibility to drivers; there seems to be agreement that this is not a good thing to do. In Laurent's talk, he mentioned a garbage collector, I think only for completeness, because everyone who hears it is like, oh no, I don't think this is the way to go. Bartosz had a nice idea. There's a __cleanup annotation you could use, where you define the cleanup function already when you call a function. And if you combine this with ref counting, you can make sure that things happen magically. That's research still to be done. My gut feeling:
It is technically possible, but it's a paradigm shift, because in the kernel we're used to this: you allocate something, you free something. This would be different. So even if it's technically possible, I think Bartosz has to convince a lot of people to get this into mainline. But maybe it's worth it. I don't know. My preferred solution is what I said before: do this thing with alloc. Ask the subsystem to give you a prepared struct, then you work on it, and then you pass it back to the subsystem. It will do the magic it needs to do for the devices it uses itself. What you do with your own resources is your thing in the driver, but this separation needs to be made, I think. But converting I2C, for example, to that: I have more than 150 drivers which technically need to be converted. This is a lot of work.
And even more life cycle issues exist. I mentioned character devices where the platform device goes away. From Video4Linux, I know there's some... Gerr told me about a kind of dependency hell, where he cannot unbind this device because another one is gone. And DRM... okay, sorry, but it's all media. So there's stuff there as well. And then there's the case that some existing protections are not perfect yet. There are race conditions. They all need to be audited. Plenty of things to do.
And since safety is a bigger topic this time, I want to mention that. I have only partial knowledge about safety so far, but I know you can't audit the Linux kernel as a whole. So what you do is describe processes for how you handle it if something goes wrong, and then this gets accepted or not. So think about our process if an issue regarding the life cycle of objects in Linux device drivers is discovered. And I deliberately say device drivers here: I think the core kernel code is looked at way more closely, and I don't expect such, let's say, simple bugs to be present in the core, but in drivers they're quite present.
The process is: we add it to the list of already known life cycle problems. It is agreed that fixing these issues would be great, but nobody does it, because there be dragons and it's simply a lot of annoying work. This is, I think, kind of the status quo we have with life cycle issues in the Linux kernel now.
So that brings me to my last slide, the future. I think it doesn't look too bright. I don't expect too many things to change in the near future, because the problem is not like, oh, that's an interesting problem, I can go for it. What is kind of interesting is what I did: research. Where is this problem? Where can I find more instances of it? This is kind of fun. But if you then see, okay, I need to fix the whole subsystem, that's, so... And so I guess we will still see some of these FIXMEs in 10 years, unless somebody throws a lot of manpower and money at the problem. So raising interest in the topic and raising funding will be, I think, the necessary first step, because this is no part-time work. I will try. We will see. Until then, I personally will still start fixing the I2C subsystem. Let's see where we go with that. At least, because of this talk, virtio is now fixed. At least one good outcome, right?
And I want to thank Renesas, because they're partly supporting this work. I should mainly do other things, but it's okay for them if I spend time on such things. And of course Coccinelle. It's such great software. I don't know how many bugs it has already eliminated or found or whatever. I love it. And with that, just five minutes late, but you still get some snacks, I promise, this is the end of my talk. Thank you for being here. I hope you learned something, and enjoy the conference.
Let's pretend we have time for one question. So it has to be the best one.
I can't say I understood everything from your talk, but I'm wondering: would it be possible to teach devres to do something special for reference-counted objects, so that this bad interaction is somehow solved in devres itself? It's probably too difficult to answer that immediately.
I don't think so, because, as I said, the actual reference-counted object sits at a different level of nesting. With that alone, I already see a problem teaching devres that. What Laurent mentioned in his talk is that devres is mainly misunderstood. When devres was submitted to mainline, the promise was clear: the objects tied to the device will be removed on unbind. He clearly said that, and then on the next slide, super funny, what people read was: it will remove them magically at the right time. That's what people hope for. That's what they want, right? But this is not the promise of devres. I would need a little bit of hacking and trying out, but I don't think piggybacking that on devres is the right way to go. I might be wrong about that; it's just a gut feeling.
So, thank you. Thank you very much.