Yeah, welcome back after the break to my talk about Linux kernel locking engineering. I'm Daniel Vetter. Oh, maybe a room announcement first: if we can compress a bit, because people are still walking in, that would be nice. Thanks a lot, it seems quite packed.

So I guess the first question, and it serves as a nice introduction, is: why this talk? What's my goal here? I've been doing kernel maintainership in the graphics subsystem for over 10 years now, first for the Intel graphics driver, and for a few years now as co-maintainer, together with Dave Airlie, of the entire subsystem. Graphics has grown quite tremendously over the last 10 to 15 years: we've got a lot of drivers, we had to add a lot of features, we had to re-architect a lot of code, and fairly often the locking turned out to be a mistake. Over all these drivers, some of which started out at the level of "locking, what do I need that for? My display lights up, that should be good enough", I've seen why certain locking patterns and design approaches are hard to debug, hard to maintain and hard to refactor.
Over the past few years I've somehow become the person who gets approached every time someone has a lockdep splat in a graphics driver, so a bit by self-fulfilling prophecy I've become the locking expert. I've tried to distill all the lessons learned from the specific cases of the last few years into more abstract patterns, anti-patterns and principles. This entire topic started out as some internal trainings a few years ago, and last year I finally got around to writing it up as a blog post. So those are the two links, and the slides are online, so you don't have to take pictures.

That's also the structure of the talk. It's essentially two parts: first a bit about general principles that apply across all the different patterns and challenges, and second a collection of locking design and engineering patterns, structured into a hierarchy, starting with the ones that are easiest to understand, easiest to debug, easiest to refactor and combine, down to the ones you really should only use when you've exhausted all other options. That's going to be the rough structure.

So to start off, let's look at some of the principles that I distilled from all these bad examples, and from some of the good ones I think we had. The first question is: what should the priorities in locking design be? The absolute first one is: make it simple. Debugging broken code is hard; debugging broken locking is much harder. And yes, we have tooling like lockdep splats and the new memory race sanitizer in the kernel, KCSAN I think it's called, I forgot the exact name, and all these tools help you find bugs before they blow up in production. But essentially my rule of thumb is: if you get a lockdep splat or some deadlock report from a customer or whatever, and it's not immediately obvious to you what's wrong,
then your locking is too clever and you need to make it dumber. It might not be obvious at all what you need to fix, that's an entirely different story; sometimes you realize this is going to be two years of refactoring until you can actually fix it. But the locking should be so simple that when you get the bug report, it's pretty much immediately obvious what went wrong. Otherwise it just gets too hard, and there are so many cases of driver authors coming in with lockdep splats and having no idea what's even wrong.

The second priority, once you've made it simple enough that you can actually fix bugs in there, is that you should maybe try to make it correct. That's phrased a bit over the top and cynically, but I really think that if you try to make it correct before you've made it really simple, you've probably already screwed yourself into a corner. And then, once you've made it correct, the bugs are out, the kernel doesn't crash and the driver generally does the right thing, then and only then can you look at making it fast, and only fast enough.

Making it simple should be pretty clear, or at least: if you don't understand the bug reports, it's just too complex. That's my rule of thumb. But let's look at the other two a bit. For making it correct, the first rule, and there have been a bunch of mailing list rants and blog posts on this, is: design for lockdep, never against it. If you get a lockdep splat, analyze it, and then conclude, whether you're right or not, that the lockdep report is wrong, that your design doesn't deadlock but lockdep just doesn't understand it, then maybe your code is correct, but your design is wrong.
Lockdep groups locks together into classes and tries to find deadlocks among entire classes of locks. That forces you into a locking design where the rules don't change: you can't have an object whose lock nests inside B in one place and the other way around later on. You can make that correct, but lockdep will not understand it. So lockdep in a way forces you into a locking design that's a lot simpler, and that feeds back into the first priority.

Related to that: avoid fancy lockdep annotations like the nesting annotations, because with those you can end up in situations where lockdep says everything is good while your kernel has actually deadlocked. Those annotations are very dangerous. You really should use the standard locks with their standard annotations as they are; as soon as you start shoveling around lockdep keys, lockdep classes and nesting orders, you're very much in dangerous territory.

Another thing that I think is really good: documentation is cool, but executable documentation is better. With complex lock nesting, what we've done in a bunch of cases in graphics is that, when CONFIG_LOCKDEP is enabled, we run a function at module load which takes all the important locks in the right order. That way you never get the case where driver A has a certain nesting and driver B has it the other way around: you can enforce a consistent locking order across the subsystem, even though people usually load only one driver, which is the usual case with graphics, so you'd otherwise never notice the inconsistency. So priming the locking order when CONFIG_LOCKDEP is enabled is really good.
This is especially important on the memory management side in graphics, where we have interactions with shrinkers and the memory mapping semaphores and all these things. The graphics locks really need to sit in exactly the right slots of the overall memory management locking hierarchy.

The other thing that helps make your locking correct is all the annotations you can sprinkle over your code. Especially when you have a fast path and take certain locks only in the slow path, might_lock() is a really nice annotation, because it makes lockdep always pretend you're taking the slow path; the same goes for might_sleep(). A fairly new one that I added a year or two ago is might_alloc(): especially when you're interacting with the memory management subsystem, with a page fault handler, with your own shrinker or with MMU notifiers, might_alloc() pulls in the entire memory reclaim hierarchy. And of course, if you have functions that assume certain locks are held, lockdep_assert_held() is your friend. These are essentially tricks to use lockdep to prove your design and make sure everyone follows it.

Still on the topic of making your locking correct: don't invent your own locking primitives. Seriously. For one, you're pretty much guaranteed to make the Linux RT people unhappy. You're also pretty much guaranteed to get the lockdep annotations wrong, if you even think about adding them. And this extends to concurrency primitives in general, because when you don't have locks but, for example, completions or things like that, you also need memory barriers
for handing over or synchronizing access to data. That's really the hard part: I think we invented two or three of our own synchronization primitives in DRM, and they're all wrong with respect to memory barriers. This is really, really hard. So use the existing locks, because a lot of people have thought a lot about what exactly they mean: what the precise semantics are, which barriers need to be included, what the right lockdep annotation is, how the semantics change when you enable real-time Linux, all these things.

The next one: pick the simplest possible lock, or lock design, because generally the stricter the rules, the quicker you catch bugs. If you can use a simple spinlock, that's better than a mutex; a spinlock is better than a read-write mutex if you don't need the latter, because there are fewer things you're allowed to do and more things lockdep can catch for you. The same applies to concurrency and synchronization primitives: when flush_work() or flush_workqueue() is the right thing for you, don't invent your own thing with completions or a wait queue. flush_workqueue() has lockdep annotations, so when you create certain deadlock scenarios, say your worker is waiting for a lock and you're holding that lock while you wait for that worker, then lockdep will complain. But if you roll your own synchronization with wait queues and completions, lockdep will say everything looks good, and it deadlocks in production.

Finally, the last priority: make it fast, and really, only make it fast enough. The first question you should ask is: does it really need to be faster?
In a graphics driver, the mode setting code handles display refresh at 60 frames per second. So if your code is fast enough to do a few thousand updates per second, it's good; you're already way faster than it needs to be. The next rule: only use real workloads to justify performance tuning, not microbenchmarks. Microbenchmarks are good, once you've identified a problem, for making sure you actually make forward progress and don't regress anything else; they're great for that. But they're not good for justifying complexity.

One thing that is maybe a bit specific to GPUs, though we've seen it in other places too: an overall better architecture gives you so much more benefit than trying to micro-tune the locking. The big example there is the Vulkan GPU model instead of OpenGL. Vulkan is actually how GPUs work nowadays, while OpenGL is how GPUs worked in the 90s, and it just doesn't fit anymore. I think another great example is the POSIX file I/O API versus io_uring: one of them is just fundamentally faster, and there's no amount of clever locking tricks you can apply to the POSIX file I/O API to get it to the same level as io_uring. So those are the priorities.

The next principle, and this ties into why you should follow lockdep and not fight it, is that you should protect data, not code. Protecting data essentially means you build your data structures, your structs and whatever, and when you need to protect some mutable member in there, you have one rule that holds for all the code: that struct member is protected by this lock, or this rule, whatever it is. What this means for review and testing is that all you have to do, in testing with the lockdep annotations and in review by reading the code, is compare all the code against that single locking rule.
But if you go the other way and just protect the code against each other, with state transitions where the locking rule changes and other fancy things like that, then in testing you have to test all the pieces against all the other pieces, and the same in review, and that just doesn't scale.

Another example of protecting code rather than data is when you say you don't care about performance and just use one lock for the entire subsystem. That tends to be a maintainability nightmare. In a way, if you protect all the code in the subsystem, you could claim you're protecting all the data structures in the subsystem, but really it's more like protecting all the code with a kind of big kernel lock, or subsystem-specific kernel lock.

So this is why lockdep pushes you in the right direction: when you initialize a structure and its mutex, the way lockdep works, all the mutexes for that structure end up in the same lockdep class, assuming you have a single initialization function for it, as you should. So lockdep ensures that for a given data structure, even a given piece of data, you always follow the same rules anywhere in the code. Unfortunately, there are some anti-patterns.
I think the one that's been most painful for us in graphics is kref_put_mutex(). kref_put_mutex() is like a normal kref_put(), except that before it does the final unref, it grabs the given mutex. This allows you to protect weak references in lookup structures and other places with that mutex: when you do one of these cache lookups, you grab the mutex, you look up your object or whatever it is, and you're guaranteed that, as long as it's still under the protection of that mutex, the reference count is at least one, because the final unref is always done while holding this mutex. The release function can then remove all the cache and lookup entries from all the places.

Now, the problem is that this protects not the data structure; it protects you against the release function. The way this tends to hurt you: what happens if you have two lookup caches with different mutexes? At that point you'd need something like a kref_put_mutex_mutex(), and maybe you see the problem: it doesn't compose. The next issue is that your mutex tends to become really big, because you need to hold it through your entire release function, and there might be all kinds of things you need to clean up while you're holding it, which increases the odds that you're facing a deadlock. So that's the second principle.

With the principles out of the way, let's dive into the different patterns I've seen and tried to extract a bit. Like I said, I've tried to put them into a hierarchy, from most maintainable and easiest to debug down to "you're going to be in pain". This is just an overview. The first level is that you don't have locks at all; you have other, clever solutions. The next one is just a big dumb lock, where every data structure has its own lock, so not the big-kernel-lock or subsystem-lock thing.
That's too big. Then there are scenarios where you need more fine-grained locking for functionality, which is where it already starts getting complicated. The danger zone begins when you do fine-grained locking for performance, and really annoying is when you start applying lockless tricks and inventing your own locking, synchronization and consistency primitives.

So let's look at the first level. Any time you can solve a data consistency problem with one of these patterns, you should. I think the most powerful pattern here is immutable state: you create an object, then you publish it, by putting it into an xarray for lookup, or adding a file descriptor slot for it, or whatever it is, and from that point on you never, ever change it again. Which means there are no synchronization primitives at all. This is essentially the difference between OpenGL and Vulkan: in Vulkan almost everything is just immutable state, and it's really fast, it's really easy, and it allows you to write really simple drivers. We had a bunch of cases where people actually made things very mutable, then deeply regretted it, and we had to retrofit immutable state objects behind a mutable uAPI just to take care of the complexity in the code base.
Another pattern is the single owner. For example, when you have asynchronous processing, you create some structure that encapsulates the work, and then you hand it off to the worker. Before you hand it off, you're the owner; after you've handed it over, the worker is the owner. There's only ever one. Any time you own it, you can change it; if you don't own it, you're not allowed to even look at it anymore. So the single owner is a very nice pattern. Ideally you implement it with the workqueue primitives, because like I said they have lockdep annotations, but sometimes you need to use completions and such.

Another pattern from this group is reference counting. In a way it's a special case of the single-owner pattern, because the release function is guaranteed to be the single owner of the object, so it can destroy it without taking locks and clean it up in any order, since no one else is looking at it anymore. And because Rust is cool, I have to mention it: Rust really excels at ownership patterns, because ownership is part of the borrow checker. You can encode all of this, exclusive ownership, being the only one who can mutate a structure, really well in Rust.

So the next level is: well, you couldn't use one of the previous patterns, you actually have concurrent access, and you need some kind of lock. There's a slide missing here. Okay, I need to improvise, I'm sorry. Where is it? No, it's gone. This is annoying, I apologize; something must have eaten it. The trouble with the simple lock is: if you make it too small, you might need two locks to change something, and then you have increased chances of deadlocks.
One code path might need to lock object A and then B, and another locks B first and then A, and then you have a deadlock. But you also need to make sure you don't make the lock too big, because then you run into the problem of protecting your entire subsystem or driver with a single lock, which again makes your code much harder to refactor. My experience has been that for the big-simple-lock approach, hindsight is the most reliable indicator of whether you got it right. So unfortunately you will have to adjust: sometimes split objects up into smaller chunks and pieces, sometimes merge a few things back together. That just happens; picking the right size for your simple lock is the challenge with this pattern. And again, Rust excels here with the mutex guard: it nicely enforces that you take the big lock for your object and only then get a mutable reference, and when you drop that local variable, the lock is automatically released.

So that was the missing slide, and we can go on to the second level, which is more fine-grained locking. In some cases you need, for functionality, to protect pieces of your data structures with smaller locks. I think one of the most common patterns is keeping a linked list of all the objects: you have a spinlock somewhere in your driver, plus the list head, and obviously each object might have its own lock, but when you need to manipulate the list, say move an object from one list to another, add it, or delete it, you also need to take that spinlock.
That spinlock is nested within the object lock. Another very common one: in drivers we tend to have to deal with interrupt handlers, and interrupt handlers can't use mutexes, but your driver object might need a mutex because you sleep under that lock or need to allocate memory. So you nest an interrupt-safe spinlock within your object mutex, covering just the data that's shared with the interrupt handler.

The next one is asynchronous processing. If you have an object and some of the work is done asynchronously, and your asynchronous worker uses the same lock, you might end up in the situation where one thread grabs the lock and waits for the asynchronous worker to finish, while the asynchronous worker is waiting for that thread to release the lock: a deadlock. So again, similar to the interrupt handler situation, you need a subordinate lock which encapsulates just the data the asynchronous worker needs.

The fourth one that is fairly common, and I already touched on it by complaining about kref_put_mutex(), is weak references.
So if you shouldn't use kref_put_mutex() to implement weak references, what should you use instead? That's the next slide: instead of kref_put_mutex(), use kref_get_unless_zero(). You protect your lookup cache with a spinlock or whatever, and when you've found your object in there, you call kref_get_unless_zero(). If the object still exists, you get an elevated reference count; if the reference count has already dropped to zero, it's a zombie entry which hasn't been cleaned up yet, and your lookup simply fails. This means that in your release function you can just take the spinlock, remove the cache lookup entry, drop the spinlock, take the next spinlock, remove the lookup entry from another cache, and so on, if you have one on the file descriptor, one on the device, one in the subsystem, who knows. So with kref_get_unless_zero() you can make weak references composable.

The same thing applies to the flush_work() issue: if the lock you hold to synchronize with the asynchronous processing is the same lock you use for protecting data consistency, you can also get deadlocks. I've seen this in a bunch of places, and I've tried to abstract it into an anti-pattern: using locking for object lifetime management. With kref_put_mutex(), what the mutex essentially does is give you object lifetime. And with flush_work() versus the lock, flush_work() is about synchronizing object state and lifetime, but the mutex is about data consistency. My experience has been that if you mix these up, you tend to get deadlocks, and what's worse, lockdep natively doesn't understand cross-release dependencies; there have been efforts for years to change this.
With lifetime issues, or cross-release code paths, there's a completion wait on one side and a complete() or wake-up on the other, and that's the thing that creates the dependency; it's not the locking critical sections. So in general, it's both way too easy to create deadlocks when you mix up locking with lifetime issues or state transitions, and, worse, lockdep doesn't even tell you. You will only find out about these in production, which is not great. This is also the reason why you really should use the most specific primitive the kernel has: the more specific a primitive is, the more likely it is that lockdep annotations for these cross-release dependencies can be added, and that the respective subsystem maintainer has done so. So flush_work() will complain, but if you hand-roll the same thing with your own completions, it will not.

The slightly worse case of the second level, where you're really getting into danger, is when you do fine-grained locking for performance. In the principles I've already talked about what you really should do and, to be honest, shouldn't do in most cases. So the final level, where I've mostly seen bugs and not a lot of good code, at least in drivers, is when you get into the lockless tricks and the really fancy stuff. This is the section of my blog post and talk where I'm going to make a few people angry, because, at least in drivers, I see these only as anti-patterns.

The first one is RCU. Now, RCU is great. It's fast, it's awesome. But the fundamental problem, where I see it turn into a maintainability nightmare, is this:
fundamentally, RCU works by extending the object's lifetime a little bit, very cheaply, any time you have a read-side critical section. So RCU fundamentally violates the entire rule that you should separate object lifetime from data consistency. If you just replace a read-write lock or something with RCU, you haven't done anything bad. But then people get creative: they notice that this thing kind of keeps the object alive, and they start exploiting that, and one or two years later you notice that entire GPU virtual memory destruction and release code paths are somehow under RCU-delayed free, because everything slowly moved in there, and you end up with synchronize_rcu() and other delay points in really bad spots. So RCU is great, but because it fundamentally mixes up lifetime with locking and consistency, it's in my experience really dangerous.

The next one is atomics. One thing that trips people up a lot: C++ has had really well-defined atomic semantics for a few years now, but they don't match the kernel's, so anyone who comes from userspace gets it wrong. The other hilarious thing with kernel atomics is that some of them don't have "atomic" in their name. I think the trickiest ones are the bitops, where set_bit() and test_bit() are the atomic versions, and the double-underscore variants are the non-atomic ones. So the relaxing code where you can be at ease is the code with the double underscores, and the code without them is what should freak you out and make you review things carefully.

Another anti-pattern is all the preempt_disable(), local_irq_disable() and local_bh_disable() constructs. For one, they annoy the real-time people, because they have really badly defined semantics.
Luckily there's a replacement, local_lock I think it's called, which has much better semantics and, in my opinion, essentially moves these code patterns out of the purgatory level into one of the good ones, the level that is just a simple lock. In a way it's a mix between ownership rules and a lock. But the hand-rolled stuff is very hard to debug, understand and maintain, and with all of these things, sooner or later you need memory barriers.

I would say memory barriers are an anti-pattern in two cases. Either people just don't bother with them, which is very obviously broken in almost all cases. Or they do bother with them, and then you should be really freaked out, because my experience has been that unless you've done essentially the equivalent of a PhD in lockless algorithms, you can maybe spot bugs, but you can't prove correctness. We have, I think, two or three of these in the DRM subsystem, and we're trying to get rid of them, because it turns out they're broken, or rather we weren't careful enough in adding just the right amount of memory barriers.

So, to close out, I figured a nice case study would be one example from DRM which I think worked out really well: atomic mode setting. This is essentially implemented as atomic transactions against your display hardware, but unlike in a database,
where you usually have a try/commit/rollback approach, with real hardware, once you start fiddling with your display, you can't really roll back, because the user has already seen things starting to turn off and such. So what we have instead is a check/commit split, where the commit is not allowed to fail. In the check phase, which can potentially run concurrently, we have essentially per-object locks. We achieve composability of these per-object locks with the ww_mutex (wait/wound mutex) locking, which means drivers can take these locks in any order they feel like, and you can even check things in parallel on subsets, say if you have your display hardware split up into two seats with two independent compositors running.

In the commit phase, it's pure ownership. The commit is done with an asynchronous worker if userspace asks for it, and essentially, once we've put the new software state in place under the protection of the locks, actually writing that state into the hardware in the commit phase, which is not allowed to fail, and that's why we can do it this way, follows pure ownership rules, and subsequent transactions synchronize with completions. Unfortunately we can't use the workqueue primitives there, because you can have networks of transactions, since sometimes hardware resources are shared and things like that.

I think the part that's really nice is that, at least for a standard driver, all the locking and all the ownership handover is implemented in the framework. So driver people can just write the MMIO sequences to light up the display and they have a correct driver. Mostly; there are some special cases where you need to think about your own synchronization, but we have helpers for that. So this is, I think, a pretty good example where we achieved the second priority, making it correct, pretty much by default. And it took a bit longer than I hoped
for. Anyway, here's the summary, and maybe we have time for a question or two. We have about half a minute, so maybe let's stick with one question. So, Daniel, thank you very much for your talk.