Okay. Well, thank you all for choosing to come to my talk. Freedom from choice, right? What do you get with a single track? Okay. So, what I want to talk about is an overview of locking in the FreeBSD kernel. I don't really think of this as a keynote talk; I think of it as sort of a technical talk, but you can go out thinking about locking, which some people seem to spend a good deal of their life doing. It was pointed out to me that that is not the Polish spelling of university. It's also not the English spelling of university. It's just a spelling error.

Okay. So, I'm going to start by giving you sort of a talk about how we historically did synchronization. That was back in the dark ages before many of you were born, when dinosaurs and mainframes ruled the earth, et cetera. We'll then move into the modern age and how we do synchronization today. Then I'm going to go through turnstiles and sleep queues, which are the underlying mechanisms that we use to implement locks. Then I'm going to give you sort of a slide-by-slide detail of each of the types of locks that we have in the system, trying to motivate why it is that we need about eight different ways of doing locking. And then finally we'll talk about the witness system, which is the mechanism that we use to avoid deadlocks.

Okay. So, historic synchronization was done as shown in this slide. We would start by checking for a resource. Just to make this concrete, let's say we need the page that represents the password file. That will normally have been brought into memory; it will be sitting around in memory somewhere, so we'll go find that page in memory. Then, to check for the resource, we need to see whether it's available or not, and there are just a couple of flags defined there. One is a lock flag: if the lock flag is set, the resource is in use, and if the lock flag is clear, it's available. So if we go and find that the lock flag is set, we know that someone else is using it, reading it, writing it, whatever, and so we can't have it right now. There's another flag, the want flag, and we set the want flag and then we go to sleep on it. And when we talk about going to sleep on it, what does it mean to sleep on a resource? Well, for those of you that took my tutorial, you know that forever is not a word that you want to use in the same sentence with operating system. So we do not want to sleep forever, for example. You have to tell it why you're going to sleep, what resource you're waiting for, and if you just decide that you stayed up too late last night and want to nap, you actually have to set an alarm clock. In this case, the identifier for the resource is just the address of the first byte of the resource, and so you'll see sleep being passed all these pointers to proc entries and buffers and other things. It never dereferences them; it's just using the pointer as a unique number. Okay, so we go to sleep on it, and at some point in the future when it becomes available, we'll be awakened. Whoever comes along and doesn't find it locked, i.e. it is available, will start by setting the lock flag. Then they'll use it, possibly sleeping in the process of using it. When they're eventually done, they will clear the lock flag to indicate they're no longer using it, and then they check the want flag to see if it's set. And if it is set, then they know that while they had it locked, some other process came along that wanted it, and they will end up waking up all the processes that are sleeping on that resource.
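To make that historic pattern concrete, here is a minimal sketch of the idea. The structure, the flag names, and the priority value are illustrative stand-ins, not the actual historical code; only the general shape of the old sleep()/wakeup() interface is assumed.

```c
/*
 * Historic, uniprocessor-era locking: a lock flag, a want flag, and
 * sleep()/wakeup() keyed on the resource's address.  All names here
 * are illustrative.
 */
extern void sleep(void *chan, int pri);   /* old kernel interface: wait channel + priority */
extern void wakeup(void *chan);           /* wake everyone sleeping on the channel */

struct resource {
        int     flags;                    /* LOCKED and WANTED bits */
        /* ... the data being protected ... */
};
#define LOCKED  0x01
#define WANTED  0x02

void
acquire(struct resource *rp)
{
        /* Safe only because a single thread of control runs in the kernel. */
        while (rp->flags & LOCKED) {
                rp->flags |= WANTED;
                sleep(rp, 20 /* scheduling priority, historic interface */);
        }
        rp->flags |= LOCKED;
}

void
release(struct resource *rp)
{
        rp->flags &= ~LOCKED;
        if (rp->flags & WANTED) {
                rp->flags &= ~WANTED;
                wakeup(rp);               /* wake all sleepers; let them fight it out */
        }
}
```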
Why all of them? Why not just one of them? Well, again, we just want to be absolutely certain that we're never going to leave anything asleep forever. If we just wake them all up, those that care about it can dive in and fight over it, and the rest of them can wander off and do whatever. If we know that the processes will never wander off, that they will always come around and use it, we can use a version of wakeup called wakeup_one that wakes up just the one that's been sleeping the longest, and then when it's done, it'll wake up the next one, and so on. But if you're not sure, then you just do a wakeup on everything and let them fight it out. And of course, if you've done locking properly, if you've got the right level of granularity, you'll typically never have more than one thing waiting anyway, so it shouldn't normally be an issue.

Okay, so all of this business of setting want flags and lock flags and so on worked fine because we were running on a uniprocessor with a single thread of control, and we didn't have to worry about the race where we check that the lock flag is set, and then, between the time we did that, set the want flag, and went to sleep, the holder releases it and does its wakeup, et cetera. So it was very straightforward, very easy to do. Once you start getting multi-threading going on in the operating system, you have to be much more careful about the way you manipulate these things. When you get down to the bottom of it, a lot of the locks we're going to look at still have things like lock flags and want flags, but they're hidden inside the locking API, and they have mutexes around them to make sure that the right things happen.

Okay, so that's the old world. What's the new world? The new world is that we have this whole hierarchy of locks, and this slide goes from the lowest level, which is the hardware, all the way up to the highest level, which is the witness code that's making sure that bad things aren't happening. At the very bottom level, we need to have hardware support. We cannot implement locks without some support from the hardware itself, and the sort of generic low-level hardware operation that we need is something that both reads and writes a location in memory with no intervening operations on that memory location. Test-and-set is the traditional thing that you would learn from the textbook: we pull an existing value out of the memory location and set it to, let's say, one, and then we look at what we pulled out. If what we pulled out is zero, we know that we got the lock, and if what we pulled out is one, we know that somebody else had the lock. Because those two operations never interleave, you don't get one thread that pulls out a zero, another thread that pulls out a zero, and then both of them sticking a one in there. Once I pulled out the zero, I'm guaranteed that the one is going to be put in before any other thread is able to look at that memory location, and that atomicity is something that the hardware has to give us. Now, as you'll see, we actually use something that's a little higher level than test-and-set; we use compare-exchange (compare-and-swap), and I'll talk later about the way we use that operation, but it's the same idea: we pull a value out and then we stick another value in there.
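As an illustration of that primitive (written here with C11 atomics as a generic sketch, not FreeBSD's actual implementation), a test-and-set style spin acquire might look like this:

```c
#include <stdatomic.h>

/* One word of memory: 0 = free, 1 = held. */
static atomic_int lock_word;

static void
spin_acquire(void)
{
        /*
         * atomic_exchange atomically stores 1 and returns the previous
         * value; the read and the write cannot be interleaved with
         * another CPU's access to the same word.
         */
        while (atomic_exchange(&lock_word, 1) == 1)
                ;       /* someone else held it; try again */
        /* We read back 0, so the lock is now ours. */
}

static void
spin_release(void)
{
        atomic_store(&lock_word, 0);    /* mark it available again */
}
```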
Now, with this low-level instruction, we can start to build software locks, and the lowest-level lock that we build is what's called a spin mutex. A mutex is a mutual exclusion, so it's single-owner, exclusive use, and the spin lock works by looping, at least logically, over test-and-set: it does test-and-set, did I get it? No. And it just waits until whoever holds it comes along and puts a zero back into that location to say that it's now available. Now, obviously with a spin lock you're not getting anything else done, you're just sitting there, and so these types of locks are only used for operations that you do not expect to take very long. It's something like inserting an element into a linked list: you lock the list because you don't want the list to get corrupted while you're changing the forward and backward pointers, and then you release it. So, small numbers of instructions, and hence it's actually more efficient to just sit there and wait for the lock to become available than it is to context switch away and come back later, because context switching away is typically several hundred instructions, coming back is several hundred instructions more, and if you're only going to wait for ten instructions, you're better off just waiting.

Okay, now, as you'll see, spin mutexes are actually used in just a very small number of places in the system, because what usually makes more sense is what's called an adaptive lock, and that's this whole next set of locks here. These are locks that can block briefly, but they're not allowed to go to sleep. What do I mean by blocking briefly? The thing that really identifies this class of locks is that someone else may hold the lock, but the only reason they hold it is that they're doing some set of operations, and when they've finished that set of operations, they're going to release the lock. So if something is locked, what we will typically do is just spin and wait for it, provided the thread that holds the lock is actually running. If the thread that holds the lock is not running, typically because it's waiting for some other lock to clear, then we will block, in particular so that we can potentially hand the CPU over to whoever holds the lock, because if they simply get an opportunity to run, they will finish and give up the lock, and when they give up the lock we grab the CPU back again. So this blocking, as you'll see, is only a very brief block. It's not going to sleep and going off and doing something else, as you'll see for the later class of locks down here.

All right, so in this class that's built around the concept of blocking but not sleeping, we have, first of all, the blocking mutex, which spins for a while and then blocks on a turnstile; I'll describe turnstiles shortly. We have what are called pool mutexes. It turns out that creating a mutex is a non-trivial amount of work: you have to initialize the mutex, which clears the memory and, if we're running the witness code, registers the mutex with the witness, and then when you're done with the mutex, you have to unregister it with the witness and pull it out of the various lists that it's on before you can discard it. And there are some data structures that we need to allocate that don't live for very long.
The classic example of this is the select system call, where you come in, you create a bunch of data structures corresponding to all the things you're interested in finding out about, and then when something becomes ready, we clear all those things out and tell you what's ready. So the lifetime of the piece of memory that's allocated for each of those is the lifetime of one select call, and it turns out that creating and destroying the mutex would nearly double the cost of setting that thing up. So we just keep a pool of pre-allocated mutexes, and when we need one briefly, we go to the pool, grab it, use it, and then give it back to the pool. We don't have to initialize it, we don't have to free it, and in fact we don't even need to allocate space for it, because mutexes are fairly big things; we just have a pointer that points at the pool mutex. So they're just like the blocking mutexes, but they're a pool of them that are available for short-term use.

Next up in the hierarchy are reader-writer locks. These are mutexes that have shared-exclusive semantics, so unlike the low-level mutexes, which are purely mutual exclusion, with these you can have multiple threads that are reading, or of course a single one that has it for writing. And then finally in this class we have something called read-mostly locks. These are locks that are optimized for the read case, so if you have something that you're mostly reading and only occasionally writing, you would use these, because the read case is quick and the write case is much slower than it would be for one of the others. Again, I'll give you the details of these as we start drilling down into them.

All right, the other class of locks that we have are the ones using the sleep queue interface. They look a lot like those traditional ones that I described, and in fact these are really just wrappers around that sleep and wakeup paradigm. We have shared-exclusive locks; these are fast and simple versions of sleep locks. These locks are used in the instance where you have some sort of long-term event. A long-term event is something like: I need to do disk I/O, so I'm going to be blocked waiting for the disk to respond, maybe 20 or 30 milliseconds. You can get a lot done in 20 or 30 milliseconds — that's obviously millions of instructions — so it's well worth going to sleep and letting other things run, and then when the I/O completes we'll switch back. These are also events like waiting for network traffic or waiting for users to type things at keyboards, and some of these long-term events can be very long-term events. All right, we have condition variables, which are really just a wrapper on that traditional sleep and wakeup, and then finally we have the lock manager locks. These are long-term locks; this is the all-singing, all-dancing, we-do-everything interface. You have shared and exclusive access, and you can upgrade and downgrade and drain things in preparation for freeing them, and other things of that ilk. And then finally, the bane of locking is deadlock, so I'll talk about what deadlock is, for the three of you that don't know, and then talk about how we can monitor what's going on to avoid having that happen.

Okay, so before we actually look at the locks themselves, I want to look at the data structure that we are going to use to manage these locks, because if you understand the data structure, the locking just kind of falls out.
These things are called turnstiles. They are used by the blocking mutexes, the reader-writer and the read-mostly locks, and of course the pool locks. As I've already said, these are designed for short periods, typically tens of instructions, and what you actually use them for is to protect read and write access to data structures. So for example, with the locking that we looked at earlier, we need to have a mutex around checking to see if the lock flag is set and setting the want flag, because we don't want a race condition where we check it, see that it's locked, and set the want flag, only to have it be freed while we're doing that. Those are the sorts of operations you'll typically see a mutex around; the lock and unlock are usually in the same place on your screen: acquire the mutex, a few lines of code, and then release the mutex.

Okay. One of the rules that we have is that you're not allowed to hold a turnstile — essentially a mutex — when you request a sleep lock. It is not permissible to go to sleep holding one of these short-term mutexes, because if you did that, the mutex could potentially be held for a very long time, and we don't want that to happen. In particular, one of the rules about these mutexes is that you can always get one released simply by scheduling whoever holds it to run. Now, as you'll see, that may recursively cause other things to run, but the point is that the only reason the mutex is held is that its holder either hasn't gotten to the end of its critical section yet or is just waiting to be scheduled so that it can do so.

All right, we need to track the current holder of the mutex, and the reason we need to track the current holder is something that we call priority inversion. One of the issues we can have is that some relatively low-priority process is plodding along doing its thing and it acquires a mutex, and now some really high-priority, important, critical thread comes along and it needs that mutex, and it can't have it because it's locked. The problem is that this low-priority thing may not get to run for a long time, and we need that mutex now, not sometime when that other process gets around to being scheduled. So what we do is something called priority propagation. This high-and-mighty thread comes along, sees that this other low-priority thread is not running at the moment, and so it blocks and hands the CPU to that low-priority thread, saying: as of right now, you're in my way, so you're important, have this high priority. Bam, the thing starts running, it goes a few more instructions, it releases the mutex, and suddenly it loses all that priority and gets whacked back down where it came from, and the priority comes back to the high-priority thread, which now rips on through. We had this experience actually when we took the train from Berlin to Warsaw. We were on the express train, and they were doing a lot of track work, so they had it single-tracked. The express train would get there, all the other trains would be stopped so we could go the wrong way down the other track, and we'd come out the other side of this thing and there'd be like 14 trains with all these people staring out, wondering who's forcing them to sit there for 15 minutes. It's like: hi.

Okay, so let's take a look at the implementation of turnstiles. I'm going to sort of flip back and forth between this slide and the slide that has the picture.
So the first thing that we need to be able to do is to quickly find the turnstile that's associated with a lock, and so we have a hashing header, obviously, and then a set of pointers. In this picture, across the top here is the hash header, and down the left-hand side you can see I have six threads, one through six, and then these boxes out here are the things that are keeping track of what's going on. We can see up here that this pointer says this turnstile is referencing lock number 18, and the owner of that is thread number one over here. From thread number one, the things it actually owns are this one, which is lock 18, and then, coming down over here, it also holds lock 15. Finally, we can see that waiting for lock number 18 are thread number two and also thread number three. So if we have a question of who owns lock number 18, we can just do a hash — the hash header here is keyed on the lock address — and 18 would hash to this list here, so we go down the list, we see that yes, this one is lock 18, and hence we can quickly identify that thread number one is the owner.

Okay, so a turnstile is needed each time a thread blocks. If a thread blocks, we need one of these turnstile structures to track who owns the lock and who is waiting for it: when thread two comes along, we need one of these structures. Now, we could have one of these structures allocated for every single lock in the system, but in fact, if you think about it, a thread can only ever be blocked on one thing at a time, because if it's blocked, well, it's not running, and it therefore can't try to get something else. So instead of allocating one of these turnstiles for every lock in the system, we allocate one for every thread: part of creating a thread is that you also allocate a turnstile to go with it. So when thread two shows up and wants to get lock 18, it's blocked, so it hands its turnstile structure over to the kernel, and the kernel uses it to link this in. Now thread three comes along and it also blocks, and you'd say, well, it's going to hand its turnstile over, but we already have this turnstile, so you would think we don't need the one from thread three. But we still take it — you see this one over here, it's not linked into anything, it's just hanging off this list of extras. The reason for this is that once thread one gives up the lock, we're going to give the lock to thread two. Thread two is now going to be running, so it's going to need a turnstile back, but its turnstile is still in use here, tracking the fact that thread number three is blocked. Luckily, we have thread number three's turnstile that it turned in here, so we just give that one to thread number two — not the one that you handed in, but you'll get a turnstile — and now thread two can go wandering off on its way. Finally, when thread number three gets to run, after thread two has released the lock, there's nobody waiting, so we can pull this turnstile out of use here, return it to thread three, and let it go wandering off. The reason that we allocate one per thread is that there are far fewer threads in the system than there are locks: if you think about it, every process entry has at least two or three mutexes in it all by itself, and then of course there are mutexes all over the place for lots of other things, so it's a lot more efficient to have one turnstile per thread than it would be to have one per lock.
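As a rough illustration of what's in the picture — a simplified sketch of the idea, not FreeBSD's actual struct turnstile, with illustrative field names — the structure might look something like this:

```c
#include <sys/queue.h>

struct thread;                                  /* kernel thread, opaque here */

/*
 * Simplified sketch of a turnstile: one is allocated per thread, and
 * when a thread blocks on a lock, its turnstile is donated to the
 * kernel to describe that lock's waiters.
 */
struct turnstile {
        LIST_ENTRY(turnstile)   ts_hash;        /* chain in the hash bucket keyed by lock address */
        LIST_ENTRY(turnstile)   ts_link;        /* chain in the owning thread's list of held locks */
        LIST_HEAD(, turnstile)  ts_free;        /* spare turnstiles donated by later waiters */
        const void              *ts_lockobj;    /* the lock this turnstile describes (lock 18 in the picture) */
        struct thread           *ts_owner;      /* current holder (thread 1 in the picture) */
        TAILQ_HEAD(, thread)    ts_blocked;     /* threads waiting for the lock (threads 2 and 3) */
};
```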
Okay, so that's how these things get managed. Going back here: a turnstile is needed when a thread blocks; a thread blocks on only one thing at a time, so it provides its turnstile; unneeded turnstiles are saved and returned when a thread awakens. And there's the priority inversion that I've already talked about: if the holder of a lock has a lower priority than the thread that's about to block, we recursively propagate the higher priority to the holder, but only until it releases the lock. It propagates to whoever holds the lock; if they in turn are blocked by some other thread, it will propagate down to that thread, and it keeps going down until we find a thread that is just waiting to be scheduled. Once we schedule it, it releases the lock, and the priority just sort of bubbles all the way back up again. Okay, so turnstiles are really sort of the crux of how all this works, and if you just keep that in mind, the implementation of the locks becomes fairly straightforward.

Okay, sleep queues look a lot like the turnstiles I've just described. These are used by the longer-term locks, designed for long-term periods. There's no priority propagation with sleep queues, because the threads are not waiting for CPU, they're waiting for some event to happen. Making a process that's asleep waiting for disk I/O runnable isn't going to make the I/O finish any sooner; it's going to finish when it finishes, and it's pointless to give it a higher scheduling priority. You cannot hold a turnstile type of lock when you request a sleep-queue lock, because that would make it impossible to do priority propagation. Like the turnstile locks, we do track the current exclusive lock holder, but by default we do not track all of the readers of a lock; as you'll see, we actually have one interface that allows you to do that if desired. And then these locks are allowed to be recursive. Now, what do I mean by a recursive lock? What I mean is that you can come in and grab exclusive access to a lock, and then later come along and ask for the same lock exclusively a second time, and you'll get it. What happens is that we keep a reference count of how many times you've asked for the exclusive lock, and it just goes up, up, up; then each time you do an unlock we decrement the count, and when the count goes to zero, the lock is released. Now, you might wonder why you would ever want recursive locks, and the answer is that normally you don't: if you ask for a lock that you already own a second time, that's usually a programming error. We sometimes refer to these as bugs. So by default, locks do not allow recursion, and if you ask for a lock that you already own, you will panic with "locking against myself." However, there are places where it is sensible to have recursive locks.
An example of this is the layering that we have in the file system. You can have a ZFS file system, and sitting on top of that can be NFS, which is exporting it somewhere else. So you come into the NFS layer, and it locks the object that you're trying to deal with, and then the request actually gets passed down to ZFS to provide the actual data, and ZFS, being unaware that the request is coming from NFS, goes ahead and locks the thing that it's about to manipulate. If we didn't allow recursion, at that point we'd panic. In the old days we didn't allow recursive locks, and so there were all these flags you'd pass up and down saying "I've already got it locked, don't ask for the lock," and you just never get that right — and besides, it's a horrible layering violation. So we just said, fine, we'll allow locks to be recursive: it can be locked by NFS and then locked by ZFS, and then ZFS unlocks it and then NFS unlocks it, and then it's released. But when you create the lock, you have to say that this one is allowed to be recursive. Of course, when people get the "locking against myself" panic, sometimes their solution is to say, oh, just make the lock recursive. It's like: no, think about this, guys — there's a reason we have that panic in there.

Okay, so now let's work our way up through all the various different locks that we have. The first thing we have isn't really a lock per se; it's what's called a critical section, and it uses critical_enter and critical_exit. When you do a critical_enter, that says that you may not be descheduled and you may not be moved to another CPU: you are on this CPU, you're not going to be moved anywhere else, you're not going to be preempted, you're just going to run until you do critical_exit. Critical sections are used much like the old single-threaded kernel used the thing called SPL, where you could block out classes of interrupts for a period of time so you knew that an interrupt wouldn't come in and mess around with things. They're useful for protecting per-CPU data structures: if you have a run queue that's only ever going to be accessed by that CPU, you can just use a critical section to protect access to it, and then you don't need to have locks and mutexes around that thing. A critical section does not protect system-wide data structures, because other CPUs can continue running, so if you have a data structure that can be accessed by two different CPUs, a critical section is not going to help you. But there's a small amount of per-CPU data, and it is protected with these critical sections.

Okay, now, I already gave you the thumbnail sketch of the hardware requirements for locking. The minimum requirement is test-and-set; on modern hardware there's compare-and-swap, which is what FreeBSD uses. You need to be a little careful about how you use these instructions. You'll notice on most architectures that there are two flavors of test-and-set or compare-and-swap. There's the lightweight version, which just makes sure that if there's a context switch, the instruction will either have completed or not have started — it won't get stopped halfway through the instruction — but it's not locking against other CPUs. And then there's the interlocked test-and-set, which historically locked the memory bus so that nothing else could get on there while it did the read and the write operation. Well, locking the memory bus is not real good for performance if you do a lot of that, so you do not
want to create a spin lock by doing interlocked test-and-set and just keeping the memory bus constantly locked up. So what you'll typically see is that you do the heavyweight interlocked test-and-set once, and then, if you don't get it, you do the lightweight check to wait until it looks like it's going to be available, and then you do another single one of these heavyweight interlocked instructions. You may or may not get it, because someone else may beat you to the punch, but you want to avoid doing the thing that causes this memory-bus locking. Now, modern architectures don't lock the whole memory bus anymore, but they do lock a chunk of the address range, so it's still a good idea to avoid using those instructions any more than necessary.

Now, the compare-and-swap instruction, instead of just dealing with a single bit, takes a value to compare: it pulls out the entire value of the word, compares it with the value you've asked it to compare against, and then sticks in whatever it is you tell it to put in there. So the owner field for a lock that's free contains a specially defined value called "mutex unowned," which says this thing is not currently in use — it's free, it's available — and the owner field for a lock that's held contains a pointer to the thread that actually owns it. What happens is that when we do the compare-and-swap, the acquisition attempt compares the lock owner with "mutex unowned," and if the value it pulls out matches "mutex unowned," it stores into that location a pointer to the thread that has just acquired it. If it doesn't match "mutex unowned," i.e. the lock is already held, then it leaves the previous owner value in there, and that of course is what we have to look at. So after that instruction we look at what the previous value was: if the previous value was a pointer to some thread, we know that the lock was held, and in fact we know which thread holds it; if the value we got back was "mutex unowned," then we know that our pointer has been stored in there and we are the happy owner of this mutex. And when we're done with it, it's very trivial to give it back: we just unconditionally store "mutex unowned" into that field, and that makes it available for someone else to grab. Okay, so that's the low-level mechanism that we have for implementing all of these other locks.
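A sketch of what that compare-and-swap acquire and release might look like is shown below, written with C11 atomics as an illustration; the names, including MUTEX_UNOWNED, are stand-ins and this is not the actual FreeBSD mtx code. In the contested case the real release path also has to notice waiters and deal with the turnstile, as described next; this shows only the uncontested paths.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MUTEX_UNOWNED   ((uintptr_t)0)          /* sentinel: lock is free */

struct simple_mutex {
        _Atomic uintptr_t       owner;          /* MUTEX_UNOWNED or pointer to owning thread */
};

/* Try once to acquire; returns true on success, otherwise *prev is the current owner. */
static bool
mutex_try_acquire(struct simple_mutex *m, uintptr_t self, uintptr_t *prev)
{
        uintptr_t expected = MUTEX_UNOWNED;

        /*
         * Atomically: if owner == MUTEX_UNOWNED, store our thread pointer;
         * otherwise leave it alone and hand back the previous owner so the
         * caller can see who holds the lock (e.g. to spin or to block).
         */
        if (atomic_compare_exchange_strong(&m->owner, &expected, self))
                return (true);
        *prev = expected;
        return (false);
}

static void
mutex_release(struct simple_mutex *m)
{
        /* The uncontested release is just an unconditional store of "unowned". */
        atomic_store(&m->owner, MUTEX_UNOWNED);
}
```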
Okay, so starting with spin mutexes. A spin mutex is exclusive access only; it loops waiting for the mutex to become available, so it just sits there trying over and over to get it, and it runs inside a critical section while the lock is held, essentially to avoid deadlocks (I don't really have time to explain that deadlock situation). It's actually more expensive to obtain than a blocking mutex, and in FreeBSD it's only used for low-level scheduling and context switching. Other than that one little domain, everything else is going to be a higher-level type of lock, like a blocking mutex.

Okay, blocking mutexes are also exclusive access. They use adaptive spinning, so they'll only spin waiting as long as the current holder of the lock is actually running. If the current holder of the lock is not running, then we will block, typically so that we can hand our CPU over to whoever holds the lock; they'll finish using it, and then, assuming we're higher priority, we'll grab it back as soon as they're done. When we finish using a lock, you remember there was that list of others waiting for access to the lock: in fact we awaken all of them, we do not just awaken the first one. The reason for this is that it's much cheaper to release an uncontested lock, because all we need to do is store that value in there; we don't need to run around and deal with turnstiles and run down lists and all those other things. And you'd say, oh yeah, but then they're all going to dive in, potentially on separate CPUs, and they're all going to fight with each other, and we're going to end up creating turnstiles again. But that's not going to happen, because of the adaptive nature of these locks: one of them is going to get it, and the others are going to see that the one that has it is running, so they're just going to sit spinning and waiting, and when it gives the lock up, one of the spinners will get it. The other thing is that if they're all at the same priority, then typically they're going to get scheduled one after the other anyway. But in any case, the upshot is that it's just cheaper to wake everybody up. Now again, there's a certain assumption going on here: that you have fine enough granularity on your mutexes that you're not going to end up with hundreds of things waiting on a mutex. If you've got hundreds of things waiting on a mutex, you need to rethink the way you are implementing that data access.

Okay, pool mutexes. I've pretty much already given you the details on these: they're used for small, short-lived data structures, something where you just need a mutex for a little bit and it's not worth the whole cost of setting it up and tearing it down. It also keeps your data structure small, because it just needs a pointer instead of embedding an entire mutex. As I said, the typical example is the select system call, where you need a structure just long enough to track a request during the period of a single system call.

Next up are the reader-writer locks. In addition to the exclusive access that you get, they also provide shared semantics. They use a turnstile, so obviously you can't sleep, as with any of these other short-term locks. They have priority propagation for the exclusive access, but they do not provide priority propagation for the shared access. And unlike the others, reader-writer locks can be declared recursive in the way that I described.

The last of the short-term locks are the so-called read-mostly locks. They have the same properties as the reader-writer locks, except that they do add priority propagation for shared access where they can. The problem is that you need a data structure to keep track of who all the readers are, so if you want the readers tracked in order to provide priority propagation, the interface requires the caller to pass in a data structure that will be used to track all the readers. Each time a reader comes in, it hands in one of these structures; it's really just a linked list, and each element points to a reader, so you just add that little structure onto the list and set its pointer to point to whoever gave it to you, and in this way we can figure out who all the readers are. So if we have a writer that's really low priority holding the lock and a bunch of readers waiting, and a high-priority reader comes along, it can propagate its priority to the writer, so the writer will finish using the lock and then the readers can get in.
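In FreeBSD this corresponds to the rmlock(9) interface, where the per-reader tracking structure is the rm_priotracker the caller supplies. Roughly, usage might look like the sketch below; the lock name and the routing-table framing are illustrative, and the exact signatures should be checked against rmlock(9).

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

static struct rmlock route_lock;                /* hypothetical read-mostly lock */

void
example_init(void)
{
        rm_init(&route_lock, "route table");    /* set up the read-mostly lock */
}

void
example_reader(void)
{
        struct rm_priotracker tracker;          /* caller-supplied reader-tracking structure */

        rm_rlock(&route_lock, &tracker);        /* fast path: register ourselves as a reader */
        /* ... look up a route ... */
        rm_runlock(&route_lock, &tracker);
}

void
example_writer(void)
{
        rm_wlock(&route_lock);                  /* slow path: exclusive access for updates */
        /* ... modify the table ... */
        rm_wunlock(&route_lock);
}
```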
Okay, read-mostly locks are designed for fast access by readers — I think I've said that about 18 times now — and they assume that there aren't going to be very many writers, so we do what's called opportunistic locking. That is, we just assume there isn't going to be any writer, and if it turns out after the fact that oops, we were wrong, then we have to back up and do the whole thing the hard way. The place you find this in the system is something like the routing table: the routing table doesn't change all that often, but you're looking up routes on pretty much every packet that goes out of the machine, so there are lots of readers of the routing table and not so many writers, and having the read case optimized to be fast is a benefit.

Okay, so the actual best way to implement read-mostly locks is patented by IBM. Now, IBM allows GPL'd code to use their patented implementation at no cost, but if you're not GPL'd, you're not allowed to use the patent, and there's an obvious motivation there. We went to them and said, well, you know, we're open-source software, can't we use it? And they said, well, basically, if we let you do it, then it essentially makes the patent worthless to us, because at that point anybody can pretty much use it. They had a point. But what it means is that since FreeBSD is not GPL'd, this actually works against us, because we can't use the optimal solution and have to use a slightly slower technique. The fact of the matter is that what we do is pretty close to that, but it's just far enough off that it doesn't violate the patent, and surprisingly it looks rather similar to that technique and isn't really very much slower. I think that's enough about that, since I'm on video.

Okay, shared-exclusive locks. These are the fastest and simplest of the locks that are allowed to sleep, and they provide shared and exclusive access. You can specify that it's okay for them to recurse. Where they start to differ is that on the request you can say that you will allow it to be interrupted by a signal, and in fact, if it's a really long-term sleep, you should allow it to be interrupted by a signal. The rule of thumb is that if it's a lock that might be held for more than about a second, then you really should set it up to allow signals to come in. If it's something like disk I/O — since disks never flake out and always respond within 20 milliseconds — it's fine to just say: if a signal comes in, hold that signal pending and let me finish what I'm doing, and then, after I've gone through and gotten ready to return back to the user, we'll post the signal at that time. That makes the code very easy, because you know that when the sleep wakes up, the only reason it woke up is that whatever you were waiting for is done, so you don't have to check anything, you just proceed. If it's a long-term issue — for example, you're waiting for the user to type something — you don't really know when the user is going to type the next key, and some of these users go off and take five-week vacations, I mean work, so it's going to be five weeks before they type the next character, and holding the signal that long is probably not a good plan. So in the cases where it's potentially a long-term sleep, you set a flag saying: I'm willing to be interrupted by a signal. What this means is that when you return from the sleep, you have to check and see: did I wake up because what I was waiting for is done, or did I wake up because a signal came in?
And if a signal came in, then you have to clean up whatever it was you were doing, return back up, and post the signal, and then, if the user returns from the signal handler and has requested that system calls be restarted — which is the default — you have to come all the way back down and get everything set up again. So there's much more coding and many more things to worry about if you say that you can be interrupted by a signal, and you would just as soon not do that, but if it's potentially going to be a long time, you need to. Shared-exclusive locks have very limited upgrade and downgrade capabilities, and like all sleep locks, they do not implement priority propagation.

Okay, next up is condition variables. These are really just a wrapper on the traditional sleep and wakeup. You can wait, you can have an optional timeout, and you can specify that you want to be interrupted by signals. The timeout just says: I want to go to sleep and wait for this, but if it doesn't happen in ten seconds, I want to wake up anyway because I'm bored and want to do something else. And you can specify that if a signal comes in, you should be awakened. Of course, if you specify either of those things, you need to check and see whether that's what happened when you return from the sleep. It does allow you to wake up just one of the potential waiters, or you can say wake up everyone. If you're not sure which of these you should use, use the one that wakes everybody up, because otherwise you could potentially end up with something sleeping forever. If you absolutely know that anybody that's waiting will play well and come along and wake up the next one, you can go ahead and wake just one, but much as with the mutexes, waking everybody up doesn't typically end up with everyone diving in and going back to sleep; they tend to serialize.

Okay, the rule is that you have to hold a mutex when you go to wake somebody up or when you're going to wait. So you get the mutex and then say, I want to go to sleep. This is to avoid a race condition where you don't quite get to sleep before someone else comes along and tries to wake you up: they go to wake you up, you're not asleep yet, so they don't see you, the wakeup is lost, and then you go to sleep and you're asleep forever. So what ends up happening is that you hold this mutex, the code gets you all the way to sleep, and then, as the last step of putting you to sleep, the mutex is released. Now, when someone wants to wake you up, again the mutex is used, so that the wakeup can't happen while you're in the process of going to sleep: they acquire the mutex, they call the wakeup, and then the mutex gets released as part of doing the wakeup.
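As a sketch of that pattern using the FreeBSD-style condition-variable and mutex interfaces (names as in condvar(9) and mutex(9); the device structure and the "ready" flag here are hypothetical):

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

/* Hypothetical per-device state protected by a mutex, with a condition variable. */
struct mydev_softc {
        struct mtx      sc_mtx;
        struct cv       sc_cv;
        int             sc_ready;       /* the condition being waited for */
};

void
mydev_setup(struct mydev_softc *sc)
{
        mtx_init(&sc->sc_mtx, "mydev", NULL, MTX_DEF);
        cv_init(&sc->sc_cv, "mydevrdy");
        sc->sc_ready = 0;
}

void
mydev_wait_ready(struct mydev_softc *sc)
{
        mtx_lock(&sc->sc_mtx);
        while (!sc->sc_ready)
                cv_wait(&sc->sc_cv, &sc->sc_mtx);  /* mutex dropped while asleep, retaken on wakeup */
        mtx_unlock(&sc->sc_mtx);
}

void
mydev_mark_ready(struct mydev_softc *sc)
{
        mtx_lock(&sc->sc_mtx);          /* holding the mutex closes the lost-wakeup race */
        sc->sc_ready = 1;
        cv_broadcast(&sc->sc_cv);       /* wake everyone; cv_signal() would wake just one */
        mtx_unlock(&sc->sc_mtx);
}
```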
Okay, finally we have the all-singing, all-dancing lock manager locks. These are the full-featured ones. They provide shared and exclusive access; you can do recursion, timeouts, and interruption by signals; you can do downgrades; and you can do exclusive upgrades, which say: I want to upgrade from shared to exclusive, and I don't want anyone else to sneak in and get exclusive access between the last reader going away and me getting it — and if you can't provide that, then return an error rather than giving me the lock. You also have the ability to pass ownership of the lock from a thread to the kernel. This turns out to be a very useful thing for I/O, because some application is going to come along and do a write, and we're going to lock the buffer while the write is being done. But the application typically doesn't wait for it: we hand the buffer off to the disk and return back up to the application, and later, when the I/O completes, that lock is going to be released. Well, by the time the I/O is complete, the process that initiated it may not even be around anymore — it may have exited and be completely out of the system — so it can't unlock the resource. And there's a check that says whoever unlocks it has to be the one that locked it; you don't want some random other thread to come along and say, oh, it's locked, well good, you unlock it, now let me have it. So it's again a panic if you unlock a lock that you didn't acquire. Well, that's not going to work very well when the disk driver tries to unlock this thing it has finished doing the I/O on and the process that got the lock in the first place doesn't even exist anymore. So we have to have some way of dealing with this, and what happens is that if you're going to hand the buffer off to the disk driver, you change the ownership: I am no longer the owner of this, I am passing ownership of this lock to the kernel. Now, when the device driver is done, it does the unlock as the kernel, and that's fine and you won't get a panic.

We also have the ability to drain all of the accessing threads. If this is a lock that's embedded in a data structure that I want to free, I want to make sure that there's nobody, say, blocked waiting to get access to it, so I call the lock manager and say: drain this lock, don't let any new requests come in, and when all the ones that are already waiting have proceeded through, so there's nobody waiting for it, let me know. Once you've done that, you know it's safe to deallocate it. And of course, like all sleep locks, it does not do priority propagation.

Okay, so we finally get to the pièce de résistance, and that is: how do we deal with deadlock? Well, first of all, in order to have deadlock you've got to hold two or more locks; if you only ever hold a single lock, you're not going to get deadlock. Unfortunately, nice as it would be to set a rule that says you're never allowed to hold more than one lock at a time, that is not a practical thing to do, so we need to have some mechanism for avoiding deadlock. To see how we get into a deadlock situation, here we have two threads, thread A and thread B. Thread A comes along and acquires this lock R1, and that's fine, it gets it, it's locked. Thread B comes along and asks for R2, and that's fine, it gets it. Now thread A comes along and asks for R2, and we say, oh well, that's already held, you have to wait until it's available. Then thread B comes along and asks for lock R1, and we put thread B to sleep saying, oh well, that's not available, we'll put you to sleep until it becomes available — and that could take a long time. Traditionally in operating systems, if you look back at the old days, first of all there weren't all that many locks, but more to the point, they didn't try to avoid deadlock; they just had a deadlock-manager process, and it would wander around and look for situations like this happening, and if it saw that one had happened, it would just arbitrarily pick one of the threads and say, okay, you're dead, and that solved the problem — which was fine, unless it was your Emacs where you hadn't written out the file in two hours, in which case you might be a little cranky. So when the time came to do locking in the UNIX system, we decided that rather than running around and trying to find deadlock, which is
actually not all that easy — I mean, this one is sort of obvious, but they can be a lot harder to detect — we wanted instead to come up with a way of ensuring that we never could deadlock. What you need to do is put what's called a partial ordering on your lock requests. The two rules for the partial ordering are as follows. We take all the locks and put them in classes, so class one here has R1, R1 prime, and R1 double prime, and class two has R2, R2 prime, and R2 double prime. Now, the first rule is that a thread can only acquire one lock in a class: you can't get both R1 and R1 prime, just one of those. The second rule is that you can only acquire a lock in a higher-numbered class than the highest-numbered class for which you already hold a lock. So in this case, if you hold R1 you can acquire something out of class two, but if you hold something in class two, you're not allowed to ask for something from class one.

So in our example here: thread A comes along and gets R1, that's fine. Thread B gets R2, that's fine. Thread A comes along and asks for R2; R2 is in a higher class than the one R1 is in, so that's fine, it blocks. Now thread B wants R1, and it goes: oops, that's in a lower-numbered class than a lock I already hold, so I have to release lock R2 and then request R1. Well, of course, as soon as it releases R2, thread A is going to get it; thread A is going to run, thread A is going to release both of these locks, and now thread B is going to get R1, and once it gets that, it can come back and reacquire R2. You'll actually see places in the code where we release a lock, acquire another lock, and then reacquire the one we just released, and if you're wondering why in the world that code is there, it's precisely to make sure that we follow this set of rules.
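To illustrate that release-and-reacquire pattern, here is a hedged sketch with two hypothetical mutexes (assumed already initialized with mtx_init), where the agreed ordering is that the class-one lock must always be taken before the class-two lock:

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/* Hypothetical locks: by convention r1_mtx (class 1) orders before r2_mtx (class 2). */
static struct mtx r1_mtx;
static struct mtx r2_mtx;

void
thread_b_path(void)
{
        mtx_lock(&r2_mtx);
        /* ... work that only needs the class-two lock ... */

        /*
         * We now also need the class-one lock, but acquiring it while
         * holding a class-two lock would violate the ordering, so we
         * drop R2, take R1, and then reacquire R2 in the proper order.
         */
        mtx_unlock(&r2_mtx);
        mtx_lock(&r1_mtx);
        mtx_lock(&r2_mtx);

        /*
         * Anything we learned about the R2-protected data before
         * dropping it may have changed and must be revalidated here.
         */

        /* ... work that needs both locks ... */
        mtx_unlock(&r2_mtx);
        mtx_unlock(&r1_mtx);
}
```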
Okay, well, in the old days this set of rules was codified in the comments, so every data structure that had locks associated with it would have comments explaining which class it was in and so on, and of course everybody read the comments, and they were completely up to date, so we didn't have any problems. Unfortunately, once we started putting in the multi-threading stuff, now, in addition to all the sleep locks which we'd had traditionally, you've got all these mutexes, and there's just this explosion in the number of locks in the system, and keeping track of what the classes are — particularly if it's an area you don't normally work in — gets hard. I know the file-system lock hierarchy, but you get me into the networking code and my head hurts trying to figure it out. So it's difficult, with so many locks, to just be knowledgeable about the order in which you can acquire them, and some of the orderings are non-obvious.

So what got introduced was this thing called the witness code. The idea of the witness code is that it keeps track of these classes and the ordering of the classes, and it watches every lock acquisition and release to make sure that the rules get followed, and if they aren't followed, it complains about it. Well, this still requires that we figure out what all these classes are and what the correct ordering is, and that's a lot of work all by itself. So although it's possible for programmers to define these classes — and a little bit of that needs to be done — for the most part we just let the witness code figure it out. What happens is that the witness code simply observes the way locks are being used and from that intuits what the classes must be. And you'd say, well, what if it gets it wrong? Well, when it gets it wrong, that's when the programmers go in and define things for it, and as I say, there are probably 10 or maybe 20 of those definitions, and once those are in place, it's sort of like a Rubik's Cube: there are just a few little variables, and once you've got it locked down, it just is what it is. The fact of the matter is that the witness code figures it out pretty much as the system boots up, because that's when many, if not most, of the locks are allocated — certainly all the classes are being defined — and the startup code doesn't change a whole lot, so the ordering that it discovers during startup, together with the hints that it's been given, pretty much locks it down. It's very, very rare that some change gets made that causes you to have to go fix it.

Now, when one of these lock-order reversals occurs, you actually get a lot of information. What it will do is essentially give you a back trace of where you were at the time that the second lock was acquired, and it will say: in that function, at that line, you acquired this lock, and now, in this function, at this line, you are trying to acquire this other lock, and you're not allowed to acquire that lock when you hold the first one. Usually you look at it and you go, duh, and fix it. Sometimes you look at it and you go, huh? And in fact there is a mailing list called the LOR list — the lock-order-reversal list — where some of these really nasty problems get posted. It's like, well, I got this, how can I possibly fix it? And then endless debate rages and eventually people come to a conclusion. The kind of place where it gets really nasty is, just to pick one of my favorite things to pick on, the networking code, where you have a flow of data that starts at the socket and flows down out to the network, and a flow of data that's coming from the network up to the socket. Well, the obvious order to lock is top to bottom for one and bottom to top for the other, and as soon as those two pass each other, you're going to have a gazillion lock-order reversals. So how do you do the networking so as to not have LORs? That is an exercise left for the reader, because I'm out of time. Thank you very much.