 Thank you. Is the mic working? Yeah? OK, good. So I'm Larry Hastings, and this is Python's infamous gil. I'm going to talk about the gil. I'm going to talk about what it is, what problem it's trying to solve, how it solves it, what the ramifications of those choices are, what its history is like, and what its future is like. And I originally proposed this talk at being a bit of a more of an advanced level. And the way that I wrote it, it's really kind of friendly for beginners too. So don't get scared. So let's run the clock back to 1992. Python's about two years old. It was still young and fresh and evolving rapidly. It was mostly used around CWI, where Guido was working, but it also been released on the internet. There was a new technology out there in operating systems, something called multithreading. This isn't something that people had dealt with very much to date. It was seen on Mock, I think it was available on SunOS. Most operating systems didn't support it. It was in the betas for the new operating system from Microsoft called Windows NT. But this multithreading, what it was, is it was kind of like multiprocess, except that it was all in one process. You had multiple threads of execution happening inside of the same process wall, the same shared state. People wanted to try experimenting with threads in Python, but you couldn't just start adding threads and expect everything to work. There was going to be a problem. The first problem is globals. So a global in C is kind of like a module attribute in Python. It's something, there's only one of them, and everybody knows where it is. If you have multiples of these threads and are all trying to talk to the same global variable, you're probably going to have problems. As an example of one of these globals in Python, there's a pointer to the current executing frame object. And that's, there's one of them, and it's a global variable, and everybody can see it. If multiple people tried to play with the frame object pointer, you'd have trouble. An even worse problem, though, is reference counting. So Python uses a technique called reference counting to manage the lifetimes of objects, to destroy them when nobody's interested in them anymore. And I'm going to do a deep dive right now about what reference counting is, how it works, and how, what the ramifications are of reference counting is with multi-threading. So reference counting. Inside of the Python interpreter at runtime, every single Python object starts with these two fields. The second one is the type of the object, and the first one is what's called the reference count. This is an integer, and this integer tells you how many people are interested in this object right now, how many people are holding references. So if the reference count is three, that means that there are three pointers pointing at this object. And what you do with this reference count is every time you take a new reference to the object, you increment that number. And every time you are no longer interested in the object, you decrement that number. So you ingref and decref. When the object count reference count drops to zero, nobody's interested in this object anymore, and so you should destroy it. You call the Dunderdell if it's got a Dunderdell, and you release the memory, either giving it back to the operating system or recycling and using it for another object. So in C, there are these things called macros. This is kind of like a subroutine call, but it's actually done as more of a textual substitution. So ingref and decref are how you manage the reference count inside of C. These are what they look like today. The ones in 1992 looked much the same. So let's look at what happens if we have two threads that are trying to deal with the reference count at the same time. I'm gonna show it to you in exhaustive detail, and I'm gonna show it to you both when it works and when it fails. Now thread one and thread two, both have references to this object in the center, and they're both gonna drop their references, so they're both calling decref. Let's turn the decref into the underlying C, but even this isn't deep enough. We actually have to go one level deeper and look at the assembly language that this was compiled into by the C compiler. This is actually just pseudo-assembly language, but this is very accurate to what's actually going on. So let's say that the object has a reference count of three. Thread one loads that reference count into something called ax. Ax is a register on the CPU, and you can think of it as a variable. Each thread has its own set of registers, effectively, so they get swapped around. So thread one loads the value of three into ax. It then decrements it, so now it's two, and it stores it back into the object, the reference count, and the object is now two. Thread two comes along, does the same thing. Loads it into ax, decrements it, and stores it, and now the reference count on the object is one. This worked. We have two threads, they both drop their references, the reference count on the object dropped to one. Now let's do it again, but this time we're gonna do those things in a different order. Thread one loads it, and now it shows it three. Thread two does it the same time, almost at exactly the same time as thread one does it, and it also sees three. Thread one decrements, so does thread two. Thread one stores, so does two. But thread two and thread one both wrote the same value. Both threads dropped their references. It should have dropped from three to one, but it only dropped from three to two. This is a bug. This is what's called a race condition. Both of the threads were racing along, and they kind of stepped on each other's feet. Now you have the inverse problem with ink ref. If you have two threads that both ink ref, the object at the same time, then the reference count can be one too small. The problem here, you get different results depending on which it is. If you have the reference count too large, then what happens is, if everybody releases their references to the object, the reference count is still positive, it's one or more, and nobody has references to the object, so nobody's ever gonna clean it up. So now this is leaked memory, and if there's a finalizer, and it has resources, those finalized resources never get finalized, then you're leaking resources too. The inverse problem is much worse. If the reference count is too low, then let's say that it's two, and there are three people who have references to it. The first guy drops, and it drops to one. The second guy drops, and it drops to zero, and it goes away. We now clean up the object. We release the memory either back to the operating system or reuse it, but there's still one guy who has a reference to this memory he thinks it's a valid object. What's he gonna see? If the memory's been given back to the operating system, the program will just crash. If it's been recycled to use for another object, he's gonna see garbage values, he might crash, he might compute some inaccurate result. Who knows? But this is a really nasty bug, and it's really hard to debug. So getting your reference count numbers wrong is really painful, so we wanna prevent this from happening if we're going to make Python multi-threaded. So the first approach that one might think of would be to add locks around everything. A lock is something used in multi-threaded programming to prevent multiple people from using something at the same time. You can think of it as kind of like being a bouncer. There's a bouncer standing at the bar, and he only lets one person inside at a time. So the first person comes along, he says, go on it. The second person comes along and they have to wait until the first person comes back out, and then the bouncer says, okay, now you can go. So we could throw a lock around every single global thing. We could say, in order to use the frame pointer, you have to be holding this lock. In order to be hold, or maybe we could do it for groups of things together. We could group together all the stuff that's for the runtime currently executing stack frame. And we also would need at least one, maybe more than one lock, for reference counts. The problem with using multiple locks like this is you have a special kind of race condition called the deadlock. So let's say that we have two threads here. This is an example of a deadlock. A and B are locks, and thread one and thread two are gonna try and use those locks. The problem is that they grab them in the opposite order. So thread one grabs A, then B. Thread two grabs B, then A. And the problem is if you execute them in this order, lock, thread one tries to get lock A, and that works fine. Thread two tries to grab lock B, that works fine. Now thread keeps going. Thread one keeps going and tries to grab lock B. Nope, has to wait. Thread two tries to grab lock A. Nope, has to wait. These two threads are now deadlocked. They will never make progress. And if any other thread comes along and says, I want thread A, or lock A, or lock B, it's gonna get stuck too. This is a deadlock. This is another painful thing in multi-threaded programming. You don't want to have this problem. So Guido, very carefully, in 1992, very cleverly, created something called the GIL. That's it. That is the interpreter lock. Since it's global to the whole Python running process, we call it the global interpreter lock, or the GIL for short. That's what it looked like in 1992. Almost two years to the date, by the way, since the Python repository was created. If you want to see what it looks like today in Python 2.7, that's it. It really hasn't changed in whatever this is, 23 years. In Python 2.7, anyway, we'll get into that in a minute. Now, in order to interact with the Python interpreter to do anything, you have to be holding the interpreter lock. In order to interact with an object, in order to run Python bytecodes through the interpreter, you have to hold the GIL, which is both good and bad. So let's talk about the good. This is simple. Simple is easy to get right. Because it's part of the Python CAPI, extension modules have to hold the GIL in order to interact with the Python interpreter. This made it a very simple rule, and it was very easy for extension authors to get right. This led to people writing lots of extension modules for Python. The story of the Python GIL is really the story of the success of Python. People who hate the GIL don't understand Python's history. Without the GIL, Python would not have been as successful as it is today. So also because you only had what GIL, you can't have a deadlock because there aren't two locks that you have to deal with. Now, if you have IO bound code, if you have multiple threads and they're all blocking on IO, then this also just works magically. Your program runs as fast as it can. If your program is CPU bound, then your program is effectively single threaded because again, only one thread can hold the GIL at a time and therefore only one thread can be interacting with the Python interpreter. So if you have three threads that all wanna be running code, only one of them can do it at a time. Therefore, you're effectively single threaded. We're gonna talk about that. That's why people don't like the GIL. Now about this, the reason that it works magically for IO bound code, there's another one of these macros. It's two of them actually. Pi begin allow threads and pi end allow threads. So we don't realize, if you're gonna go off and do some long computation and see, if you're gonna be calculating some long Fibonacci number or you're gonna be blocking on IO or something, you're not gonna be interacting with the Python interpreter. You might as well drop the GIL. Let somebody else use it for a while and then you can go and grab it again when you need it. So that's what these macros do. And these are again, very easy to use. You can use an extension modules, they're used all over in the Python source code. Anytime you're gonna do something that might take a while, drop the GIL, do your thing, grab it again. So what's the examples of these things? If you're gonna go to sleep, obviously you don't need the GIL. If you're going to read or write from a file handle, you can drop the GIL. If you're gonna read or write from a socket, you can definitely drop the GIL because that can take who knows how long. But sharing the GIL in CPU bound code, that doesn't work. Now, the GIL didn't change, like I said, for a very long time. In 2009, David Beasley gave a presentation at a Chicago area Python user group. David Beasley's a computer researcher, he loves Python and he's very interested in the GIL, so he's played with it a lot. And he discovered some things that were kind of awful. So first of all, when you have multiple threads that are trying to use the GIL, there's an internal thing in Python. I should have put it on the screen, it's not there. It's called the sys.setcheckinterval and sys.getcheckinterval. This is a number that's set to 100. And what that says is, every 100 bytecodes, the currently running thread is supposed to say, okay, let somebody else have a turn. That doesn't work. Here is a diagram showing two threads running at the same time. They're both CPU bound. They both want to use the GIL all the time they can. And you can see that little arrow there, that's 66,000 ticks. This is 667 times longer than it was ever supposed to be. This checkerboard was supposed to be like tiny little slices and they're running for a very long time. So people aren't releasing the GIL when they're supposed to. But it gets worse if you have two threads and one of them is IO bound and the other one is CPU bound. What you want is, the IO bound thread doesn't run very often and it doesn't run for very long. It does a little work and then it waits on a socket. And then it does a little more work and it waits on a socket. You want to let that thing run whenever it's ready to run. So if you have a CPU bound thread and an IO bound thread in the same process, you want to prefer the IO bound thread. Here the CPU bound thread is getting all the time and the IO bound thread never gets to run. This is what's called priority inversion in the world of operating system schedulers. And Python had it in a really bad way. The problem is this code that's supposed to be releasing the GIL and letting other people have a chance to run. So this is it. What it does is this PyThreadStateSwap thing, that's this macro that we saw before, this PyAllowBeginThread, or whatever it was called, and thread and begin thread. So this is releasing the GIL here. Then we say, other threads can run now. And then the immediate next thing we do is acquire the lock. So what was really happening is if we have two threads and this one has the GIL right now and this one wants it, he's saying, you can have the GIL, oh, take it back. You can have the GIL, oh, take it back. This guy never gets a chance to get the GIL. And it's worse, if you CPU bound, he never has a chance. And if he's IO bound thread, he's never got a chance. So this, by the way, is still what Python 2.7 is like. This is Python 2.7. It was fixed in Python 3 in 2009 by Antoine Petroupe. He wrote what's called the new GIL, which is only a minor change to the GIL. It added a new variable called GIL drop request, which is something called the condition variable. In effect, this is a flag. If the second guy is trying to grab the GIL and he's not getting it, he sets this flag and says, look, there's somebody waiting, you really have to give it up. And then the first guy who releases it, he releases it and he looks and he sees this flag and says, oh, I really have to back off and let other people have a chance. So with this change in place, the really bad behavior that David Beasley was saying is fixed. There are still funny conditions that you can arrive at where you can still starve threads for CPU time and never getting the GIL, but it's a lot less bad now. And to fix those really funny conditions would make, would require making the GIL a lot more complicated. And what we're really talking about at this point is writing a scheduler. And I only know a little bit about operating system research, but I can tell you that writing a scheduler is very complicated. And the more fair you wanna make it, the harder problem it becomes. And this is really kind of a raffle that we don't wanna go down in the Python source tree. So the plan was, we're gonna try a GIL this way and see if it worked okay. And basically it's been fine and we just wanna leave it alone. So this is the GIL that's in three, two and above, I think. Now meanwhile, while we haven't been paying attention to all this, the world has kind of changed around it. So I remind you, Python itself was written in 1990, originally created in 1990. And multi-threading was added in 92, I think. Now you didn't see multiple cores. You didn't see multiple CPUs in a computer very often back then. You'd have to actually go and get a second CPU and have both of them on the motherboard. I had a computer like that in 1995, 96. I had two Pentium 133s in one single computer, but I was kind of a weirdo. You didn't really see it prevalent until you had multiple cores on a single CPU. And you didn't see that until about 2005. That's when server CPUs and desktop CPUs and actually the CPUs used in game consoles for the home, those all went multi-core in 2005. And that's when we started really seeing a lot of uptick in multi-core use and programming. Laptops themselves went multi-core in 2007. Tablets and phones and systems on a chip that are embedded in very small projects. Those all went multi-core in about 2011. Eyeglasses went multi-core in 2013. Watches went multi-core in 2014. Ladies and gentlemen, eyeglasses and wristwatches are now multi-core. We live in a very multi-core world. And we still don't have a solution for running Python on multiple cores simultaneously. Now, Guido has talked about this. He wrote a blog post back in 2007. He said, and he said, by the way, Python 3K, that's the old name for Python 3. He said, I'd welcome a set of patches in the Python 3K only if the performance for a single-threaded program and for a multi-threaded but IO bound program does not decrease. This is a tall order. It's been almost eight years. Nobody has done it. We're not sure how to do it. People have tried in the past. There was a famous patch back in the Python 1.4 days called the free-threading patch. What it did, it didn't require changing the CAPI, which is good. It took all of those global variables like the pointer to the current frame. It sucked them all into one single struct, excuse me, a C struct, which is roughly equivalent to a Python class. So now you could have more than one of them. So he created one per thread and now they didn't stomp on each other. And then he added a single lock, a mutual-exclusion lock, around in-craft and de-craft. So you had to be holding this lock in order to do an in-craft or a de-craft. And so you would grab it, do your plus or minus and release it. This added so much overhead that performance is terrible. It was about, it was between four and seven times slower. And this is results based on David Beasley reviving this patch like 10 years later. A couple of years ago, a guy named Antoine Petru, who's around the conference today, he tried something using atomic tests instead. So modern computers actually have machine language instruction that is a single instruction that says, subtract one from this value over here. So you remember those three instructions, load, decrement, and store. It combines with a single instruction, which it guarantees will happen atomically and you'll never have any problems of races. So that's good. It doesn't require any API changes, which is also good, but it makes it 30% slower. So again, it doesn't meet the bar that Guido was talking about. Now here's something to consider. There are four big Python interpreters in the world. There's Cpython and then there's everybody else. Cpython has a gil, everybody else doesn't. Cpython uses reference counting, everybody else doesn't. What do the other people use? They use something called garbage collection, mark and sweep garbage collection. You can, if you have a garbage collection, you don't need the gil nearly as much. It's very easy to say, okay, we can get around it. PyPy actually has a little bit of a gil that they just use during garbage collection cycles. They say, it wouldn't be that, it'd be kind of a pain to get rid of it, but it's not impossible. It's impossible to get rid of the gil as long as we're using reference counting. So people say, we should go to garbage collection with Python. The problem with that is that changes the API. We change, so in order to talk to the Python interpreter, you call the C API. In order to deal with objects in the C API, you have to understand Python's memory management approach. So C, extra programmers have to know intref and decref. If we got rid of those and changed to garbage collection, that changes the API. That breaks every extension out there. There are a lot of extensions for Python. We don't wanna break them. So I'm not sure that we can do this. Would it be any slower? Would it be faster? Conventional wisdom, or at least told to me by Michael Ford, is that garbage collection and reference counting are roughly the same. But we wouldn't actually know until we did it. And it would be a lot of work and nobody has been gutsy enough to try. Here's something else to consider, though. So all of the major languages that have come out since Python was invented, like the JVM for Java and the CLR for C sharp, Rust and Go, all these things are using garbage collection. Also, all of them don't have C APIs. Python doesn't have a C API either. Now why is that? Well, the obvious reason is that these languages are generally considered about as fast as C. So the reason for writing a C extension to speed things up, you don't need it. They all have the ability to call out into a C library. And so maybe they think that, well, you just don't need it. You can call out to a C library and you don't need the optimization of an extension. So we don't need the C API. But another thing to consider is that, like I've just said, having a C API means exposing your memory management model to the world, which means that now your hands are tied. If you change that memory management model, you break everybody's C extensions. If they don't expose the memory management model for the world, they are now free to change it internally. PyPy started out with a simple mark-and-sweep, I think, he can correct me. It started out with simple mark-and-sweep and over time it became generational and it became an incremental collector. And Armin told me that the change to incremental wouldn't have broken anybody's code. The change to generational probably would have. If they had exposed a C API, they wouldn't have been free to make those changes and PyPy wouldn't have the better memory manager that it does. So if C Python switched to garbage collection, again, this would break every C API out there, and then if we made improvements to it later, we might break everybody again. So in my opinion, I think that if we went to the route of trying to change to a garbage collection, we should probably give up on having a C API. But again, we break everybody's C extensions and we don't fix them again. Instead, we say rewrite your extensions in Python and use CFFI, which by the way is wonderful because then that lets you run on PyPy and also Jython and IronPython are talking about CFFI support. So this would mean that your extensions were now portable across Python implementations. I think for political reasons, the only thing that we could really talk about is atomic test and set. And even with a 30% speed loss, I think that the pressure for supporting multi-core is going to get big enough that we're really gonna have to look at doing this. That's it. So the coffee break has started. So if you're jonesing for a cup of coffee, you can take off. If you have questions, I'm happy to answer questions for a couple of minutes until they kick me out. Somebody is holding up their hand on the aisle in the center. Yeah? First of all, thank you very much for this awesome talk. So you say one of the main problem with a garbage collector is C extensions. But what if somehow mark the code as pure Python code that doesn't interact with the C extensions? And there you can, with those objects, you can do garbage collection. And somehow mark that the code that interacts with the C extensions, and there use Gil or atomic increments, decrements. Because like there is a lot of pure Python code out there. Okay. Why not use garbage collection there? Python code that runs in Python doesn't care about whether it has a garbage collector or whether it's reference counted. So I kind of didn't understand your question. Maybe you're gonna have to clarify it. But I will point out to you that C Python has reference counting and all the other Python implementations use garbage collection and your code still runs either way. So it doesn't matter. So I'm not sure I'm answering your question. I mean, imagine you have a pure Python program that doesn't interact with any C extensions. What prevents having a garbage collector there? Nothing. You can run it today because you can run it under Jython or IronPython or PyPy and those have pure garbage collection. They don't have... No, I mean, why not implement it within C Python? I'm sorry? Why not implement it within C Python? Why not implement in C Python? Because that breaks all of the C extensions and people use the extensions all the time. Yeah. And to provide a data point for this, can I interrupt Larry? Sure. So ResolverSystems at one point implemented something called IonClad which was an implementation of the Python C API for IonPython. And IonPython doesn't have a gil and it doesn't use reference counting. But all of the... We maintain binary compatibility with existing binaries because this was for Windows and people basically can't compile their own extensions there. And so what you have with those existing binaries is you have C objects, effectively, Python objects implemented in C that expects to use reference counting. And so we had a hybrid system. We uttered objects that were returned from the C extension. We artificially inflated their reference counting by one so that the macros compiled into the binaries would never be triggered by getting down to zero. And then we had a proxy object that if garbage collection was entered for this object then we would trigger reference counting for the... We would decrement the reference count to zero. Technically we had a leak though because if you pass in references to Python objects to the C extension, the C extension could keep references to those alive. And essentially what Larry is saying is that if you move to something like Mark and Sweep for the main Python C interpreter those C objects are opaque to the garbage collector. It can't know what internal references you have. That's kind of what you're saying, isn't it Larry? Unless you change the API. Right, so Python actually supports garbage collection. C Python does support garbage collection but it's only used in a very small way. It's used to break reference count cycles and most objects don't bother to implement it. So if you had an object you created in the C extension it might have references to other objects and you have no visibility inside of it. The world that he lived in with Iron Python was a little different where they were running inside of the CLR and the CLR knows everything. In the C runtime environment we really know nothing about the memory that's been allocated. I mean, in practice we had no memory leaks that we could find from objects, from C extensions keeping other objects alive or those objects dying when the C extension. But that was only in practice it seemed like it worked. So you kind of want your system to be theoretically tight, not just working by accident most of the time. Yes. So I guess the answer to my question is no but is it possible that two threads think that the gill is free? They claim it and then something strange happens? No, the lock objects are provided by the operating system and they're very good. So one thread grabs the lock and the operating system does a very good job of only allowing one person to have it at a time. Yeah, it's not fast. It depends on the operating system actually. Like Linux has what they call the few-text and I think Python may use a few-texts on Linux. So that is very fast. It depends on the operating system. Some locks are actually pretty slow. But we live in 2015, I think all of them are fast these days. Hi. What I think is the feasibility of having CPython clone interpreter that breaks the CAPI and you use it when the extensions have caught up. So you're talking about forking the Python interpreter and going to garbage collection for that? Yes, certainly you wouldn't, that's possible. I'll point out to you that we're in 2015 and CPython 3.0 came out in what, 2008? And there are a lot of C extensions that still haven't caught up here. So people are very slow to adjust to changes to the CAPI. And I worry that the change to garbage collection would be even worse than the change from Python 2.0 to Python 3.0. So I'm not optimistic about making a fork of Python that has pure garbage collection and everyone's saying, oh, I want to switch over to that. And it's really taking. I mean, at some point you make so many changes that it isn't recognizable Python anymore and you might as well switch to PyPy because those guys have been doing it for a long time and it kind of works mostly, okay? I think everybody wants to go to coffee. Thank you. Thank you.