Okay, let's welcome Larry Hastings, a.k.a. DS Dad, for the removal of the GIL.

Yes, this is a talk about the Gilectomy, which is a project to remove the GIL from CPython. Let me preface my comments by saying this talk is going to be exceedingly technical. I'm just going to go right into the heart of the matter, so it's kind of designed for people who are already core developers, who are familiar with the internals of CPython. I'm hoping that you understand multithreading pretty well, and that you have at least a vague understanding of how CPython works internally: the concept of objects, and the reference counts on objects. If you don't understand this stuff, a good thing to do would be to watch my talk from last year. I didn't give it here, but I gave it at other... no, I did give it here, actually. It's called "Python's Infamous GIL". It's on YouTube, and it'd be really good if you could go back in time and watch it before you came in the door.

Anyway, let's talk about the GIL. The GIL was added in 1992, and barring the addition of a condition variable to enforce fairness, it has remained essentially unchanged in the 24 years since then. I want to make it clear: the GIL is a wonderful design. It really solved the problem. It was a fabulous design for 1992, and it still holds up today in a lot of ways, but there are some ramifications of this design. First of all, the GIL is very simple, so it's really easy to get right. C extension developers don't have any trouble understanding how to use the GIL, and internally we never have any problem with owning the GIL, or not owning it, when we're supposed to or not supposed to. Since there's only one GIL, we can't have deadlocks: you can't have a deadlock with only one lock. And since only a single thread ever runs at a time, there is almost no locking overhead; we only take and release the GIL when we switch threads. So your code goes really fast; the GIL adds very little overhead to your code. If you're single-threaded, your code is going to run really fast. This is a really great design for single-threaded code. If you're IO-bound and multi-threaded, this works great, and that was actually the original design goal for threading, back when almost all computers had a single processor. The problem is when you have a CPU-bound program and you want to run on multiple cores simultaneously, because you just can't. And that is the pain point of the GIL.

So, again, in 1992 all the computers around us were single-core, even the big servers, but the world has changed since 1992. These days we have these wonderful laptops, which are multi-core. Even our phones are multi-core, and our wristwatches and our eyeglasses have all gone multi-core. I have a workstation at home that has 28 cores in it; if you count hyperthreading, it has 56 cores. We live in a deeply multi-core world, and Python is kind of ill-prepared to take advantage of that. I want to point out that this comment is still in the CPython source code today: "CPython has only rudimentary thread support." I suggest that maybe it's time to consider adding more sophisticated threading support to CPython. After all, the goal of a programming language really should be to expose all of the various things your computer can do, to let you take advantage of all of the different resources your computer offers.
And Python can use all of them, except for the multiple cores that you have. So it's kind of a sore point.

Now, there was an attempt back in the 90s, this thing called the free threading patch. This was an attempt, done against Python 1.4 in 1996, to get rid of the GIL. It didn't require changing the API, so it didn't break C extensions, which is a good design. What it did is it moved most of the global variables inside the interpreter into a single structure, and added a single mutex lock around incref and decref. I believe there was a Windows variant of it at the time that used InterlockedIncrement and InterlockedDecrement, which are Win32 APIs equivalent to atomic increment and decrement. But the single mutex lock was a little on the slow side: your program would run between four and seven times slower. Which, let's be clear: what everyone wants, the reason they want to get rid of the GIL, is to use multiple cores so their programs go faster. So when I say "oh, we removed the GIL and it goes slower," nobody's excited. This was not a very exciting patch at the time. If you want to read more about it, there was a lovely blog post by David Beazley a couple of years ago, where he got it running on modern hardware; it's called "An Inside Look at the GIL Removal Patch of Lore". And I looked at that, too.

But let's talk about what I'm doing now. So, the Gilectomy. I have a plan to remove the GIL, and actually what I should say is that I have removed the GIL. I removed the GIL back in April. The problem is that it's terribly slow. But in order to remove the GIL, you need to have a plan in place. There are a bunch of considerations you must account for in order to remove the GIL in a way where the project can be successful and maybe get merged or used by people someday. I say there are four technical considerations you must address when you are going to remove the GIL.

The first is reference counting. Again, every object in the CPython runtime has a reference count that tracks the object's lifetime, and this is traditionally kind of unfriendly to multithreaded approaches. The second is global and static variables. There aren't nearly as many as I thought there were, but there are a couple. There's some per-thread information, which I think all lives in one place now. And there are a bunch of shared singleton objects, like all the small integers (negative five through 256, or something like that), and None, True, False, and the empty tuple. Python creates one of each of these, and every time you use an empty tuple it's the same empty tuple everywhere, because it's immutable. Third, you need to address C extensions. C extensions currently run in this wonderful world where they don't have to worry about locking, because the GIL protects them. They've never run in a world where they can be called from multiple threads at the same time; they've certainly never run in a world where multiple threads could be running in the same function at the same time. There's a lot of code out there that depends on only a single thread running in a function, like "if static thing is NULL, then initialize static thing." All that sort of code is just going to break when we go multi-core. And finally, we need to worry about the atomicity of operations in Python. The developers of the other Python implementations, PyPy, and more strongly IronPython and Jython, discovered that a lot of Python code implicitly expects a lot of operations to be atomic in CPython.
If you append to a list, or if you set a value in a dict, another thread could be examining that object, and it must not see that dict or that list in an incomplete state. It needs to see it either before the append has happened or after the append has happened. So we need to guarantee that atomicity of operations: you can never see an object in an incomplete state from another thread.

But in addition to these four technical considerations, I say there are three political considerations we must address, because this isn't simply a technical problem. There's a whole world of people using CPython out there, and there are demands that are going to be made on removing the GIL that are not strictly technical demands. I say these are the following. First, we must not hurt single-threaded performance. This was actually something Guido established in a blog post, which I'll talk about in a minute. We must not make single-threaded code slower, and we must not make multi-threaded IO-bound code slower. That's a very high bar to meet. Second, we must not break C extensions. This one is sort of my statement. Python 3.0 broke every C extension out there, and it's been however many years since Python 3.0 came out, and there are still plenty of extensions that haven't upgraded to the new extension API. We need to try to avoid breaking C extensions as much as possible. And finally, don't make it too complicated. Of course, this is a judgment call, but one of the things that's really lovely about CPython is that it's pretty easy to work on. Internally it's not all that complicated; it's conceptually very simple, and the code is very clean. It would be a shame if we broke that feature of the CPython source code in order to get rid of the GIL, so let's try to preserve it.

Now, there are a couple of approaches that people have talked about, ways to get rid of the GIL, that I don't think will work, and just to set the stage, I want to talk about those for a minute or two. There's what I call tracing garbage collection; this is also called mark-and-sweep garbage collection. This would let us get rid of reference counting, and again, reference counting is traditionally very difficult to do in a multi-threaded environment, so this would be very favorable to multithreading. With tracing garbage collection, it's not clear whether it would be faster or slower than reference counting. Conventional wisdom says that garbage collection and good reference counting implementations are about the same speed, and then people like to argue, but that's the internet for you. Where this falls down is that it's going to break every C extension out there. It's a very different world, going to pure garbage collection as opposed to reference counting, and C extensions are just not going to work anymore. So that breaks every C extension; you just can't afford that politically. And it would also be very complicated. It's a much more complicated API than reference counting. Reference counting is a relatively simple API, and still people mess it up. It can be a little obscure at times; it can be a little hard to figure out exactly what the right thing to do is with reference counting. Garbage collection, I think, is going to be that much worse. Even worse on those criteria than tracing garbage collection is what's called software transactional memory.
Armin Rigo, who just showed up today, has been working on software transactional memory as a research project with the PyPy interpreter for a couple of years now, and it sounds like a fantastic technology. Is it going to be fast enough? Yes, absolutely. If software transactional memory works, it's going to be really fast. It's going to be really great. It's going to take wonderful advantage of multiple cores, and you're going to have very little locking involved. But it really falls down on the other two considerations. It's going to break every C extension out there horribly, and it's going to be incredibly complicated internally. There's a lot of research-quality stuff right now, and it's not clear to anybody when it's going to be ready for production, and I don't think CPython is able to wait. So let's move on.

Let's talk about my proposal and its specifics. Again, I said there were four technical considerations. The first is reference counting. What I say is: we keep reference counting. That way we don't break C extensions; it's going to be the same API we have now, Py_INCREF and Py_DECREF. The important thing is that the compile-time C API does not change. Now, like I said, I got rid of the GIL in April, and what I did is I switched to what's called atomic incr and decr. This is where the CPU itself provides you with an instruction that says "I can add one or subtract one at this memory address, in such a way that it's not possible to have a race condition with another core." Works great. Costs us 30% of our speed right off the top. So this is working, and it means our programs are correct, but it's awfully slow, and we're going to look for alternate approaches here.

Global and static variables: we handle those on a case-by-case basis. Again, all the per-thread stuff has already been moved into PyThreadState for me; I guess that was done a couple of years ago and I hadn't noticed, so that's ready to go. Shared singletons just remain shared. All those shared objects, like the small integers and None, and True and False, just get shared between threads, because the whole point of getting rid of the GIL and running multiple threads is that Python programs don't change. C extension parallelism and reentrancy: there's just nothing for it. They're going to be running in a multi-core world, they're going to be called from multiple threads simultaneously, and they just need to get with the program. So it's going to break C extensions all over the place.

Atomicity of operations: we're just going to add a whole bunch of locks. Every object in CPython that is mutable will have a lock on it, and it will have to be locked while you're performing the mutating operation. So this adds a new locking API to CPython. There are going to be macros, PyLock and PyUnlock. These call through to the type, so PyTypeObject is going to sprout two new members, tp_lock and tp_unlock, which I'm guessing will be exposed to Python programs as __lock__ and __unlock__. These functions take only one parameter, the object to lock or unlock, and they return void, because they always work. For objects that are immutable, my claim is that tp_lock and tp_unlock can be NULL. Either you support locking or you don't, and if you don't support locking, you don't even need the functions; you can just skip them.
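To make that shape concrete, here is a minimal, compilable sketch of what such an API could look like. The names PyLock, PyUnlock, tp_lock, and tp_unlock follow the talk's description, but the exact spellings, the toy structs, and the use of a pthread mutex as the per-object lock are assumptions for illustration, not the real Gilectomy code:

```c
/* Toy model of the proposed per-object locking API. */
#include <pthread.h>
#include <stdio.h>

typedef struct object object;

typedef struct {
    void (*tp_lock)(object *);      /* NULL => immutable, nothing to lock */
    void (*tp_unlock)(object *);
} typeobject;

struct object {
    typeobject *ob_type;
    pthread_mutex_t mutex;          /* stand-in for a lighter user-space lock */
};

/* One parameter, returns void, "because they always work": */
#define PyLock(op)   do { if ((op)->ob_type->tp_lock)   (op)->ob_type->tp_lock(op);   } while (0)
#define PyUnlock(op) do { if ((op)->ob_type->tp_unlock) (op)->ob_type->tp_unlock(op); } while (0)

static void list_lock(object *op)   { pthread_mutex_lock(&op->mutex); }
static void list_unlock(object *op) { pthread_mutex_unlock(&op->mutex); }

static typeobject list_type = { list_lock, list_unlock };
static typeobject str_type  = { NULL, NULL };   /* immutable: skips the functions */

int main(void) {
    object mylist = { &list_type, PTHREAD_MUTEX_INITIALIZER };
    object mystr  = { &str_type,  PTHREAD_MUTEX_INITIALIZER };

    PyLock(&mylist);        /* takes the per-object lock via tp_lock */
    /* ... mutate the list here ... */
    PyUnlock(&mylist);

    PyLock(&mystr);         /* no-op: tp_lock is NULL */
    puts("ok");
    return 0;
}
```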
So what objects need locking? All mutable objects, and when I say all mutable objects, I mean C-mutable, not just Python-mutable. For example, consider the str object. From the Python perspective, strs are immutable, right? But internally, they have a couple of lazily computed fields, like the hash. The hash is initially set to negative one, which, by the way, if you ever looked at the hash function, it says it will never return negative one. Negative one means "uninitialized" internally; that's why negative one is an illegal hash value in Python. So it's initialized to negative one, and then the first time somebody says "give me the hash of this str object," it goes and computes it, stores it, and returns it. That's mutable state. Now, in the case of the hash, that's harmless. If we had two threads, and they both saw the negative one, they'd both compute the hash and both store it, but they'd be storing the same number, so that's harmless. But there are two more fields, utf8 and wstr, and both of those are also lazily computed. These are conversions to UTF-8 and to wide characters, respectively, and those allocate memory. If there were a race where two threads both saw NULL, they'd both go off and allocate memory, and one would overwrite the other, and you're going to leak memory at that point. So we're going to have to put a lock around those. The str object is currently not safe, and I haven't dealt with it yet, so right now we can leak memory inside of CPython. That's terrible.
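Here is a small compilable toy showing the difference between the two races just described. The field names echo the real ones (hash, utf8), but this is a simplified model with my own struct and hash function, not CPython's actual unicode object:

```c
/* Why one lazy field has a benign race and the other leaks. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *data;
    long hash;      /* -1 means "not computed yet", so -1 is never a valid hash */
    char *utf8;     /* NULL means "not computed yet" */
} toy_str;

static long compute_hash(const char *s) {
    long h = 0;
    while (*s) h = h * 31 + *s++;
    return h == -1 ? -2 : h;    /* mirror the "never return -1" rule */
}

long toy_str_hash(toy_str *s) {
    if (s->hash == -1)
        s->hash = compute_hash(s->data);
    /* Benign race: two threads may both compute and store the hash,
     * but they store the same value, so nothing is lost. */
    return s->hash;
}

const char *toy_str_utf8(toy_str *s) {
    if (s->utf8 == NULL) {
        /* Harmful race: two threads can both see NULL, both allocate,
         * and one allocation gets overwritten and leaked. Needs a lock. */
        s->utf8 = strdup(s->data);
    }
    return s->utf8;
}

int main(void) {
    toy_str s = { "hello", -1, NULL };
    printf("%ld %s\n", toy_str_hash(&s), toy_str_utf8(&s));
    return 0;
}
```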
So every object is going to be locked in the Gilectomy, which means we have to have as light a lock as possible. I would call this a user-space lock. Under Linux, we have this wonderful thing called the futex, where you can literally declare any four-byte-aligned memory address to be a lock and wait on it. It's really more of a building block for writing your own mutexes and other synchronization objects. It's really great. Windows for 20 years has had what they call the critical section, which is user-space-only until there is contention. And OS X has pthread_mutex; a couple of people now have told me that pthread_mutex is guaranteed to be user-space-only until there is contention. So we have the user-space locks that we need for all the major platforms. I don't know about the other platforms, Solaris and FreeBSD and all those sorts of things; somebody else is going to do that work. But all the major platforms that Python runs on will have the support for user-space locks that we need. Or maybe the others don't get a no-GIL Python. We'll see.

Now, as for the political considerations: for my approach with the Gilectomy, I would say that, yes, it's not going to be any slower, and yes, it's not going to break C extensions. Now, this may sound crazy, because I just told you a couple of minutes ago that I was going to break every C extension out there because of the atomicity of operations, and that I'm making it 30% slower by adding atomic incr and decr for reference counting. So how can those two statements be true at the same time? The answer is that we have to have two builds. We would have Python built with the GIL and without the GIL. You build it with the GIL and everything is the same as it is today, and that way all the C extensions continue to work. That would be the default build for everybody on every platform.

And then if you're some sort of futuristic person who wants to live in the multi-core world, you can build Python in the special no-GIL version, at which point PyLock and PyUnlock start to work. So these macros, Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS, and PyLock and PyUnlock, would either be no-ops or active depending on which build you were in. If you have a GIL, then Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS do something, and PyLock and PyUnlock are no-ops. If you don't have a GIL, then lock and unlock do something, and Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS are probably no-ops, although I may hide some work in there. This also means that a C extension will never accidentally run with the wrong build, because we can have different entry points for each one. With a GIL, if you have a module just called module, then you have an entry point called PyInit_module. We could say, okay, if you run in the no-GIL build, then we're going to have a different entry point with a "nogil" in front, or something, just to make them two different entry points. That way, no C extension will ever run in a no-GIL build accidentally, and it's strictly opt-in: no C extension will run in a no-GIL build until its authors declare themselves ready by providing a no-GIL entry point. You might actually be able to build a single extension that worked in both, by the way. Where things are macros, we could add actual C functions for them, and if you were very careful, you might be able to write a single .so that supported both modes. I don't know if that's interesting or not; it's just something that I'm mentioning.

As long as we're effectively declaring a new C API, because that's really what this is at this point, it's kind of a new C API. It looks very similar to the existing C API, but the reference counting works a little differently, the atomicity of operations means you have to have locking all over the place, and you can't assume you're only ever running on a single thread at a time. So this might be a good time to start inflicting some best practices on people that are currently optional. It's actually true in CPython that you can declare your own type statically, create an object with it, and pass it into the CPython runtime, and even though CPython has never seen this object or this type before, it has to work. We could stop allowing that. There's a function you're supposed to call, PyType_Ready, that's optional, and we could say, okay, now it's required. By the same token, there's a new PEP, PEP 489, this thing called multi-phase C extension initialization. I don't really understand it, but I'm told it's very relevant. So this might be a good time to say all these things that used to be optional are now required: if you're going to run in a no-GIL build, you have to call PyType_Ready, you have to use PEP 489, you probably have to use the limited C API, all those sorts of things.

Now, where this Gilectomy idea falls down is on "don't make it too complicated." It is getting a little complicated, because we're effectively talking about two different builds living at the same time in the same source base. So a CPython core developer would have to read the code and say: oh, PyLock, that's only active in the no-GIL build; oh, Py_BEGIN_ALLOW_THREADS, that's only active in the with-GIL build.
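As a sketch of how one source base could serve both builds, here is one way the macros might be wired up. The configure flag Py_NOGIL, the helpers _PyObject_Lock/_PyObject_Unlock, and the entry-point prefix are all hypothetical placeholders; only the with-GIL expansion of Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS reflects what CPython actually does today:

```c
/* One source base, two builds, selected by a hypothetical Py_NOGIL flag. */
#ifdef Py_NOGIL
  /* No-GIL build: per-object locking is real; the GIL dance is a no-op. */
  #define PyLock(op)               _PyObject_Lock(op)
  #define PyUnlock(op)             _PyObject_Unlock(op)
  #define Py_BEGIN_ALLOW_THREADS   {
  #define Py_END_ALLOW_THREADS     }
#else
  /* With-GIL build: today's behavior; the new locking macros vanish. */
  #define PyLock(op)               ((void)0)
  #define PyUnlock(op)             ((void)0)
  #define Py_BEGIN_ALLOW_THREADS   { PyThreadState *_save = PyEval_SaveThread();
  #define Py_END_ALLOW_THREADS     PyEval_RestoreThread(_save); }
#endif

/* Different entry points per build, so an extension compiled for one build
 * can never accidentally load into the other (prefix is a placeholder): */
#ifdef Py_NOGIL
  #define PyMODINIT_ENTRY(name)    PyInit_nogil_##name
#else
  #define PyMODINIT_ENTRY(name)    PyInit_##name
#endif
```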
And so they're going to have to read every bit of code twice, in a sense, to see how it's going to behave with-GIL and without-GIL. You're also going to have to be very careful about where you lock. But ultimately, this is the price we're going to have to pay in order to get rid of the GIL; I don't see any simpler way of doing it.

This is something I've been working on for a couple of months; I think I started in February. Initially I was calling the whole thing Confuse-a-Cat, just to pick something from Monty Python, but then the name Gilectomy came up, and, well, that was it. That was the name. Now, as I mentioned, back in 2007 Guido wrote a blog post called "It isn't Easy to Remove the GIL", where he talks about what would have to happen in order to remove the GIL. And I agree with everything in that post; it's really insightful. Except for the title. It turns out, if you know where to start, you can remove the GIL in about a week. Here's how.

Step one, or step zero, really: atomic incr and decr. You switch Py_INCREF and Py_DECREF to use atomic increment and decrement. I only support 64-bit Linux right now, so I just went to GCC and used the intrinsics; so I only support GCC right now. Step one, you have to pick what kind of lock you're going to use. Again, on Linux, I'm using futex-based locks. There's a paper by Ulrich Drepper called "Futexes Are Tricky", where he walks you through how to write a mutex based on futexes, and I'm basically using his design. (There's a sketch of these first two steps after this list.) Step two, you need to throw locks around the entire dict object. You cannot run a CPython interpreter without a working dict, so the dict object needs to be safe. You just go through every external entry point, and if someone is calling into the dict object from outside, you make sure it's locked properly and unlocked properly. Step three, the same thing with the list. Again, CPython uses dicts and lists internally for a lot of operations, and you just can't have a working interpreter unless you've got both of those working. Step four, there are about ten free lists inside of CPython, where when you allocate an object, it looks to see if there's a free one waiting; if there is, it just uses that, and if there isn't, it has to go off to the allocator. The free lists make things go a little bit faster, but obviously they're not thread-safe yet, so you need to add a lock around them. You need to do that about ten times. Step five, you need to disable the garbage collector and GC track and untrack. The garbage collector is just completely broken in the Gilectomy right now; it's going to be quite a while before we get that working again. Which, by the way, makes my numbers look a little better than they really should, because there should be some garbage collection overhead that I don't have. The garbage collector is totally broken in the Gilectomy, and it's just completely shut off right now. Step six, you need to actually murder the GIL. This was a pleasure when I got to do that part. The GIL is just a structure; you just don't allocate that variable anymore. You take all of the places that acquire and release the GIL and stub them out or comment them out or whatever, and they all go away. Step seven: internally, CPython has a thread state that's stored in a global variable, and everyone just refers to that. Whatever thread you're on, you look in the same spot, and that's always the information about the current thread. And obviously, you can't do that anymore if you're running multiple threads simultaneously. So instead, every time people refer to that, they're actually going through a macro; you just need to change the macro so that it pulls that thread state variable out of thread-local storage. That was actually pretty easy to get working. And finally, you need to fix some tests. Specifically, there were only a couple of tests that really broke when I did this. Mainly, they were sensitive to testing exactly how big the dict object and the list object were, and now that I'd added this lock to them, they had gotten a little bit bigger, and I just needed to fix those. And actually, the entire Python regression test suite started to work, apart from the stuff that was actually using threads, and there were a couple of those.
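Here is the promised sketch of steps zero and one: an atomic refcount bump via GCC's atomic builtins, plus a futex-based mutex loosely following the design in Drepper's "Futexes Are Tricky". Linux- and GCC-only, and all the names are illustrative, not the Gilectomy's:

```c
/* Linux/GCC-only sketch of steps 0 and 1. */
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Step 0: reference counting via atomic increment/decrement. */
typedef struct { int64_t ob_refcnt; /* ... */ } obj;
#define SKETCH_INCREF(op) __atomic_add_fetch(&(op)->ob_refcnt, 1, __ATOMIC_SEQ_CST)
#define SKETCH_DECREF(op) __atomic_sub_fetch(&(op)->ob_refcnt, 1, __ATOMIC_SEQ_CST)

/* Step 1: a futex-based mutex.
 * State: 0 = unlocked, 1 = locked, 2 = locked with possible waiters. */
typedef struct { int state; } ftx_mutex;

static long sys_futex(int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void ftx_lock(ftx_mutex *m) {
    int expected = 0;
    if (__atomic_compare_exchange_n(&m->state, &expected, 1, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;                                 /* fast path: no contention, no syscall */
    /* Slow path: mark contended and sleep until the holder wakes us. */
    while (__atomic_exchange_n(&m->state, 2, __ATOMIC_ACQUIRE) != 0)
        sys_futex(&m->state, FUTEX_WAIT, 2);    /* sleeps only while state == 2 */
}

static void ftx_unlock(ftx_mutex *m) {
    if (__atomic_exchange_n(&m->state, 0, __ATOMIC_RELEASE) == 2)
        sys_futex(&m->state, FUTEX_WAKE, 1);    /* someone may be sleeping: wake one */
}
```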
In fact, at the language summit, I announced that it was about 3.5 times slower by wall time and about 25 times slower by CPU time. What I mean by that is: I run a test, the same way every time. I run seven threads, all running the same program, and I time it. I did it with normal CPython, and I did it with the GIL removed, Gilectomy CPython. When I did that with seven cores, it was 3.5 times slower if you just watch the clock on the wall. But count up how much CPU time was used: I was using seven cores, as opposed to normal CPython using one core, so you multiply that 3.5 by seven and you get about 25. It's 25 times slower to do the same amount of work, which is kind of depressing.

This is the official benchmark of the Gilectomy; this is what everyone has been running. It's a really bad Fibonacci generator. I'm showing you this just to impress on you how horrible the benchmarking is so far, how little code I can run through a multi-core CPython right now. But this does work, and I can run it on multiple cores simultaneously. It's not exercising very much code inside of CPython. It's looking up the fib function over and over and over in the module dict, so there's a single dict that's just getting slammed with lookup requests, and since it's locked, that means there's some contention around that lock. We're performing function calls, which have always been a pretty heavyweight operation in CPython. We're running a little bit of bytecode. We're touching the small integers, like 2 and 1 and 0, and actually all the small integers, because of the way fib works, you use all those small integers a whole lot. And again, the small integers are shared between threads, and they all have reference counts, which means we're changing those reference counts constantly from multiple threads, which is costing us a lot of performance, it turns out. And we're doing a little bit of math, and the math really isn't hurting us at all.

So this is what it looks like. I got some flak for not labeling my axes, so there, I've labeled my axes. The vertical axis is time in seconds, the horizontal is the number of cores being used, and this is GIL versus Gilectomy. Having the GIL is the blue line; it's way faster to have the GIL right now. The Gilectomy curve seems to be flattening off, so at some point adding a core might not make it that much slower, but that's going to be way, way out. There's also this dip around four cores. I don't know why it's there; I think it's just the way that the tests interleaved.
I would say ignore it, assume it's not there, but I had to show it because that's what my data actually showed. But more interesting: again, that was wall time, and I think CPU time is more interesting. So, the amount of CPU time it took to compute these seven Fibonacci numbers, it was fib(30), I think: with normal CPython it's next to nothing, and when you compare that with running it under the Gilectomy, it just goes way up. So obviously it's incredibly slower. How much slower? This is a graph of how many times slower it is, per core, comparing normal CPython to the Gilectomy version. And again, there's this dip around four; I would say ignore it. What this is telling us is that it's about twice as slow with one core, then it shoots up to about 10 times slower with two cores, and then it just keeps going up and up and up; I think seven cores is about 19 times slower here.

So why is it so slow? After all, the Gilectomy isn't changing that much code, or at least not yet. The first thing I would say is that I don't know for certain; it's kind of hard to measure. At the sprints at PyCon a couple of months ago, early June at that point, there were some Intel guys who hung out with me, and they ran it under VTune, and they kind of confirmed some suspicions here. The second thing is actually lock contention, which is what everyone was probably assuming was number one, but it's actually number two. Number one is synchronization and cache misses. That is what's really slamming the Gilectomy. Something to consider is that nothing inside of CPython is private. In a normal multi-core program, you might design it around being multi-core: okay, here's this thing that's thread-local to this thread, and that one's thread-local to that one. There's almost nothing in CPython that's thread-local. Everything is shared across all cores all the time, all the cores want to talk to everything simultaneously, and that's the fundamental thing that's killing performance: we really don't have any thread-specific data.

So let's talk for a minute about why things are slow and fast. Oh, that disappeared. Okay, so this is cache. Your computers at this point have three levels of cache between them and the RAM they're talking to. If it costs 1x to talk to level 1 cache, then level 2 cache is about 2 times as slow, level 3 cache is about 10 times as slow, and talking to RAM itself is about 15 times slower. So you want to be talking to cache. CPUs are so fast that normal slow RAM can't keep up with them anymore, so we have all this elaborate caching in between. If we can keep the cache fed, we can keep the CPU fed and keep your program running. The moment we break the cache, we start slowing your program down a great deal. And that's really what's going on in the Gilectomy: the cache never gets to warm up.

Just as an example (these are all new slides I made this morning): we have a program, we've got four cores, zero, one, two, and three, and we have the number two. We're running the Gilectomy version of CPython, and we're running our Fibonacci benchmark, which uses the number two a whole lot. So all of the cores currently have the number two in cache. If they want to look at the number two, they can just look at it; they've already got it, they don't have to wait.
But then let's say one of these cores actually does something with the number two, so it Py_INCREFs the number. It's going to change the reference count: core one is incrementing it. That means the number two has changed. That memory has changed, which means the hardware must now invalidate that cache line for all the other cores. A cache line is 64 bytes, which is more than enough to cover the entire long object, so now none of the other cores have that number in cache anymore. The next time they want to talk to the number two, they have to go load it. Armin tells me a core can actually talk to the other core and maybe pull the line from it, but it's still a lot slower than simply having it in cache ready to go. And this is happening constantly. Any time you examine an object in CPython, you change its reference count. Any time you change its reference count, you change the memory; any time you change the memory, you invalidate the cache for all the other cores. Which means the more cores you add, the slower you go. And that's what I'm observing in my numbers.

So there is a solution for this, or at least a combination of approaches for a solution. There is a technique called buffered reference counting; we're going to use this in combination with something else. This is how it works, conceptually. These blue boxes at the bottom are supposed to be cores, and the lighter blue box with the O is representing an object O. Right now, all the cores talk to O directly: if you want to examine an object, you increment its reference count, and to do that you just reach into the object and change the number. That means we have to synchronize that across cores, so we're using this atomic incr and decr, which is slow. We'd like to do something a little bit faster. So: if we could change it so that all changes to reference counts were made from a single thread, then we wouldn't have to use atomic incr and decr anymore. We could just use what I would call unsynchronized incr and decr; it'd be a lot faster. And we can do that. All we do is change it so that instead of writing the reference count directly, we write into a log, a big memory buffer that just accumulates reference count changes. Every time you want to change the reference count on an object, you don't change it directly. Instead, you write into the log: "add one to the reference count of O." You just write that into the log and you don't worry about it. Meanwhile, there's this other thread, this fourth blue box where I wrote "commit". That's the commit thread; that's the guy who actually makes the reference count changes. He walks the log and sees "oh, I should add one to the reference count of O," and he just goes and does it. But he's the only thread making reference count changes, so he can use unsynchronized incr and decr. That's great. The problem is, all we've done is move the contention. Now, instead of having contention around the reference counts, we have contention around this log, so we need to lock and unlock the log. We really haven't solved any problems. But we can fix that. Let's go to a single log per thread. Now, when thread zero wants to increment the reference count on O, it writes into its own reference count log, and then the commit thread comes along and makes that change.
Now we have a single log per thread, and we have a single thread making the changes, so there's hardly any synchronization overhead at all. We need a little bit when we swap these buffers around. That's great. But now we have an ordering problem. Let's say our object O is stored in a list, and this is the only place where it's stored. All the reference counts have settled out, so O has a reference count of one right now, and that one reference is the list L holding onto the object. Thread one comes along and says, "I'm going to iterate over the list and just print everything in it." And then thread zero comes along later and says, "I'm going to clear list L." This means the reference count log for thread one gets an incref and then a decref, and then, later, the reference count log for thread zero gets just a decref. The problem is: what if we process the log for thread zero before the log for thread one? We're going to decrement the reference count. I already told you the reference count was one, so it drops to zero, we deallocate the object, and then when we process the log for thread one later, we're going to explode: by then that memory might be a different object. Some crazy things are going to happen. It's not a good idea.

We can solve that, actually. By the way, if you were thinking "what if you just swap those and process zero in front of one?", that's not a general solution, because you could have a mirrored version of this across two threads: two lists, two objects, each thread iterates over one of the lists and then clears the other one. You can't solve that by reordering the logs. The answer is: consider any two adjacent reference count operations, where each one is either an incref or a decref. Can you swap them? In almost every case, you can. If you have two increfs, you can swap them; that's harmless. If you have a decref followed by an incref, you can swap them; that's harmless. The only time you have a problem is an incref followed by a decref: swap those and you might have an incorrect program. With this observation, we don't have to preserve strict ordering of increfs and decrefs, so we can do this buffered reference counting a lot more cheaply by just having two different logs for each thread: an incref log and a decref log. All we need to do is be very careful to process all of the increfs before we process any of the decrefs, and now our programs run correctly and we have almost no locking. That solves the synchronization problem around reference counting.
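Here is a compilable toy of buffered reference counting with per-thread incref and decref logs, drained by one commit thread that applies all increfs before any decrefs. All the names are illustrative (and the "threads" here just fill their logs single-threaded for brevity); this is not the Gilectomy's actual code:

```c
/* Toy buffered reference counting: per-thread logs, one committing thread. */
#include <stdio.h>

#define LOG_CAP 1024
#define NTHREADS 2

typedef struct { long ob_refcnt; } obj;

typedef struct {
    obj *incs[LOG_CAP]; int n_incs;   /* pending increfs from one thread */
    obj *decs[LOG_CAP]; int n_decs;   /* pending decrefs from one thread */
} refcnt_log;

static refcnt_log logs[NTHREADS];

/* Worker threads only append to their own log: no sharing, no locking. */
static void log_incref(int tid, obj *o) { logs[tid].incs[logs[tid].n_incs++] = o; }
static void log_decref(int tid, obj *o) { logs[tid].decs[logs[tid].n_decs++] = o; }

static void dealloc(obj *o) { printf("dealloc %p\n", (void *)o); }

/* The commit thread is the only mutator of ob_refcnt, so plain
 * unsynchronized increment/decrement suffices. Applying every incref
 * before any decref keeps programs correct even without strict ordering. */
static void commit(void) {
    for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < logs[t].n_incs; i++)
            logs[t].incs[i]->ob_refcnt++;
    for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < logs[t].n_decs; i++)
            if (--logs[t].decs[i]->ob_refcnt == 0)
                dealloc(logs[t].decs[i]);
}

int main(void) {
    obj o = { 1 };          /* one reference: a list L holds O */
    log_incref(1, &o);      /* thread 1 iterates over L */
    log_decref(1, &o);
    log_decref(0, &o);      /* thread 0 clears L */
    commit();               /* increfs first: refcount never hits 0 early */
    printf("refcnt = %ld\n", o.ob_refcnt);   /* 0: deallocated exactly once */
    return 0;
}
```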
We still have the problem of invalidating cache lines, but we can solve that, too. There is a technique, and Thomas Wouters has actually got this working in the Gilectomy; it's not ready yet, I think, and he was taking kind of a different approach. He had this idea of having a separate reference count for every object for every thread, so there would be no contention. I'm not optimistic that that's going to work in the long term, but the same machinery works for buffered reference counting. What we do is take the object O and break it into two pieces: we keep the reference count separate from the object, and we push them far apart in memory so they're not next to each other. Now the reference count lives on a different cache line than the object. If we combine that with buffered reference counting, we have a single thread committing the changes, and it's making those changes to memory that is far removed from the object itself, which means we're not invalidating the object's cache line. At that point, I'm pretty optimistic that we can get a lot of this performance back. So, remote object headers: Thomas said he had those working, so I'm optimistic that'll work when it comes time to integrate it. I've been trying to do buffered reference counting, and fundamentally, it turns out, CPython is allergic to reference counts not being accurate in real time. So it doesn't work right now, and I'm going to have to put my head down and debug it for a week, and I just haven't had the week to spare recently. But I'm pretty optimistic that the Gilectomy is going to get a lot faster.

So where do we go after that? There's an idea to make objects immortal, or specifically to make reference counts immortal. If an object has an immortal reference count, then we're not changing its memory, which means we're not invalidating cache lines. That could make things faster. Unfortunately, it adds an if statement to basically every incref and decref, so whether it's a win is hard to tell without doing the experiment. Then there's thread-private locking. The idea here is that most objects never escape the thread in which they were created. If you create a dict and you only ever use that dict on the current thread, then you really don't need the expensive locking operations around it; it's only when an object has ever been used by a different thread that you'd have to really lock and unlock it. If we could lock objects in such a way that the locking was basically free while they're thread-local, that would be a big win, and I have an idea for how I think I can get that to work.

I'm going to have to deal with garbage collection someday in the Gilectomy branch, but again, that's going to be quite a ways away. For this to be code that people can depend on, CPython is going to have to support garbage collection. I think there are a bunch of techniques for garbage collection that support lockless concurrent access; it's super advanced stuff, and I completely don't understand it. Current CPython garbage collection is basically stop-the-world garbage collection, and that seems acceptable, and I think I can get that to work. The initial approach is going to be stop-the-world, and then, if we get this all to work and the Gilectomy branch is actually a viable thing, the super-brainy technologists can come along and fix my garbage collection. One idea, by the way, for making garbage collection less expensive: we could do the same thing we did with buffered reference counting and have buffered tracking and untracking of objects. "Track this object, untrack this object": write it down in a buffer, and have a commit thread commit them later. Finally, Eric Snow, I think it was, suggested that as a way of mitigating the breakage around C extensions, we could have the ability to auto-lock C extensions. When a thread called into a C extension, there would be an implicit lock that would prevent more than one thread from running inside that C extension at a time. That could probably get a lot of C extensions up and running very quickly. Again, it's going to be way far down the line before we're ready to look at things like that.
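Since immortal reference counts are easy to picture, here is a tiny sketch of that idea from the list above, including the extra branch it puts on every incref and decref. The magic value and all the names are purely illustrative, not from the Gilectomy:

```c
/* Sketch of "immortal" reference counts: objects whose refcount equals a
 * magic value are never incref'd or decref'd, so their cache lines are
 * never dirtied and every core can keep them cached indefinitely. */
#include <stdint.h>

#define IMMORTAL_REFCNT INT64_MAX

typedef struct { int64_t ob_refcnt; } obj;

static inline void sketch_incref(obj *op) {
    if (op->ob_refcnt == IMMORTAL_REFCNT)   /* the extra test that now sits */
        return;                             /* on every single incref */
    op->ob_refcnt++;
}

static inline void sketch_decref(obj *op) {
    if (op->ob_refcnt == IMMORTAL_REFCNT)   /* ... and on every decref */
        return;
    if (--op->ob_refcnt == 0) {
        /* dealloc(op); */
    }
}

/* Shared singletons like None, True, False, and the small ints would be
 * born immortal: no thread ever writes their memory again. */
static obj small_int_two = { IMMORTAL_REFCNT };
```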
So my final thought for you is: the journey of a thousand miles begins with a single step. The performance looks terrible right now, but there's simply no way to get rid of the GIL without starting to get rid of the GIL, and this is what starting to get rid of the GIL looks like. So I'm still optimistic. Even though the numbers are terrible, I'm optimistic that in the long run this is going to work. Thank you.

So, I think I have about five minutes left for questions. Thank you very much, DS Dad. And let's see, we have a question over here.

Did you try things other than Fibonacci, something more computationally complex, maybe?

No, nothing complicated. Again, as it stands, I've added locking around the dict object and the list object. The dict is safe to use, the list is safe to use, and numbers like integers and floats are immutable, so those are safe to use. Anything that's mutable and not in the list I just gave isn't safe to use inside the Gilectomy right now. If you try to do a computation with a set object, it's just going to blow up. So I haven't run any other programs, because I didn't think they'd be all that interesting, and again, these are early days anyway. My real hope is that, since there's a lot of work to be done around the Gilectomy adding safety to these other mutable objects, like sets and bytearrays and all those sorts of things, once we've got all those objects to be safe, then we could run any CPython program and test that. So that's really where I've spent my time instead. Yes?

Correct me if I'm wrong, but the Stackless approach from a couple of years ago, wasn't that also an approach to remove the GIL? Can you compare them?

No, Stackless never attempted to remove the GIL. Stackless has been around for a long time. The original idea with Stackless was this: say you have a Python program that's heavily recursive, and you run out of stack, and you get a stack exception (I don't remember what the exception is), because of the way function calls work in CPython. They're actually implemented using C function calls, so every time you make a function call in Python, it turns into about four function calls in C, and that builds up the C stack, and eventually they blow the C stack and you're out of memory.
If we could separate those two, so that you could make Python function calls all the livelong day and never run out of stack, then all the context for a function call lives in the frame, and now we can very easily switch between call stacks, which means we can have coroutines. That was the direction Stackless was going: separating the C stack from the Python function call stack. They haven't used that technique for a long time, actually; these days they do this crazy stuff where they take the C stack and memory-copy it off somewhere, and then they use assembly language to change stacks, changing the stack pointer and the instruction pointer and jumping into another coroutine. But Stackless is really more about coroutines anyway; it's never been about removing the GIL.

So, with the approach of, for example, asyncio and Twisted and all those asynchronous networking frameworks, which tend to handle their own scheduling and basically don't use threads: with a Gilectomy-based CPython, you could run, say, an asyncio event loop in each thread. What sort of overhead would you be looking at, just in theory, for those event loops that never, ever talk between threads?

Well, the theory is that these would be completely divorced from each other, and adding more cores would make your program scale linearly. In practice, I don't think we're ever going to get there. But the answer to that question is the answer to all the other questions about performance, which is: the Gilectomy becomes interesting at the point at which you can add cores to a program and it gets faster rather than slower, and again, it's going to be a long time before we get there. In general, how does the Gilectomy affect Twisted and asyncio and asynchronous programming things? I can only think it would be good for them, just like every other program. In particular, that sounds like a reasonably parallel workload; these things should run in parallel. The reason we don't run them on multiple cores right now is the GIL, but they're already basically parallel operations anyway. You're going to have to eat the locking overhead, of course, but you're going to be able to have multiple threads running simultaneously on the same code base, with the same local data store, all the local objects that are in CPython. I'll put it this way: if it doesn't make your program faster, then switch to the single-threaded version, and you'll be happy.

Okay, one more question, and Larry will be on the core developers panel later. Yeah, I'd be happy to answer questions about the Gilectomy during the core developers panel, which starts at 3:45 today, and which I'm chairing, so I'm forced to attend and stay for the whole time.

Thank you very much for the wonderful talk. Have you considered keeping C extension compatibility with, for example, a global interpreter lock just for C extensions, maybe with reader-writer locks?

Well, I've considered it; it doesn't work. The problem is that if you had a global lock that you only used for C extensions, you'd have all this other code that isn't paying any attention to that lock, and it would be changing state underneath the C extension. The C extension expects that state doesn't change underneath its feet, because it's holding the lock, and now your program is incorrect. So it's just
not going to work; it's a non-starter. Okay, thank you very much, Larry. Let's give him a big hand.