I'd like to introduce Armin, who's going to talk about using all these cores that we're getting, which is producing quite a challenge for our software. He's one of the cleverest people I know, so, over to Armin.

Thank you. So, I'm here to talk about PyPy STM, which is software transactional memory, and what that actually means is something that will become clear, I hope, during this talk. First, a bit of metadata. It is an ongoing research project, done mostly by two people: me and Remi Meier, who is at ETH Zurich. It is a project that has been helped a lot by crowdfunding; we got almost $30,000 over the three years of the project, so thank you to everybody who contributed. And it's a project that started at a EuroPython 2011 lightning talk. Maybe a few of you remember: I was there, and I presented in five minutes how it would work. So yes, this is the result, three years later, basically.

So, first question: why is there a global interpreter lock when you run your Python code? Well, it came about for historical reasons, but now it's really deeply built into the CPython implementation. CPython started as a single-threaded program, and the easiest way to change a program to make it run on multiple threads is just to put a lock around everything, right? You have one lock, called the global interpreter lock, and it needs to be acquired in order to run any piece of code. The result is that, yes, you can have multiple threads in Python code, but they are not being used for parallelism; you use them for concurrency. The difference between these two terms is that you can use the fact that you have several threads to run independent pieces of code, or to call some external code, things like that, but they are not actually running in parallel. This was really done because it was the easiest way to change the interpreter, and it's also very convenient for reference counting and all these kinds of internal details that we then don't have to care about.

This has positive and negative consequences. On the positive side, it's simple for the implementer, but it also has consequences that are visible to you, the user of Python. Some operations are atomic, for example list.append, or setting an item into a dictionary, these kinds of things. You know that even in a multi-threaded program, another thread that also tries to update the same dictionary is not going to mess up the internal state of the dictionary, which would lead to completely obscure crashes, et cetera. No, these do not occur in Python. All variables are volatile, basically, is what I could say here; and if you don't know what volatile means, that's perfectly fine. Now, it has negative consequences, of course. The completely obvious one is that you don't have parallelism: you cannot actually write a parallel multi-threaded program in Python. There is another consequence: the global interpreter lock exists, and it's fundamental, but it's not exposed to the application, which means that you get some of the benefits, like the atomicity of basic operations on lists and dictionaries and this kind of stuff, but you don't get larger-scale atomicity, so you still basically need locks, even in your Python program.
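To make that concrete, here is a small illustration of what the GIL gives you and what it doesn't (a made-up example, not from the talk): a single operation like list.append is atomic, but a read-modify-write sequence is not, so it still needs a lock.

```python
import threading

items = []
counter = {"value": 0}
counter_lock = threading.Lock()

def worker():
    for i in range(100_000):
        # Atomic thanks to the GIL: another thread cannot corrupt the
        # list's internal state, and no appended item is ever lost.
        items.append(i)
        # NOT atomic: reading, adding 1 and storing back are separate
        # steps, so without a lock two threads can lose updates.
        with counter_lock:
            counter["value"] += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(items) == 200_000
assert counter["value"] == 200_000
```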
So you need locks, and then you get all the hard parts of locks, like deadlocks and everything. So how do we remove the GIL? Well, there are three approaches, I think. The first is called fine-grained locking, the second is shared nothing, and the third one, which I'm going to present, is transactional memory.

So, to contrast these three approaches: fine-grained locking says, we have this interpreter which has the GIL, so let's kill the GIL, and instead we are going to put very fine-grained locks on each individual object. This is a lot of work for whoever is implementing, I mean updating, Python. And if you are talking about CPython, there are tons of nasty issues, like for example reference counting, because now you can have several threads that will, in parallel, try to update the reference count of the same object. So what do you do? You need atomic updates, but atomic updates are slow on processors, et cetera, et cetera. So there are tons and tons of issues. The point is that it's an approach that actually exists: Jython and IronPython are doing it, basically, but they, as in the maintainers of those implementations, are benefiting a lot from the fact that the JVM and the .NET platform have really good support for exactly that. For example, the JVM HotSpot will happily do lock removal on any object that it can somehow prove did not escape the current thread. Even in the cases where it cannot prove things, it does kind of crazy things: there are five different cases, the first case is checked very quickly, and then you fall through to slower and slower cases, et cetera. Basically they are doing crazy things, and it's fine. It's fine, but you still need application-level locking; it does not solve that. If you are writing a Python program and you are running it on top of Jython, let's say, you still need to carefully use threads and locks and everything in your Python program.

So this is why there is a completely different approach that has some traction nowadays, also because it can be used on CPython: it's just shared nothing. Basically, the approach is that if you want to run some piece of code on multiple cores of your machine, you just start several processes, and then you need to design your program in such a way that it is possible to exchange some amount of data that is not too large. It is a model that some people think is a good idea, and indeed there are use cases for which it is a good idea. It gives a clean model for a multi-core program, and you don't have the issue of locks. On the negative side, you have limitations on what kind of data you can exchange between processes, you have actual overhead for exchanging the data, and it's not compatible with an existing threaded application. It's still a good model, but it's not a 100-percent solution, basically.

So this is now what I'm talking about: transactional memory is a way to run the interpreter as if it had a GIL, a global interpreter lock. The main difference is that instead of blocking, so when you have a lock that several threads acquire, it usually means that all threads but one have to wait and only one thread can proceed, here, with this kind of lock, you can still run all threads.
But you do so in a particular way: you run all threads optimistically, and then you need some bookkeeping to track what each thread really reads and writes, and at some point you need to check whether they actually did something that conflicted or not. The hope is that in the common case you did not actually get a conflict, and in that case everything works fine and you succeeded in running your multiple threads in parallel. In the hopefully rare case of a conflict, you need to cancel and restart one of the threads. So this is a very, very high-level view.

I should mention that here I'm talking about STM, which is software transactional memory. There is actually HTM as well, hardware transactional memory, which is implemented in real CPUs: the latest generation of Intel processors has HTM. You also have hybrids, which are clever implementations of STM that use HTM internally for some things. Most of these three kinds of solutions are still mostly research-only because, well, STM so far has a huge overhead: the cross-checking of memory conflicts and so on typically makes it at least 2 times slower, and more like 4 or 5 or 10 times slower, per core. Then you have HTM, hardware transactional memory, which is in theory great but in practice far too limited, at least so far, in the current generation. We tried to write a PyPy HTM, and the result is that a transaction can barely cover one bytecode, and even then it only works with a bit of luck, when that bytecode is not too long. Okay, so that's why I'm still focusing mostly on STM for now, really the software part.

So yes, here in this slide I wrote "easy" in quotes, so let me explain why it's easy, and which part is actually easy. The easy part is to go inside PyPy and replace the GIL, because you just replace the places that call GIL acquire and GIL release by, respectively, "start a transaction" and "stop a transaction". Easy. Now the hard part is to actually write it as a library: how do you write all this STM handling code? The point I'm making here is that, if you actually think about it, GIL acquire and GIL release are also not completely trivial. Acquiring a lock is done with a library, for example on Linux the pthread library, and the pthread library itself is also a bit crazy. How do you do a lock naively? Just one word, and I put zero for not locked and one for locked. Okay, naive. Yes, it works, but it's not how it's actually done at all, because it's possible to optimize it far more using clever techniques. So here I'm trying to make the same argument: it's easy to add correct calls to start and end a transaction in PyPy, but the hard part is to write the library.

Okay, so here I'm presenting PyPy STM, but I should mention that this library, the hard part, could actually also be used in CPython, which would be great; then we would get a CPython with multiple cores, cool. But there is one catch: what do you do about reference counting? That's the main catch. There are solutions, but they are involved, hard, messy, et cetera. Okay, I have a nice diagram to show how it works.
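To give a feel for that bookkeeping, here is a toy sketch of optimistic transactions written in plain Python. It is purely illustrative (a made-up construction, nothing to do with how PyPy STM is really implemented, which works at a much lower level): reads and writes are logged, writes stay local, and at commit time we either publish everything at once or detect a conflict and rerun the whole block.

```python
import threading

class Conflict(Exception):
    pass

_commit_lock = threading.Lock()    # held only briefly, while committing
_versions = {}                     # key -> commit counter, bumped on every write

class Transaction:
    def __init__(self, store):
        self.store = store         # the shared dict we pretend is "memory"
        self.read_versions = {}    # key -> version seen when first read
        self.local_writes = {}     # key -> new value, kept private until commit

    def get(self, key):
        if key in self.local_writes:
            return self.local_writes[key]
        # Optimistic read: remember which version we saw; re-checked at commit.
        self.read_versions.setdefault(key, _versions.get(key, 0))
        return self.store.get(key)

    def set(self, key, value):
        self.local_writes[key] = value

    def commit(self):
        with _commit_lock:
            # Validate: did another thread commit a newer version of anything we read?
            for key, seen in self.read_versions.items():
                if _versions.get(key, 0) != seen:
                    raise Conflict(key)
            # Publish: make all local writes visible to everybody at once.
            for key, value in self.local_writes.items():
                self.store[key] = value
                _versions[key] = _versions.get(key, 0) + 1

def atomically(store, func):
    """Run func(txn) optimistically; cancel and rerun it on conflict."""
    while True:
        txn = Transaction(store)
        result = func(txn)
        try:
            txn.commit()
            return result
        except Conflict:
            continue               # the hopefully rare case: redo the work

# usage: increment a shared value atomically
store = {"counter": 0}
atomically(store, lambda t: t.set("counter", t.get("counter") + 1))
```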
So, the basic idea is that in this diagram I have two threads, and I want to run things, each shown as a horizontal blue box. If I was running this code with a normal Python with a GIL, then we would get a diagram where the boxes cannot run in parallel, so they take more time in total. Here with STM, they actually run in parallel, but the point is that each thread runs a bit independently, by being very careful about what it changes: mostly, all its changes are kept local. That is what happens in the first part of each box, and then at the end of each transaction, each box, there is a special phase that this time needs to be synchronized across all cores, where we push all the changes so that all cores will see them. The effect we get, for the programmer, is as if, in this case, we had three independent pieces that ran here, here, and here, only. So basically the three pieces run one after the other, serially: you have the first piece, the second piece, the third piece, and it just happens that the preparation for the commit of the transaction occurred before, maybe a lot of work before. So as a model, when you think about how it works, it is still essentially a serialization: you still have one thread, then another thread, then the first thread again, each one committing in turn, which means producing a result that you can see. In this sense, it's exactly the same as the GIL. So PyPy STM works, feels, exactly like a regular PyPy with a GIL: you don't have any additional issues, any additional races, conflicts, et cetera.

Okay, a small demo. I don't have PyPy STM on this laptop, so I will just go through the sources. This is a demo that was posted in the comments of the latest blog post, so it's not from me, basically. You have this is_prime function, and you want to compute how many of the numbers up to 5 million are prime; you do it like this. Then, if you want to do the same thing on multiple processors, there are several ways. For example, you can use the multiprocessing module; then it looks like this. So this is using the multiprocessing module, doing the same thing in ranges, so it's really using several processors. This happens to work because this is_prime is simple enough. However, it's not actually doing the same thing, because if somewhere in the prime.py file there was some global state or something like that, then it would not work the same way any more, because now you are running two different processes. Okay, so I cannot actually run them and show the times, but it's on this order, you have to believe me: this one runs in about six seconds, typically 6.2 maybe; this one runs in something like five seconds. There are reasons for that; it's not twice as fast. And then you can write a version using multiple threads. This is just bare-bones threads: it's importing thread at the bottom and starting two threads, and then you have a queue to communicate the ranges, and every thread reads from the same queue to get the next range to do. If we run this on top of a regular PyPy, then it gives us something like eight seconds.
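A sketch in the same spirit as the threaded version just described (the names and the chunk size here are made up, not taken from the original prime.py): two bare-bones threads pull ranges to check from one shared queue.

```python
import threading
import queue

def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def worker(work, results):
    while True:
        chunk = work.get()
        if chunk is None:              # sentinel: no more ranges to do
            return
        start, stop = chunk
        # list.append is one of the operations that the GIL (and PyPy STM)
        # keeps atomic, so no extra lock is needed to collect partial counts.
        results.append(sum(1 for n in range(start, stop) if is_prime(n)))

work, results = queue.Queue(), []
for start in range(0, 5_000_000, 100_000):
    work.put((start, start + 100_000))

threads = [threading.Thread(target=worker, args=(work, results)) for _ in range(2)]
for _ in threads:
    work.put(None)                     # one sentinel per thread
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))
```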
So that's like the six seconds, but with a bit more, because there is a bit of overhead. And if you run it on top of PyPy STM, then we get 4.8 seconds, which is the fastest I've described so far, which is cool. Okay. Now, here I'm cheating a bit, basically, because yes, this example runs faster with two cores already, which is great; however, all the numbers here have been carefully tweaked to show this, right? If you change some detail somewhere, then it's three times slower. But the point is that we are getting there.

Okay. Now, what I just showed is: yes, you can use multiple threads, and it's cool, and it's faster. Good. However, the real point I'm trying to push forward with PyPy STM is not just that; it's not "yes, you can use threads and drown in your maze of locks and debug them forever". The real point is that you can use threads with very coarse locking. You can have two transactions that run optimistically in parallel. Here is another time the exact same diagram; however, now every block is no longer just from GIL acquire to GIL release at the end. Now it's from "acquire some lock that you define in your application" up to "release this lock that you define in your application". Technically there is no difference, right? It's just one lock versus another lock. However, the difference is that you can now start to think about your application like this: it starts multiple threads, and it uses just one lock, this time an explicit lock that you imported from the thread module, but just one. And whenever any thread wants to do something, it acquires this lock. That's something that makes no sense at all in theory, something you wouldn't do in normal Python, because then why use threads at all? But the point is that you can do it, and PyPy STM can then optimistically try to parallelize. So what I'm pushing forward here, the kind of program that I could foresee being written with PyPy STM, is really: yes, it's still using threads and locks, but you put them in some corner of your application. You have your program with just one lock, so completely coarse-grained locking, extremely so.

So, an example: here is an application that is traditionally not... ah, I don't have it here, I'll have to wave my hands, sorry. This demo is about a Bottle web server; it could be Twisted, it could be Tornado, whatever. So it is a web server, which is traditionally not using threads: you get an HTTP request, you process it, maybe you do some complicated computation, and then you push out the answer. The point is that you can just take one of these frameworks like Bottle and add a thread pool on top of it. For every incoming request, you ask the thread pool, by pushing something into a queue, "now please process this request"; that thread actually processes the request and sends the answer back to the main thread with another queue, for example. If you do that, then PyPy STM can actually run this program on multiple cores in parallel. And the point is that each of these single pieces that run in the thread pool is run by acquiring one lock, one global lock, so the different pieces appear to run one after the other.
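A minimal sketch of that programming model (an illustration, not the actual Bottle demo; names like handle and big_lock are made up): a pool of threads where every unit of work is wrapped in the same single, application-wide lock.

```python
import threading
import queue

big_lock = threading.Lock()          # the one coarse-grained, application-wide lock
requests = queue.Queue()
responses = queue.Queue()

def handle(request):
    # stand-in for the "complicated computation" done per request
    return sum(i * i for i in range(request))

def worker():
    while True:
        request = requests.get()
        if request is None:
            return
        with big_lock:                # the whole unit of work runs under one lock,
            result = handle(request)  # so the units appear to run one after the other
        responses.put(result)

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()
for r in (10_000, 20_000, 30_000):
    requests.put(r)
for _ in pool:
    requests.put(None)
for t in pool:
    t.join()
# Under a regular GIL this one big lock serializes everything anyway, so the
# program is correct; the bet is that PyPy STM runs the locked blocks
# optimistically in parallel and only serializes them when they really conflict.
print([responses.get() for _ in range(3)])
```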
So it works just like you did not have a thread pool at all. But in this example, most pieces will clearly be able to run independently from each other. So yes, in summary, what I'm trying to say is that the PyPy STM programming model gives you threads and locks that are fully compatible with the global interpreter lock. But I'm not saying "now everybody should use threads and locks". I'm saying you should make, or use, a thread-pool library and use only coarse-grained locking, because that's enough for PyPy STM.

So you have three different kinds of applications that are immediate. You can have a multiprocessing-like interface where you use a pool of threads; I think multiprocessing actually has a thread option instead of the process option, but that's a bit pointless normally, because it's not running things in parallel. The point is that here you could have a multiprocessing-with-threads that works as I said, by acquiring one lock. You can extend Twisted, Tornado, or Bottle like I explained. You can also, and this is maybe a bit further down the road, but it's always the same idea, take a Stackless or greenlet or gevent system, a system where you have coroutines, basically, and coroutines tend to do something independent from each other. So you can again do the same thing: this time you acquire and release a lock around the execution of one atomic piece of the coroutine, which means from one switch to the next, for example. The end result would again be something that continues to work exactly the same way: you take your existing Stackless application and it continues to work, but on multiple cores.

Yes, this is the current status. The basics work. The best case is 25 to 40 percent overhead, which is much better than originally planned; it's really good enough that usually with just two threads it's already faster. That's the overhead when running only one thread. And well, everything I just said about the application-level locks working the same way as the global interpreter lock is actually wrong for now, but it should become true soon; right now we have a workaround with "atomic" things that I won't explain, it's temporary. And yes, there are tons and tons and tons of things to improve.

So, as a summary, this approach has the potential to enable parallelism in CPU-bound multi-threaded programs, or it can be used as a replacement for multiprocessing, et cetera. It can also be used in applications that are not explicitly written for that: basically anything that could potentially be replaced by a call to multiprocessing.Pool.map. The benefit is that you keep the locks coarse-grained; however, the issue is also that you keep the locks coarse-grained. This is something that has other issues that we, as in me and my co-worker, have not been very clear about so far. Will this work nicely, completely, in every case? I don't think so, actually, because you have the issue that things which should run in parallel sometimes actually don't, because, for example, both pieces increment some global counter. That is enough to make the pieces conflict, and if they conflict, they are serialized again instead of running in parallel. This is an example of a systematic conflict.
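To make the notion of a systematic conflict concrete, here is a tiny made-up example in the same one-big-lock style: the actual work in each block is independent, but every block also bumps one shared counter, which is enough to make any two blocks conflict under STM and fall back to running serially.

```python
import threading

big_lock = threading.Lock()
stats = {"processed": 0}           # one global counter shared by every unit of work

def unit_of_work(n):
    return sum(i * i for i in range(n))

def worker():
    for _ in range(100):
        with big_lock:
            unit_of_work(10_000)       # this part is independent between threads...
            stats["processed"] += 1    # ...but this write touches the same object in
                                       # every block, so under STM any two blocks
                                       # conflict and get serialized again

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stats["processed"])
```

A typical fix would be per-thread counters that are only summed up at the end, so that the independent parts really stay independent.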
So it means that if you take your program, apply what I said, and expect it to go just n times faster, then it may actually not go n times faster at all, and the reason is probably some systematic conflict. So this is something we need tools to debug: we need a way to find where the conflicts are and figure out, "ah, here I'm incrementing this global counter, let's fix it in this way or that way". Debugging tools, profiling tools, et cetera: all this is not done yet, basically, and all this will probably be very much needed. So if you want to compare this approach with the standard multi-threading approach: in standard multi-threading everything is fast, cool, but then things crash if you don't lock correctly. Here, at first everything is slow but works correctly, and then you can improve, by detecting the conflicts and so on, but everything still works correctly, the same way as if you had only one thread. Which I think is a very good approach, basically, especially for languages like Python. Yes, the performance I already mentioned. And no, it's not production-ready; it's still alpha, getting there. You can download the first release at this URL. It works on Linux 64-bit for now. And yes, there is crowdfunding. Thank you.

Thank you, Armin. We've got time for a couple of questions before the next session. If you've got a question, will you go to one of the microphones? Thank you.

Hi. When you say coarse locking, how coarse or how fine-grained can it be? For example, the logging library has a global lock internally. Does it mean that logging something once a second would already hit this performance penalty, or is it still okay? What are the lengths of the atomic blocks, where's the threshold, basically?

Sorry, I can't hear it. How long is it, in time? Okay, so how long can a coarse-lock transaction run? The point is that in PyPy STM I try to allow arbitrarily long transactions; there is no upper limit. And if you really make them too short, then you effectively just get the GIL: the transaction will simply be kept for a longer time anyway. So the limit is not important for the programmer. There is a lower bound, like with the GIL, and that size is tweaked to get the best performance: tweaked to be long enough that making it even shorter would introduce too much overhead.

Can you hear this? So if you're running just the pure interpreter, is the overhead pretty much non-existent? Right. The overhead is... I can't tell so far, because it's in development. How I foresee it is that the overhead should be something like 25% everywhere. Yes, the JIT will remove some of the overhead, but then the baseline is also much faster, so it's harder to improve. So it turned out to be somewhat similar.

So why do you need to remove reference counting from CPython? Yeah, maybe I will take that question offline, because it just asks for a more complex answer. We'll come back to it.

And then what do you do with... yes, okay, the question is: how do you roll back transactions that already had side effects? The point is that a transaction should not have side effects. And that actually fits very nicely into the model of CPython, because if you are going to have side effects, like writes to a file, you know you will release the GIL. So here it means you end the previous transaction, and the write is actually done outside any transaction. Okay, the other microphone over here.
In Haskell's STM, there are combinators such as retry, where you can take the STM transaction and say, okay, I would like to do this again from the beginning. Do you intend to expose something like that? No, because that is low-level. Here the real goal is to have STM internally and not expose it at all in the language. Good. Thank you again, Armin.