So, hi everybody. My name is Victor Stinner. I work for Red Hat to maintain Python downstream, which means on Fedora and Red Hat Enterprise Linux. But I also maintain Python upstream; for example, I fix the CI to make sure that it is always green. And I'm here to talk about Python performance: past, present, and future.

So I would like to start my journey with the very beginning of Python and what has been tried in the past. Python has many different implementations. The first one was created by Guido van Rossum almost 30 years ago. It is called CPython; the C comes from the C language, because it's written half in Python and half in C. A few years later, a second implementation was written in Java. It was first called JPython, but later renamed Jython. This one was created by Jim Hugunin. For concurrency, a project was created with the name Stackless Python. Stackless means that you are able to switch between coroutines, and Stackless Python makes that efficient. This one was created by Christian Tismer. The same Jim Hugunin also created IronPython, which is written in C# on the Microsoft .NET runtime. And more recently, we saw MicroPython, which is designed for microcontrollers. This one was created by Damien George.

To make Python faster, we have a long list of different optimization projects. The first one was a JIT compiler created by Armin Rigo called Psyco. It worked on a single function at a time, using a decorator: when you call the function for the first time, it is JIT-compiled, and the next calls are faster. But Armin saw that this design is not really the most efficient way to optimize Python. So later, a research project was funded by the European Union, and the PyPy project was created, which is a JIT compiler for Python. Google also had engineers creating Unladen Swallow, which promised to make Python five times faster. Dropbox also tried to make Python faster using an LLVM JIT compiler called Pyston; I think they had one or two people working on that for a few years. And Microsoft also had a project called Pyjion. If you look at the dates, you can see that most of these projects have an end date. So we will try to figure out why they have an end date.

When you create a new Python implementation, you have two main approaches. The first one is to start from the current CPython, and the other one is to start from scratch. Starting from CPython is the approach chosen by Unladen Swallow, Pyston and Pyjion. When you do that, the cool thing is that you directly get support for all existing code, because it's still CPython: you support the C extensions, everything works, and you can just put your changes on top of CPython. But you also inherit all the legacy code and all the old design of CPython, which, again, was created almost 30 years ago. Some technical choices made 30 years ago made sense at the time, but today our CPUs have many cores, and that old design doesn't scale well. For example, in CPython we have something called the global interpreter lock, or GIL, and the GIL basically limits you to one thread. We also expose specific C structures, and we use reference counting and a specific garbage collector, which also prevent us from implementing some kinds of optimizations.
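To make the GIL limitation concrete, here is a minimal sketch (my own illustration, with made-up numbers): three CPU-bound threads that, on CPython, run no faster than one, because the GIL serializes the bytecode.

    import time
    from threading import Thread

    def countdown(n):
        # Pure-Python CPU-bound work: the thread holds the GIL while it runs.
        while n:
            n -= 1

    N = 10_000_000
    start = time.perf_counter()
    threads = [Thread(target=countdown, args=(N,)) for _ in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # On CPython this takes about as long as countdown(3 * N) in a single
    # thread: the three threads never run bytecode at the same time.
    print("elapsed:", time.perf_counter() - start)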
On the other side, if you start a new implementation from scratch, the design chosen by PyPy, Jython and IronPython, you don't inherit all this legacy code and all this old design. You can do whatever you want. For example, Jython and IronPython don't have a GIL, so they are able to scale on multiple CPUs from the start, thanks to the JVM and to the Microsoft .NET runtime. Another example is that PyPy doesn't use reference counting internally; it uses a tracing garbage collector, which is more efficient. But when you start from scratch, the main drawback is the C extensions: either you don't support them at all, which can be an issue depending on your kind of application, or supporting them is slower than CPython. For example, PyPy has a module called cpyext, which creates CPython objects on demand. When you access something which comes from a C extension, they have to emulate the CPython structures and API, and these emulated objects must be kept synchronized with the PyPy objects. It works, but the implementation is quite complex and maybe not the most efficient way to interface with C extensions.

Another issue for a different implementation of Python is that you are in competition with CPython. CPython has almost 30 active core developers maintaining it, just to review the pull requests and to work on it directly, and obviously we have way more contributors proposing changes. Yet another issue is that all the new features land first in the CPython code, which means that the other implementations have to catch up with CPython to get these new features. So why would a user prefer an outdated or incomplete implementation, and who will sponsor the development of another implementation?

What is very interesting with the previous projects that I listed to make Python faster is that they wrote summaries of why they decided to stop, or why the project stalled. In the case of Unladen Swallow, there are three main reasons. The first one is that at Google, the Python language is not really used for performance-critical code. So making Python faster is nice, but it was not a priority for Google. Also, having a different Python interpreter caused deployment issues: it was too difficult to deploy, and being a drop-in replacement was not enough. And I think that, for me, the most important reason is that their potential customers eventually found other ways of solving their performance problems. Because in Python, once you are able to identify the bottleneck of your application, there are many options available to make it faster; you don't need a new interpreter for that.

Another optimization project was Pyston, which was developed for three years. Again, they wrote a report explaining why they decided to stop, and again, one reason was that at Dropbox, they started to rewrite the performance-critical code in other languages, such as Go. They were also very optimistic at first about the optimizations that would be possible to implement in Pyston, but they figured out that if you would like to run any kind of Python application, backward compatibility, meaning behavior very close to CPython, is very difficult to get right, and difficult to make fast at the same time.

So to summarize: CPython remains the reference implementation, but it shows its old age. Multiple optimization projects failed. And PyPy is a drop-in replacement which is around four times faster, but it's not widely adopted yet. So I would like to ask why.
Okay, let's move to the present. Again, when you identify the bottleneck of your code, you have different options to make it faster. First of all, please try PyPy, because for many users it just works: you replace Python with PyPy and your application becomes twice as fast, or ten times as fast, or even more. It really depends on your workload, on the kind of functions, on what you do in your application. There are many users who just replaced Python with PyPy and it was super fast. And the very nice part of PyPy is that it is highly compatible with CPython.

But there are some issues which explain why it's not widely used. I think the first reason is the support for C extensions: as I explained, it's a little bit slower than CPython, even though it has been heavily optimized in the last year; I would like to come back to that later. It also has two other issues. One is the memory footprint, which is a side effect of the JIT compiler: the JIT compiler uses memory by itself, and you have multiple versions of the code in memory, so for some workloads it can be an issue. Another, smaller issue is the startup time. When you have, for example, a command-line interface, you may want it to run fast, but the JIT compiler makes startup a little bit slower than CPython. In a case like Mercurial, where you run the same command many times, one solution is to have a server running in the background, and a client which just connects to the server. This is the design chosen by Mercurial, and PyPy is efficient in this case.

Now I would like to come back to the very infamous global interpreter lock. To try to explain it, let's say that you have three threads in your application and they are all CPU-bound: for example, they compute something with integers or floating point numbers. Because of the GIL, even if you have things called threads, technically Python threads, CPython is only able to run one of them at a time. So the efficiency is only one third here, if I have three CPUs. But you have to understand that the GIL doesn't prevent you from writing efficient code, because the issue only exists for CPU-bound code. If your workload instead spends its time in C functions which don't require the GIL, the GIL can be released, and that means you can run threads in parallel. For example, when you read a file from the disk, Python releases the GIL for you. When you compute a hash with hashlib, the GIL is also released. Or when you compress data with bzip2, for example, the GIL is also released. So in practice, there are many cases where you can really use multiple threads and the efficiency is optimal.
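As a minimal sketch of that last point: hashing large buffers in threads does scale on CPython, because the C code in hashlib releases the GIL for buffers larger than roughly 2 KiB.

    import hashlib
    import time
    from threading import Thread

    data = b"x" * (128 * 1024 * 1024)  # 128 MiB of dummy data

    def digest():
        # hashlib releases the GIL while the C hashing code runs on a
        # large buffer, so these threads really execute in parallel.
        hashlib.sha256(data).hexdigest()

    start = time.perf_counter()
    threads = [Thread(target=digest) for _ in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Roughly three times faster than hashing the three buffers one
    # after the other, unlike the pure-Python CPU-bound example earlier.
    print("elapsed:", time.perf_counter() - start)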
But to come back to the CPU-bound issue, there are solutions, and one easy way to use all your CPUs is the module called multiprocessing. It makes it way easier to spawn multiple tasks in different processes, and each process has its own GIL. Thanks to that, you are able to distribute the workload over all your CPUs, and again, the efficiency is 100 percent. So the multiprocessing module works around the GIL limitation.

And there are two very good pieces of news for you in the next release of Python, Python 3.8. First, shared memory is now supported. This is very important when you exchange a lot of data between the main process and the workers: if you exchange, for example, a very large NumPy array, instead of having to copy the large array between the processes, you can just put it in shared memory, and all processes will see it immediately. There is also an optimization in the pickle protocol. pickle is the serialization module used by multiprocessing to distribute data across the workers, and a modification has been merged in Python 3.8 which avoids memory copies: you can take your array and write it into a socket without duplicating the object. Previously, you had a very high memory usage peak just for the serialization; now it's more efficient.
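Here is a minimal sketch of the new Python 3.8 multiprocessing.shared_memory API with a NumPy array; the array data lives in the shared block, so the worker modifies it without any copy.

    from multiprocessing import Process, shared_memory
    import numpy as np

    def worker(name, shape, dtype):
        # Attach to the existing block by name: no copy of the array data.
        shm = shared_memory.SharedMemory(name=name)
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        arr *= 2  # the parent process sees this change immediately
        shm.close()

    if __name__ == "__main__":
        # 1,000,000 float64 values = 8,000,000 bytes of shared memory.
        shm = shared_memory.SharedMemory(create=True, size=8_000_000)
        arr = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm.buf)
        arr[:] = 1.0
        p = Process(target=worker, args=(shm.name, arr.shape, arr.dtype))
        p.start()
        p.join()
        print(arr[:3])  # [2. 2. 2.]
        shm.close()
        shm.unlink()  # free the block once nobody uses it anymore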
Another option to optimize Python is to use Cython. The nice thing with Cython is that you can take your current Python code and compile it with Cython. It's an ahead-of-time compiler, so the compilation is only done once, and after that you distribute the compiled code; it's not a JIT compiler. If you just do that, your code is already a little bit faster. But if you add some annotations about the types, Cython can generate very efficient code, because it knows the internals of Python and it can rely on the types to choose the most efficient way to execute your code. Another nice thing is that Cython handles the C API for you: you don't have to worry about, for example, the Python version, and you don't have to worry about reference counting, which is tricky to get right. Like manual memory management, it's better to let a tool do it for you. So it's a very nice way to write a C extension, and I suggest you use Cython instead of using the C API directly.
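For example, here is a small function in Cython's "pure Python" mode, a sketch assuming a recent Cython is installed: the cython type annotations let Cython compile the loop to plain C arithmetic, and the same file still runs unchanged under plain CPython.

    import cython

    def harmonic(n: cython.int) -> cython.double:
        # With these annotations, Cython compiles the loop to C doubles
        # and ints instead of creating a Python object per iteration.
        total: cython.double = 0.0
        i: cython.int
        for i in range(n):
            total += 1.0 / (i + 1)
        return total

You compile it once, for example with "cythonize -i harmonic.py", and then import the resulting extension module as usual.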
If your application mostly uses NumPy, you have other options. For example, Numba is a JIT compiler specialized for NumPy: it translates a subset of Python and NumPy into fast machine code. Fast code means, for example, that you are able to take your function and execute it with the GIL released, and thanks to that, you are able to distribute a task over different threads and run them in parallel; Numba makes that really easy. And it's not only about threading, it's also about SIMD, single instruction, multiple data, vectorization. It means that a single CPU instruction processes multiple data items, which is very efficient when you have code which crunches a lot of numbers, especially floating point numbers. For example, the CPU supports SSE, AVX, or the newer versions of AVX, and with that you can make the code up to, for example, eight times faster. You can also use GPU acceleration with Numba, which means taking your code which looks like Python and executing it on your GPU, because GPUs are really, really fast at floating point computation. It supports NVIDIA CUDA, but also AMD ROCm.
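Here is a minimal Numba sketch of the threading side: with parallel=True, the prange iterations are spread over threads with the GIL released, and fastmath=True allows the floating point loop to be vectorized with SIMD instructions.

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True, fastmath=True)
    def mean_square(a):
        # Compiled to machine code on the first call; the prange
        # iterations run on multiple threads, and the loop body can be
        # auto-vectorized with SSE/AVX instructions.
        total = 0.0
        for i in prange(a.size):
            total += a[i] * a[i]
        return total / a.size

    a = np.random.rand(10_000_000)
    print(mean_square(a))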
One of the issues that we had in Python one or two years ago was that we got many proposed optimization changes, but we were not able to decide whether a given change would make Python faster or slower, because when we ran the benchmarks, they sometimes said slower, and if we ran them again, they said faster. It was really hard to take a smart decision. So I spent time working on the benchmark suite, to make it much more stable and to be able to reproduce the results. Thanks to that, we now have the speed.python.org website. Here you can see the performance of the decimal module over four years: telco is a benchmark which computes a lot of numbers using the decimal module. The good news is that if the curve goes down, it means it's getting faster. And this is very important for us, to be able to accept or to reject an optimization.

To summarize the present: PyPy doesn't require any code change, so please, again, just try PyPy on your code. The multiprocessing module scales with the number of CPUs, if you are able to distribute your workload over sub-processes; the issue is that you have to serialize data, which can be a little bit expensive, but now we have shared memory and a faster pickle. You should use Cython and not the C API directly. And Numba makes NumPy faster.

So let's move to the future. I would like to come back to a point which has become very important for me: the Python C API. As we saw, the C API is causing a lot of trouble with C extensions, especially on PyPy, and I think we have to fix this issue to make Python usable by everybody. To explain the issue, you have to know that in the early days of Python, the C API evolved organically, which means there was no clear design of what should be public, what should be private, what should be exposed or not. Because of that, we exposed many internal functions by mistake. The design was basically that Python is made of many C files, and just to use a function defined in one file and call it from a different file, you have to expose it somehow; and for convenience, it was easiest to just expose everything. At the beginning it was only used inside Python, but some people saw that it would be interesting to use it outside Python, and they started to write C extensions using it. This is also part of the success of Python: because you are able to reuse all existing C code, it's really easy to put Python on top of it, and it made Python very successful, for example in the scientific world with NumPy. But because of this initial design of the C API, we exposed way too many implementation details.

Before going into the details, the first good news is that the situation got better in Python 3.8, which is the next release of Python. Previously, all the header files were in the same Include directory, which means that if you wanted to hide a function from the public C API, you had to opt out using an #ifdef block to say: oh, this part is private, don't use it. We also have something called the stable ABI, or stable API, and to declare functions which are not part of this API, again, you had to opt out using an #ifdef. Because of this opt-out design, sometimes we added functions to the stable API by mistake, or we exposed a private function by mistake. The solution, a work started by Eric Snow which I continued in Python 3.8, was to create subdirectories: an Include/cpython/ directory for the API specific to CPython, and an Include/internal/ directory for the internal API that you should not use. We decided to expose the internal API anyway, because for very specific use cases, like debuggers or profilers, you may want to access CPython internals; when you inspect the internals of Python, sometimes you cannot execute code, you cannot call functions, so we have to expose these structures. And now the Include directory itself only contains what I call the stable C API, stable meaning that you should be able to use the same API across multiple Python versions. During this work, we succeeded in moving many private functions from the public headers to the internal headers, and we also started to move some structures, like the private interpreter state, to the internal API. So slowly, we can hide more and more implementation details.

Another piece of good news is related to the ABI. When you take your extension and compile it, you get a binary file, and at the binary level there is something called the ABI, the application binary interface. The promise of Python is that if you use the stable ABI, you are able to use the same binary on multiple Python versions. But there was an issue: when you wanted to debug an application, it was very painful to use a debug build of Python, because its ABI was different. If you have a Linux distribution and you would like to use a debug build, you are not able to load any of your extensions, because they were all compiled in release mode with a different ABI. Because of that, you had to recompile all the extensions. If an extension doesn't have many dependencies, it's fine; but if the extension is something like GTK, which has a lot of dependencies and requires special compilation flags, it can be painful to recompile manually. The good news is that in Python 3.8 the ABI is now the same, so you can just use a debug build of Python and you don't have to recompile anything. And a debug build is a build with many more sanity checks, which is really efficient at detecting bugs in extensions.

To come back to the C API issue, and to try to explain why it causes us so much trouble, you have to understand that the C API is not an issue only for PyPy: the C API is an issue for CPython itself. Because of the C API, we cannot implement many simple or obvious optimizations. I use the example of specialized lists, because I can explain two different problems of the C API with this one example. In CPython, when you create a list, in memory it's basically an array of pointers; technically, pointers to PyObject. It means that if you have a list of integers, you first go to the list, and from the list you follow a pointer to get to the number: there is an indirection. It also uses more memory, because you have the array, and for each item you have a second object, so the memory footprint is not the most efficient. In PyPy, they managed to implement a different strategy: specialized lists. For a list of integers, the array of the list is directly an array of integers. You don't have the second object, you don't have the indirection, everything is directly in the list. It's a very smart strategy: it's very memory efficient, and you can also specialize your code for this kind of list, doing operations directly on the integers.

But can we do that in CPython? Can we modify the PyListObject? It's not really easy. There are two problems which prevent us from doing it. The first problem is that there is a PyList_GET_ITEM() macro which reaches directly into the array: you have your array of pointers, and the macro accesses the array directly. Extensions must not access the structure directly, because if they do, we cannot modify the structure: if the array becomes an array of integers instead of an array of pointers, the macro reads the wrong type and you get a crash. Maybe we could modify the macro to say: oh, if it's a specialized list, return an integer, and if it's a list of objects, return an object. But there is another issue with that, because the PyList_GET_ITEM() macro returns a borrowed reference.

To explain what that is: usually, when you call a Python C function, for example PyLong_FromLong() to create an integer or PyUnicode_FromString() to create a string, you get something called a strong reference to the object. It means that when you are done with the object, you have to call Py_DECREF() to say: I'm done, Python, you can release the memory. A borrowed reference is something different: it's a pointer to the object, but you must not call Py_DECREF() on it, because if you do, you may get a crash. So what if we modified the PyList_GET_ITEM() macro to create a temporary object? Imagine that we have this specialized array of integers, and when you access the array, we create a PyObject for that integer. Maybe we could do that in the macro. But what happens to this temporary object? When should it be destroyed? We don't know. That is the issue with borrowed references: we don't track the lifetime of the objects, and this is a big issue when you implement a language. PyPy managed to implement a strategy to support that, but it's tricky to get right and maybe not the most efficient way.

So the solution would be to simply avoid borrowed references, because they cause a lot of trouble for many people. Maybe we can take this API and make it better. For example, we can try to hide structure fields, which means making structures opaque: code which accesses a field of a PyObject directly must fail with a compilation error. This is the thing I would like to prevent, because if you give C extensions the ability to inspect the internals of Python, it means we cannot modify structures like PyObject or PyListObject, for example to implement a specialization strategy. Somehow we have to make sure that you never access any structure directly. Another change would be to remove all functions using borrowed references, or even worse, functions stealing references, which is another beast. And we should try to replace all macros with function calls.

Maybe we will have some trouble with this, because we already tried once to make significant changes between two Python versions, Python 2 and Python 3, and it didn't work very well. The issue with modifying the C API is that people are actually using the C API, and there are many, many extensions on PyPI. In my experience, every time we make a change in Python, we break an unknown number of projects. So if we modify the C API, we will likely break many projects. Maybe we can try to fix most of these projects, but it will take a lot of time. We have to do it anyway, but I'm not sure it's the best strategy for the short term, because maybe we cannot even fix all the issues of the C API, and it would just take too long.

So maybe there is another way. The good news is that the PyPy developers have already been thinking about these issues for two years, and they came up with the idea of a PyHandle API. It would be a brand new C API, and the cool thing is that it would be correct from day one: you avoid the borrowed reference issues, you avoid all the past mistakes of the C API, and from the beginning you start with a very clean and well-designed API. In short, the idea is that instead of exposing every Python object as a pointer, you use an opaque handle, similar to a UNIX file descriptor or to a Windows HANDLE. We would have something like open() to create a handle; you cannot inspect what is inside the handle, you only have operations on it, like read or write for a file. You can duplicate a handle, and when you are done with a handle, you just close it.
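To illustrate the idea, here is a toy model in Python, not the real proposed C API: handles are opaque integers, two handles to the same object have independent lifetimes, and nothing about the object's memory layout leaks to the caller.

    class HandleTable:
        """Toy sketch of a handle-based API, in the spirit of PyHandle."""

        def __init__(self):
            self._objects = {}
            self._next = 1

        def open(self, obj):
            # Hand out an opaque integer: the caller never sees the
            # object's layout, so the VM stays free to move or
            # specialize the object behind it.
            h = self._next
            self._next += 1
            self._objects[h] = obj
            return h

        def dup(self, h):
            # A second handle to the same object, with its own lifetime:
            # this maps naturally onto a tracing garbage collector.
            return self.open(self._objects[h])

        def close(self, h):
            # Explicit close: the lifetime of every reference is
            # tracked, unlike with borrowed references.
            del self._objects[h]

    table = HandleTable()
    h1 = table.open([1, 2, 3])
    h2 = table.dup(h1)
    table.close(h1)  # h2 is still valid
    table.close(h2)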
The nice thing with this idea is that it is doable. For CPython, PyHandle could be written on top of the current C API, so we can start it right now as a third-party project. For PyPy, it would be more efficient, because it doesn't expose the internals of PyObject, and the idea of handles, with the ability to duplicate a handle, plays very well with a tracing GC, which is the garbage collector used by PyPy. And the other very good news is that if you are already using Cython, you would not have to modify your code, because we can modify Cython to directly generate code using PyHandle.

Now I would like to talk about something else. We saw the C API, but there is something else in the C API, which is reference counting. To explain the issue: Larry Hastings created a project called the Gilectomy three or four years ago, and the idea was to remove the GIL. In practice, it means replacing a single lock with smaller locks, one lock per mutable object. Replacing the lock is doable, and he managed to do that. But the goal of the project was to make Python faster, and he hit a performance bottleneck in reference counting, which is the way CPython tracks the lifetime of objects. Each time you access an object in Python, you have to increment its reference count to say that you are using the object, and when you are done, you decrement the reference count. So each time you access an object, you have to modify an integer. But if there are many threads accessing the reference counts, you have to make sure that the reference counting stays consistent: two threads must not modify the integer at the same time. One strategy, the current one, is to use a single giant lock; another is to have one lock per object, for each reference count.

Larry tried different strategies. For example, he used the CPU's atomic operations. The idea of these CPU instructions is that you put a lock on a CPU cache line to make sure that only one CPU accesses that cache line at a time. This is fine when you access objects spread around the address space of the process, but if you access objects which are close together in memory, you may quickly run into a performance bottleneck, because the threads lock the same cache line and all the CPUs stall, waiting for the lock. Another strategy Larry tried was to not write into memory directly, but to record the operations in a log of incref and decref operations, like a queue. The nice thing is that when an incref and a decref cancel out as you read the log at the end, you don't have to touch the reference counter at all, because an incref plus a decref is a no-op. But again, the complexity of the solution and its runtime cost meant it was not really efficient. I don't want to go too deep into the details; you just have to remember that reference counting doesn't scale well with the number of threads. It's a very good solution when you have a single thread, but if you would like to spread an application over multiple threads, it's not the best option.
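You can watch reference counting at work from Python itself, with a small sketch: every one of these count updates is a memory write, and with free threading each write would need an atomic instruction or a lock, which is exactly the bottleneck the Gilectomy hit.

    import sys

    x = object()
    # getrefcount() reports one extra reference for its own argument.
    print(sys.getrefcount(x))  # 2: the variable x, plus the argument
    y = x                      # a new reference: an incref
    print(sys.getrefcount(x))  # 3
    del y                      # reference gone: a decref
    print(sys.getrefcount(x))  # 2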
So one option that we have is to get rid of reference counting inside CPython, which means replacing the current garbage collector with a tracing garbage collector, because this is a well-tested strategy: PyPy is using it with success, and many modern language implementations are also using a tracing GC. The nice thing is that we could use it inside CPython, while outside, at the C API boundary, we could continue to expose reference counting. That is not an issue for performance, because what matters most is really the internals of CPython.

Another project to make Python faster is sub-interpreters. This is an idea of Eric Snow: to add support for multiple interpreters in the standard library. In fact, we already support something called sub-interpreters, but the API is currently only available in C; you have to use the C API for that. The PEP (PEP 554) wants to make it accessible from Python. And the main motivation is to have not a single global interpreter lock, but one lock per interpreter, which means being able to run the interpreters in parallel. This is a work in progress, because we have to refactor a lot of code to make it possible, and the internals of CPython are sometimes very tricky to modify.

To explain the idea of sub-interpreters, I would like to come back to the issue of the GIL. If you have a workload which is CPU-bound, you are only able to use one CPU, because the design of CPython is to put all threads into a single object called the interpreter, and to keep objects consistent, CPython puts a GIL on top of that. It's an easy solution, but not very efficient, because we cannot use all the CPUs. The idea of sub-interpreters is that you keep your multiple threads, but you put each thread in a different interpreter, and each interpreter is able to run in parallel. Thanks to that, again, we can reach 100% efficiency.

My expectation from this design, if you compare sub-interpreters to multiprocessing, is that because everything lives in the same process, we can share more memory between the different threads, and so I expect a lower memory footprint. When you have a very large application spawning 10 or 20 processes or even more, slowly you reach the limits of your hardware, or you may have to pay for a more expensive flavor of virtual machine in the cloud; a lower memory footprint means you use less memory, and so pay less for the cloud, for example. I also expect that we will get faster locks, because, again, we are in the same process, so it's easier to do things, and I hope that we can do some locking without any system call. But one of the main limitations of sub-interpreters is that you cannot exchange objects directly between two interpreters, because to get full speed in a single interpreter, you must not share anything between two interpreters. Each interpreter should be isolated from the others, because as soon as you have a common resource, you need some kind of locking for consistency. But obviously, we can imagine solutions, like shared memory, to pass data with a very low overhead.
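CPython 3.8 ships the experiment behind this PEP as a private module, so the following is only a sketch of the direction, not a stable API; and note that today the interpreters still share a single GIL, since the per-interpreter lock is exactly the work in progress.

    # _xxsubinterpreters is a private, experimental module in CPython 3.8;
    # PEP 554 proposes a public "interpreters" module with a similar shape.
    import _xxsubinterpreters as interpreters

    interp = interpreters.create()  # a second interpreter in this process
    interpreters.run_string(interp, "print('hello from a subinterpreter')")
    interpreters.destroy(interp)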
So, to summarize the future of Python: in my opinion, the current C API has design issues, and it will be tricky to fix them. There is the idea of creating a new API called PyHandle, there is the idea of replacing reference counting with a tracing garbage collector in CPython, and there is the idea of sub-interpreters in Python. To conclude: many previous optimization projects failed. Right now, you have Cython, multiprocessing and Numba, which work very well to make your code faster. And there are very promising projects: the PyHandle API, the tracing garbage collector, and sub-interpreters. If you would like to know more about all these topics, here is my list of links for more information. And I'm not sure if I still have time for questions. So thank you.

(We cannot hear you.) Thank you very much, Victor, for this amazing keynote. We will have a few minutes for some questions, and I kindly ask the audience to come to the microphones here and there for your questions.

Yeah. Thank you very much for the very interesting talk. Seeing all those changes, do you see, like, in the far future, merging CPython and PyPy at some point? Because now you have a lot of changes that go in that direction, like what PyPy is doing with garbage collection and so on.

I'm not sure that I understood correctly. You are asking whether these changes benefit PyPy the most?

Whether you would see, in the far future, a promise to merge CPython and PyPy and make it one project, eventually?

I think that one easy solution would be to remove CPython and just use PyPy, because PyPy is way more efficient. But right now, I'm not sure that we can do that, because of the few issues of PyPy that I listed, and I think that the most common issue is the C extensions, which are slower; depending on your workload, sometimes they can really be the bottleneck. So right now, we have to keep both. But maybe in the future, we can at least converge towards the same solution, because we are discussing, between CPython and PyPy developers, and we all agree that we have to move to the same solutions, especially for the PyHandle API.

Thank you. We have one more question over there.

Thank you for the very interesting talk. I have a question on a very similar topic. You were kind of wondering why PyPy wasn't adopted more. Don't you think that's mostly due to the fact that it just isn't the reference implementation, so people default to CPython?

I'm not sure that I understood your question correctly, sorry.

Isn't the reason that PyPy isn't adopted more simply that CPython is the reference implementation, so that when people have to choose and don't, at the moment, have any performance problems, they choose CPython by default?

So what I understood is: why don't people pick PyPy instead of CPython?

Yes.

And what I understood is that if PyPy were just way better, people would just move to PyPy.

No, I'm not talking about better, just the fact that CPython is the reference implementation by declaration.

As I said, one of the reasons is that new features land first in CPython. People sometimes want the latest version of Python to get the latest features, but PyPy has a smaller team, so they cannot implement them as fast as CPython. So it's not an easy issue. I think we can maybe continue the discussion in the hallway, because we are out of time.