So, welcome back everyone. We have our next keynote talk by Victor Stinner. Victor has been a Python developer for nine years and is currently working at Red Hat. He has been contributing to open source software since 2002 and was nominated for the Google Open Source Peer Bonus. One of the biggest projects he has contributed to, OpenStack, was nominated for an award in April 2016. Victor is also the author of the pyperf, faulthandler, and tracemalloc modules. Over to you, Victor, the stage is yours.

So, hi everybody. My name is Victor Stinner. I'm working for Red Hat and I'm here to talk about Python performance: the past, the present, and the future. I'm going to turn off my video because I have some upload issues.

There are multiple implementations of the Python language. There is CPython, which is the most famous because it was the first one, and it is also the reference implementation nowadays. This one was created by Guido van Rossum 30 years ago. After this one, Jim Hugunin created JPython, later renamed to Jython. This one is written in the Java language, whereas CPython is written in C. After this one, Stackless Python was created by Christian Tismer for a different purpose: the idea is to run different coroutines, not in parallel but concurrently. It's concurrent programming, very similar to asyncio but using a different implementation. There is also IronPython, created by Jim Hugunin as well. This one is written in C# using the Microsoft CLR, so it first targeted Windows. And more recently there is a new implementation called MicroPython, by Damien George. This one is targeting embedded devices with very little disk space and very little memory, which means less than 1 megabyte of memory. This project is very popular, and I think it was the one running in the lightning talk just before.

So, there have been different projects to optimize Python since the creation of CPython. The first one was created by Armin Rigo and is called Psyco. The idea was to annotate one specific function and to optimize it once it has been executed multiple times: after a specific number of executions, the function is optimized by a JIT compiler. But Armin understood that this design is not efficient enough to make a Python application, say, two times faster. So he created a new project called PyPy, and this one is very different because it's a new implementation. There was also a project started by two employees of Google called Unladen Swallow. The goal of the project was to make Python applications run four or five times faster, and they decided to use LLVM. Another project was started by one or two employees of Dropbox and is called Pyston. And there is also a project started by two employees of Microsoft called Pyjion. So you can see that there have been many attempts to optimize Python. And I would like to point out that many of these projects, except PyPy, have an end date, and the end date means that the project's development has stopped for some reason. We will see why they stopped.

So, there are two main approaches to optimize Python. The first one is to take the existing CPython, fork the project, and attempt to implement your optimizations. The other approach is to write a new implementation from scratch. For the first approach, forking CPython, there are Unladen Swallow, Pyston and Pyjion which took this route. But one of the drawbacks is that the performance will be limited by the old CPython design. And again, CPython was created 30 years ago.
For example, there is a specific memory allocator, there is a specific way to store objects in memory, the C structures, and there is reference counting to manage object lifetimes, which implies a specific garbage collector and a specific way to detect when an object should be destroyed. And one infamous limitation of CPython is the global interpreter lock, known as the GIL. Because of the GIL, CPython cannot run more than one thread at the same time.

The second approach to optimize Python is to write a new implementation from scratch. For example, PyPy, Jython and IronPython took this approach. The great thing with this approach is that, for example, Jython and IronPython have no GIL at all, which means that you can really run many threads in parallel, and these threads run on multiple CPUs at the same time. It's not concurrent programming, but real parallelism. Another nice thing is that you can implement Python in a very different way. For example, PyPy uses an efficient tracing garbage collector; it doesn't use reference counting. The nice thing with a tracing garbage collector is that objects can move in memory, so you can compact the heap to reduce the memory footprint when you deallocate many objects. Reference counting also has a drawback: when you run two Python threads in parallel, or even more, the reference counting can become a bottleneck, because two different threads are accessing the same memory. And each time you have a potential race condition, you have to protect the memory with atomic operations or locks. So reference counting is not efficient for parallelism.

But this approach has one main drawback: CPython is very famous for its C extensions, like NumPy, for example. And these implementations written from scratch have no support for C extensions, or only limited support. Even when they do support them, they are usually slower than CPython, because the C API is not native: they have to emulate it. In the case of PyPy, for example, there is a module called cpyext which creates a CPython PyObject on demand, and this object has to be synchronized with the PyPy object. Imagine that you have a list which uses a very efficient storage in PyPy; it is efficient for the JIT compiler and it is efficient for reducing the memory footprint. But the first time you use the C API on this list, PyPy has to convert the whole list to CPython PyObjects, which use more memory and are less efficient for the PyPy JIT compiler. And the conversion from the PyPy layout to the CPython layout requires copying memory but also converting the values, so it's not efficient.

Another issue with the other implementations of Python is that a different implementation is in competition with CPython. The CPython project has around 30 active core developers to maintain it, and a bunch of core developers are even paid to do that, which means that they work for companies who allow them to spend, for example, one day per week maintaining Python. In my case, I'm paid by Red Hat to maintain Python upstream, but also downstream in Fedora and RHEL. And the new features coming in Python 3.8, Python 3.9 and so on, which you can see in the What's New documents (in What's New in Python 3.9, for example, there is a long list of new features), are all first implemented in CPython, which means that PyPy, for example, has to reimplement all these features, but with a smaller team.
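Coming back to reference counting for a moment: you can actually watch the reference counter from Python. A minimal sketch, assuming CPython (sys.getrefcount is implementation-specific and the exact numbers may vary):

```python
import sys

x = []                     # a new list; one reference held by the name "x"
print(sys.getrefcount(x))  # e.g. 2: "x" plus the temporary argument reference

y = x                      # a second name now references the same object
print(sys.getrefcount(x))  # one higher than before

del y                      # dropping a reference decrements the counter
print(sys.getrefcount(x))  # back to the previous value

# When the counter reaches zero, CPython destroys the object immediately.
# Every increment and decrement is a memory write, which is why reference
# counting becomes a bottleneck when several threads share the same objects.
```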
So what you can see is that PyPy is always a little bit late in terms of Python version and Python features compared to CPython. I think that today they support Python 3.6, and there is a beta version of PyPy supporting Python 3.7. On the other hand, CPython is going to release version 3.9 in a few days, like next Monday, I hope. So the question is now: why would users prefer an implementation that lags behind and has compatibility limitations, and who is going to sponsor the development of a different implementation of Python?

In the case of the Unladen Swallow project, created nine years ago by Google employees, the nice thing is that they wrote a report explaining why they stopped the development of the project, and I identified three main reasons. The first one is that most Python code running at Google isn't performance critical, which means that the performance-critical code was written in a different language and Python is not really the bottleneck, so there is little benefit in making Python faster. The second is that the deployment of Python with Unladen Swallow was too difficult: being a drop-in replacement was not enough to make it more popular at Google. And third, their potential customers eventually found other ways of solving their performance problems. This is something we see often in Python: Python is not the most efficient programming language, but there are many ways to work around this limitation. In the end, the Python interpreter itself, like CPython, is not the first target when you would like to optimize something.

In the case of Pyston, the project created by Dropbox and stopped three years ago, there was also a report explaining why the development stopped, and I identified two main reasons. The first one is that Dropbox has increasingly been writing its performance-sensitive code in other languages, such as Go. The other reason is that they spent much more time than expected on compatibility. This is also an issue that I see often in the other implementations of Python: there are many ideas to optimize Python, to make it more efficient, but when you optimize something, it's not uncommon to change the behavior in a subtle way. And the issue is that large projects like Django really rely on the assumption that Python behaves exactly like CPython. So when you optimize something, you have to behave exactly like CPython. This is something the PyPy developers paid a lot of attention to: they really mimic the exact behavior of CPython, and they spend significant time being exactly compatible with it.

So the summary of the past section is that CPython remains the reference implementation, but it shows its age. There have been multiple optimization projects, but almost all of them failed. The remaining one, PyPy, is a drop-in replacement for CPython, and it's around four times faster, but it's not widely adopted yet. The question is: why?

So let's move on to the present section, the present of Python performance. To optimize your code, the first thing to consider is that you have to identify the bottleneck of your application. Let's say that you identified your bottleneck; the question is now, how can you optimize this bottleneck, how can you make your code run faster? The first thing to do is to just try PyPy, because PyPy just works: PyPy is a drop-in replacement for CPython, and it's around four times faster than CPython on average.
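By the way, for the first step, identifying the bottleneck, the standard library already has what you need. Here is a minimal profiling sketch using cProfile and pstats (slow_function is just a placeholder workload):

```python
import cProfile
import pstats

def slow_function():
    # placeholder workload: sum of squares
    return sum(i * i for i in range(10**6))

def main():
    for _ in range(5):
        slow_function()

# Profile main() and print the functions sorted by cumulative time;
# the entries at the top are the candidates worth optimizing.
cProfile.run("main()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)
```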
Coming back to PyPy: I would like to add that the exact speedup really depends on your workload, on which code is running and how your code is designed. But the great thing is that PyPy is fully compatible with CPython. And what I heard from the PyPy team is that sometimes a small part of your application runs slower than on CPython; in this case, you can ask the PyPy developers for help, and they can explain how to make your code more efficient on PyPy.

But there are some issues with PyPy. The main one is that the support for C extensions, using the cpyext module, is almost as efficient as on CPython, but sometimes it's slower. The good news is that two years ago they organized a sprint and managed to heavily optimize cpyext; but again, it's still slower than CPython. I would like to discuss the C API later to explain this issue. Another issue with PyPy is that, because of the JIT compiler, the memory footprint of your application is larger with PyPy than with CPython. It can be an issue if you would like to spawn many processes on your server and your server has a limited amount of memory. And the last issue, which I would say is a minor one, is that with PyPy the startup time, the time just to start your application, is usually longer, again because of the JIT compiler. If your application runs for hours or for days, you will not notice it, and it's just fine. But there are many usages of Python, and some people use Python for command-line programs which run for a few seconds; in that case, the startup time can be a bottleneck.

Another common way to optimize Python is that, once you have identified the bottleneck of your application, it's common to find only a few files, or a few classes or functions. So you can start with specific functions and try to rewrite them in the C language or in Rust, by writing a C extension or a Rust extension. By doing that, you can write far more efficient code, because in C you can make more assumptions and you can use more efficient ways to implement the same feature. Today, Rust is becoming more and more popular, and there are two ways to write a Rust extension: the first one is rust-cpython, the other one is PyO3. I never tried them, so I cannot say which one is the best; just try them out and form your own opinion. The nice property of Rust is that the Rust compiler provides a guarantee: if you write your code properly using the Rust memory model, the compiler can tell you in advance that your program will not have memory errors like buffer overflows. There are also ways to ensure that there is no race condition, and this is really a great feature of the Rust language. But you need rust-cpython or PyO3 for the glue between Python and Rust. For example, the Mercurial project, an SCM (source control management) tool similar to Git but written in Python, has some functions which use the CPU heavily, which are CPU bound. In that case, it's very interesting to rewrite some parts in Rust, so there is an ongoing effort to rewrite the performance bottlenecks of Mercurial in Rust, and so far the project is quite successful.

So let me come back to the infamous global interpreter lock, the GIL. In CPython, there is a lock which prevents you from running many threads in parallel, but its impact depends on your workload.
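Here is a small experiment you can run to see the effect of the GIL yourself; this is only a sketch, and the exact timings depend on your machine:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n=5_000_000):
    # pure-Python loop: the thread holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

start = time.perf_counter()
for _ in range(4):
    cpu_bound()
print(f"sequential: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda _: cpu_bound(), range(4)))
print(f"4 threads:  {time.perf_counter() - start:.2f}s")

# On CPython, both timings come out roughly the same: the GIL serializes
# the threads, so you get concurrency but no parallelism for CPU-bound code.
```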
In the case of mathematical functions, and I would say in general for pure Python code, the workload is often what we call CPU bound. CPU bound means that the performance of your application is not limited by the input and output, the I/O, but only by the speed of your CPU. For a CPU-bound workload, CPython is not efficient, because you can only run a single thread at a time. Even if your machine has three CPUs and you wrote your code using threads to run them in parallel, in practice, with CPython, there is no parallelism, only concurrency: one thread runs after the other. So if you imagine a machine with three threads and three CPUs, the efficiency is only one third.

But the GIL is not a bottleneck for every workload. If your application has threads and the threads are more I/O bound, for example one thread is reading a file, another thread is computing the SHA-1 checksum of the file contents, and a third thread is compressing this data using bzip2, then you are not limited by the CPU, you are limited more by the I/O. In this case, the threads can release the GIL, and if the GIL is released, you really get parallelism and all threads execute in parallel. So on a machine with three threads and three CPUs, in that case, you get an efficiency of 100%.

If your workload really is CPU bound, there is one simple solution, which is the multiprocessing module. With the multiprocessing module, you can have one thread per process and run many, many processes in parallel. Thanks to that, the operating system is able to execute the processes in parallel and really use the power of all the CPUs of your machine. The multiprocessing module makes that easy, so you don't have to manage the processes yourself: multiprocessing takes care of spawning the processes, sending the data, retrieving the results, and stopping the processes when the workload is done. Thanks to the multiprocessing module, for a CPU-bound workload, you can again reach an efficiency of 100%. And this module has been in Python for many years, so it's a ready-to-use solution. The multiprocessing module works around the GIL limitation, the global interpreter lock.

And the great news is that in Python 3.8, released one year ago, you get shared memory. Shared memory means that you don't have to serialize and copy the memory between the different workers: you can just use a chunk of memory which is accessible by all the workers, and thanks to that, you don't have to copy anymore. So it's far more efficient for very large amounts of memory, like a very large NumPy matrix. The second piece of great news is that, again in Python 3.8, there is a new pickle protocol, version 5, described in PEP 574. This new version avoids copying very large objects: the idea is that you can decide how you send a large amount of memory. For example, you can delegate the serialization to shared memory, so you don't have to copy the memory; you can write your own code to decide how to send the data.
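Putting those pieces together, here is a minimal sketch of the process-based approach, including the Python 3.8 shared memory API (the worker function and the sizes are just placeholders):

```python
import time
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

def cpu_bound(n):
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":  # required by the spawn start method
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(cpu_bound, [5_000_000] * 4))
    print(f"4 processes: {time.perf_counter() - start:.2f}s")
    # Each worker is a separate process with its own GIL, so the four
    # calls really run in parallel on four CPUs.

    # Python 3.8+: share a block of memory between processes without copying.
    shm = shared_memory.SharedMemory(create=True, size=1024)
    shm.buf[:5] = b"hello"  # any process attaching by name sees this data
    shm.close()
    shm.unlink()            # free the block once nobody needs it anymore
```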
Previously, I said that one way to optimize a function is to rewrite it in the C language as a C extension. But if you use the C API of CPython directly, it can be very tedious, because you have to manage the memory yourself, you have to manage the reference counting, you have to manage exceptions and check each function call for failure. And the C language takes more lines to do the same thing than Python.

So the idea of the Cython project is that you write code that looks like Python, the syntax of your code is very similar to Python, but you add a few annotations. And thanks to these annotations, Cython is able to produce far more efficient code than what you would get by writing it in Python, or even in C yourself. The great thing with Cython, compared to using the raw C API, is that you can easily support multiple Python versions, and also get better support for PyPy and other Python implementations, using the same code base. So you don't have to manage the very subtle differences in the C API between the different Python versions. Another nice property of Cython is that you don't have to handle the reference counting manually: Cython does that for you. It also handles exceptions and many other tedious things, so you don't have to write all the boilerplate code. And the last great property of Cython is that its optimizer emits efficient code using the CPython internals for you, so you don't have to know how CPython is implemented; Cython does that for you. Thanks to its knowledge of the CPython internals, Cython is able to produce far more efficient code than the code you would write yourself.

Another project to optimize your code is Numba. Numba is different: it is more specialized for scientific applications, for example code using NumPy. Numba is a JIT compiler translating a subset of Python and NumPy into fast machine code. And there are several ways it can execute the same code faster. There is simplified threading, which is a way to run threads that release the GIL; if you recall what I explained previously, if you release the GIL you are able to use all the CPUs of your machine and you get parallelism. Numba is also able to emit single instruction, multiple data (SIMD) vectorization, which is a way to run the same code far more efficiently; there are many CPU extensions for that, like AVX and AVX-512. And the last way to optimize code is the famous GPU acceleration, like NVIDIA CUDA; Numba also supports AMD ROCm, which is another way to run your code on the GPU. All these solutions are far more efficient than the code you would usually run using NumPy, for example. And the very nice thing with Numba is that you don't have to rewrite all your code from scratch: what you have to do is just annotate a few functions using a small decorator, and that's it.
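As an illustration of how little code this takes, here is a tiny Numba sketch, assuming Numba and NumPy are installed; the function itself is just a toy example:

```python
import numpy as np
from numba import njit

@njit  # JIT-compile this function to machine code on its first call
def sum_of_squares(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * a[i]
    return total

x = np.random.rand(1_000_000)
print(sum_of_squares(x))  # first call compiles, later calls run at C-like speed
```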
To come back to CPython: there is a website called speed.python.org which tracks the performance of Python over time. The really nice tab is the timeline, where you can see the performance over the last five years. One example is the telco benchmark, which is a benchmark of the decimal module; it computes sums of numbers for a well-known benchmark scenario called telco. And there you can see that over the last five years the performance of Python has become much better, which is a great thing: we don't regress, we become more efficient. I spent a lot of time making these benchmarks more stable, because previously we got a lot of noise in the results, a lot of spikes, faster or slower. So what I did was write a new module called pyperf, which runs benchmarks differently to get more stable and more reproducible results. After that, I modified the CPython benchmark suite, called pyperformance, to use pyperf. Thanks to that, the results are now much more stable, so it's easier to use them to make decisions.

In summary, for the present section: PyPy doesn't require any code change, so just test PyPy on your code and see whether it's faster. If you identify the bottleneck of your application, you can try to rewrite a few specific functions as C or Rust extensions. Multiprocessing scales, it really enables parallelism on many CPUs, and it is easy to use. To write extensions, use Cython; please don't use the C API directly. And Numba makes NumPy faster.

OK, we saw a lot of existing solutions to make your Python code faster, and now I would like to show you a few experimental projects that attempt to optimize Python even more. I spent a lot of time analyzing why CPython is slow and why the optimization projects failed in the past, and I think I identified one big reason, which is the C API of CPython. The C API evolved organically, which means that internal functions were exposed by mistake, and this C API was first written to be consumed by CPython itself. There was no overall design; the design was just to expose everything and not think about whether that's a good idea or not. Thirty years ago, that was simple and good enough. But what we have today is a C API that exposes far too many implementation details, and because of that, it's much more difficult to optimize Python.

So in Python 3.8 I made many changes to the C API, and I'm still working on that in Python 3.9 and the next Python 3.10. One big change in Python 3.8 is that instead of having a single API in a single place, where private functions, internal functions and public functions live together, I split the API into three parts. The main Include directory is the public, stable C API. The cpython subdirectory is the C API specific to CPython. And the internal subdirectory is the internal C API: this API should only be used by CPython itself, but I decided to expose it anyway for very specific use cases such as debuggers or profilers, which have to inspect Python objects without executing Python code, because a debugger should not modify the state of an application but only inspect its current state. Many private functions, prefixed with _Py, and the PyInterpreterState structure were moved to the internal C API in Python 3.8.

Another interesting thing in Python is the stable ABI. The idea of the stable ABI is to support multiple Python versions, for example Python 3.8, 3.9 and later Python 3.10, using a single binary. So the idea is that you build your C extension once and you can use the same binary on multiple Python versions. For a Linux vendor, for example, instead of having one package for Python 3.8, another package for Python 3.9 and so on, which becomes annoying when you would like to support many versions, thanks to the stable ABI you are able to ship a single package, a single binary, and support a large number of Python versions.
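Here is a sketch of what opting in to the stable ABI can look like in a setup.py; the project name and source file are placeholders, and the exact setuptools options may vary with your toolchain:

```python
# setup.py: build one binary that works on many Python 3.x versions
from setuptools import setup, Extension

setup(
    name="myext",  # placeholder project name
    ext_modules=[
        Extension(
            "myext",
            sources=["myext.c"],
            # restrict the extension to the stable ABI of Python 3.8+
            define_macros=[("Py_LIMITED_API", "0x03080000")],
            py_limited_api=True,  # tag the built wheel as abi3
        )
    ],
)
```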
What also changed in Python 3.8 is that the debug build is now ABI compatible with the release build, which is really interesting for debugging purposes. When you get a crash in Python, it's very likely that the crash doesn't come from the Python code but from a C extension, and if you use the debug build, you get many more sanity checks running at runtime to detect bugs, and thanks to these checks it's more likely that you get useful information about the crash. Thanks to this change, you can use your existing C extensions, compiled in release mode, with a Python compiled in debug mode. Previously, with Python 3.7, you had to recompile all your C extensions, and that can be very tricky if you have many requirements and many dependencies, like header files, a compiler, or whatever else is needed to build your C extensions.

To come back to the C API and why exposing implementation details is a problem, I would like to take the example of specialized lists. In CPython, a list is basically an array of pointers, an array of pointers to PyObject, which is the real content of each object. But PyPy has strategies for specialized lists: if you consider a list which only contains integers, PyPy is able to store this list as an array of integers, so there is no indirection from the list to an object stored somewhere else in memory; you get the value of the item directly in the list. Thanks to that you get a lower memory footprint, but you also get better performance, because there is no indirection, so the CPU doesn't have to fetch the numbers from somewhere else.

The question is now: can this optimization be implemented in CPython? Can we modify the PyListObject structure of CPython? The short answer is no, we cannot. The first problem is that to access an item of a list, there is a macro in the C API called PyList_GET_ITEM, and this macro directly accesses the ob_item member of the PyListObject, and each item is a pointer to a PyObject. C extensions must not access the PyListObject members directly, because doing so exposes the exact memory layout, and then you cannot change the layout anymore: if machine code reads memory directly at a specific offset and you change the memory layout, instead of getting a PyObject* it gets a raw number or something else, and your program is simply going to crash. The PyList_GET_ITEM macro could be modified to convert a number of a specialized list into a PyObject*, but there is a second issue with that, and this issue is called borrowed references. Let's come back to the PyList_GET_ITEM macro and imagine that you implemented specialized lists in CPython, so a list is made only of numbers, and when you get an item of the list, the macro magically creates an object on demand. If you do that, you get a borrowed reference, which means that the list doesn't know whether the caller is still using the object returned by the macro or whether the object can be destroyed. That is what a borrowed reference is: you must not use Py_DECREF to decrement the reference counter of such an object. It means that if the macro creates an object, the list doesn't know when this temporary object can be destroyed. This is an issue for correctness, but also an issue for performance, because you can keep an object alive longer than it should be; we don't know when the object can be destroyed. And sadly, many C functions of the C API return borrowed references.
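You can get a feel for why the specialized-list idea pays off from pure Python, by comparing a regular list of integers with the unboxed array type from the standard array module; the numbers below are CPython-specific and only indicative:

```python
import sys
from array import array

n = 1000
boxed = list(range(n))          # a list stores pointers to int objects
unboxed = array("q", range(n))  # a contiguous C array of 64-bit integers

# The list itself only holds pointers; each int is a separate heap object.
# (Small ints are shared singletons in CPython, so this is an upper bound.)
list_total = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)
array_total = sys.getsizeof(unboxed)

print(f"list of {n} ints:  ~{list_total} bytes (pointers + boxed objects)")
print(f"array of {n} ints: ~{array_total} bytes (raw 64-bit values)")
```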
So, can we do a better C API? One thing to do is to make the structures opaque, which means preventing extensions from accessing the members of a structure directly. Instead of reading the structure directly, you have to use a function call, because the function call hides how the members are stored in memory. For example, when you create an object in Python using PyLong_FromLong, you get an object, and currently you can directly access the ob_type member of the object to get its type. This is an issue: we should not be able to do that; the way the type of an object is retrieved should be hidden behind, for example, the Py_TYPE() macro. The second thing to do is to remove the functions returning borrowed references, and the functions which steal references, which is another issue similar to borrowed references. And as I said, we have to replace macros with function calls, so that when you build your C extension, the extension only calls functions, and these functions do the work and hide the implementation details. But there is one consequence of doing that which people don't like: it breaks backward compatibility, and any C API change can break an unknown number of projects. Maybe not all C API design issues can be fixed: we can try to fix a bunch of C extensions, for example the most popular C extensions on PyPI, but fixing all the C extensions on PyPI could take a very long time, or even an infinite time. So maybe there is another way to fix this issue.

There is another project called HPy, which is a new C API. The idea of HPy is to design the C API correctly from day one. Correct means that you hide the implementation details and you design the C API so that it can run in the most efficient way on CPython, but especially on PyPy, because PyPy has different constraints and a different memory layout; if you use HPy, the code runs far more efficiently on PyPy. The base design of HPy is that there is no PyObject* anymore, no more pointers to PyObject: the API uses HPy handles, and a handle is an opaque integer, something that you cannot inspect. You can think of it like a Unix file descriptor: when you open a file on Unix, you get a number, you pass this number to functions, and the number is a private reference to an internal object in the Linux kernel. Or you can see it as a Windows HANDLE: on Windows, when you open a file, when you create a lock, when you do many operations, you get an opaque handle, and you only call functions on this handle. The base operations are open, which gives you a new handle, duplicate, which copies a handle, and close, which closes the handle.

Thanks to this new API based on handles, you get better performance. And the nice thing is that you don't have to rewrite CPython from scratch: you can take the existing C API and implement HPy on top of the existing C API of CPython, and this has been done. There is also a second implementation, written especially for PyPy, which makes assumptions about the PyPy internals, and thanks to that, code written with HPy can run more efficiently than the same code using the existing C API of CPython. I think it was last year that they managed to rewrite the ujson parser using HPy, and the modified ujson parser runs something like four times faster on PyPy than the implementation using the existing C API. Four times faster is really interesting: it's a proof that the design makes sense and that there is another way to solve this issue.
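To make the file descriptor analogy from a moment ago concrete, here is the same pattern with Unix file descriptors (Unix only; the file path is just an example):

```python
import os

# Opening a file returns an opaque integer, not a pointer into the kernel.
fd = os.open("/etc/hostname", os.O_RDONLY)
print(fd)                # e.g. 3: meaningless except as a key for system calls

fd2 = os.dup(fd)         # "duplicate": a second handle to the same open file
os.close(fd)             # closing one handle does not invalidate the other
print(os.read(fd2, 16))  # the duplicated handle still works
os.close(fd2)
```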
And the great thing is that if your C extension is written using Cython, there is a way to avoid modifying your code, because Cython can emit the new API for you: there is ongoing work to make Cython generate HPy code directly, so you don't have to modify your Cython extension. So again, please don't use the C API directly, just use Cython.

OK, I would like to come back to an issue of CPython, which is reference counting. The problem with reference counting is that it is tied to the global interpreter lock: if you want to remove the global interpreter lock, the reference counting becomes an issue. There is a project called the Gilectomy, started by Larry Hastings, which tried to remove the GIL from CPython. The idea is to replace the unique GIL with one lock per mutable object; mutable means that you can modify the object in place, like a list, a dictionary or a set object. And one way to implement the reference counting in that world is to use atomic increment and decrement CPU instructions; atomic means that if two threads run the instruction in parallel, the CPU ensures there is no race condition. To implement the Gilectomy, Larry tried to use such atomic increment and decrement operations, and by doing that it is possible to run code in multiple threads. But the implementation is quite complicated, and sadly, on benchmarks, this implementation didn't scale well: instead of running code faster than CPython, in the end it ran slower than CPython, which is the opposite of the goal of the whole project. So I would say that the real issue with the GIL comes from reference counting: reference counting doesn't scale with the number of threads.

What we should try to do in CPython is to replace reference counting with a tracing garbage collector, because many modern language implementations use a tracing garbage collector, like PyPy, for example. The idea is to be able to move objects in memory and to track where the objects are stored, and thanks to that, the code is able to run faster in parallel. For the existing C API, even if we move CPython to a tracing garbage collector, the C API would continue to present reference counting to external code, which would hide how the objects are actually stored internally.

And to finish, there is one cool project called sub-interpreters. There is a PEP written by Eric Snow, PEP 554, to expose sub-interpreters in the standard library. The idea is to replace the single global interpreter lock with one lock per interpreter, which means that you can run multiple instances of Python in parallel, and by doing that, each interpreter gets the full speed of Python. So it becomes possible to run separate interpreters at full speed and really scale with the number of CPUs. There is work-in-progress refactoring of CPython for this, but it's a long-term project, because it takes a lot of time to modify all the existing code without breaking backward compatibility. I think it was last May that I showed that, on a machine with four physical CPU cores, it is possible to run code up to two or maybe four times faster using a modified implementation of CPython with sub-interpreters running in parallel, meaning one sub-interpreter per CPU. I had an experimental implementation of that, and I showed that you get a similar speed, or the same speed, as multiprocessing, but inside a single process. And the nice thing is that a single process is easier to manage, because managing a bunch of processes can cause other issues, like a larger memory footprint.
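If you want to play with sub-interpreters today, recent CPython versions ship a private, experimental binding for PEP 554. A sketch, with the caveat that the module name and API are provisional, and that in an unmodified CPython the sub-interpreters still share a single GIL, so the parallelism described above needs the in-progress refactoring:

```python
# _xxsubinterpreters is a private module: it may change or disappear.
import _xxsubinterpreters as interpreters

interp = interpreters.create()  # a second interpreter in the same process
interpreters.run_string(interp, "print('hello from a sub-interpreter')")
interpreters.destroy(interp)
```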
So, to picture it: instead of having a single process with multiple threads, where all the threads of the same interpreter are limited to one CPU overall, with sub-interpreters each interpreter has its own threads and each interpreter gets a dedicated CPU, so again you get an efficiency of 100%. The expectations for sub-interpreters are a lower memory footprint, because we should be able to share more memory, and faster locks, because the locks live inside the same process; it's not a lock shared between two processes. But one of the limitations of sub-interpreters is that you cannot directly exchange a Python object between two interpreters: you have to design your application so that it does not share objects.

So the summary of the future of Python performance: the current C API has design issues; there is a new API called HPy, based on handles, which is a work in progress; we should try to use a tracing garbage collector for CPython; and there is an exciting project to run sub-interpreters in parallel inside the same process.

To conclude, there were many previous optimization attempts which failed, but Cython, multiprocessing, Numba, and also C and Rust extensions work well to make your code faster. And HPy, a tracing garbage collector, and sub-interpreters are very promising projects to optimize Python. So thank you all. I'm giving you links to the different projects that I mentioned: PyPy, my notes on how to make CPython faster, my notes on the Python C API, and the speed.python.org project where we track the performance of Python. There is also a mailing list to discuss optimization projects, speed@python.org. And the last link is a link to the HPy project. I will not have enough time for questions, so if you want, you can ask me your questions on Zulip. Thank you!