Welcome to the first talk today, on FAT Python, by Victor Stinner.

Hi, so, hi, my name is Victor Stinner. I'm currently working for Red Hat, on the OpenStack project. One part of my work is to port the giant beast called OpenStack to Python 3. The good news is that I'm almost done, because I ported more than 90% of the projects, so Python 3 is coming. I have also been a Python core developer for something like five years or more, and today I'm here to present a new project called FAT Python.

First I will try to explain why Python is slow and why this specific language is more difficult to optimize than some others. If you want to say that Python is slow, you have to compare it to something else. A common comparison is with the C language: Python is sometimes almost the same speed, but in some corner cases it's up to 20 times slower. And when I say the C language, C is compiled to machine code, so the code is executed directly by the CPU, whereas Python is interpreted: the source is compiled to bytecode, and the bytecode is executed by a virtual machine, at least in the case of CPython. You can also compare Python to JavaScript, because JavaScript is a dynamic language like Python, but JavaScript has very efficient JIT compilers; you can find them in many browsers. Compared to JavaScript, Python is still slower.

But we already have much faster implementations of Python. The most famous one, the most advanced and most stable, is PyPy, which has been around for something like ten years. It's fully compatible with CPython, and it's really fast, like five or ten times faster than CPython, sometimes even more, depending on the specific kind of application and your workload. There is a newer project called Pyston, which is sponsored by Dropbox. It's a fork of CPython 2.7 based on LLVM; the idea is to keep compatibility with C extensions, but in some cases to compile the Python code to machine code. Another project is called Pyjion. It's made by Microsoft and it's a little bit younger than Pyston: I think Pyston is two years old and Pyjion is one year old. Pyjion is based on the Microsoft CoreCLR. Another kind of project is Numba. Numba is not a full implementation of Python; it's a JIT compiler where you have to annotate your functions with something like an @jit decorator, and it's specialized for numbers. For example, it's very efficient for NumPy, but it's not a generic implementation of Python: you cannot make Django, for example, much faster with Numba. Another common example is Cython. Cython is not really an implementation of Python; it's more a compiler that takes Python source code and converts it to something like a C extension, but you can also annotate the types to enable even more optimizations. If you start to annotate the types, it's no longer Python, but it looks very close to Python.

The first question is: why do we need a new optimizer? The fact is that I'm working on the OpenStack project, and in OpenStack we are still using CPython, because it's still the reference implementation, and some people tried to use PyPy, but there is not enough support in the OpenStack community to fix some simple issues.
So about Pyston from Dropbox, the issue is that they started from Python 2, and we are all moving slowly to Python 3, and they don't plan to support Python 3 right now. Pyjion is still a little bit young, and I'm not sure that the CoreCLR from Microsoft is really optimized for all platforms like Linux or Mac OS X. So I'm trying to do my best to make CPython a little bit faster.

Another fact is that PyPy is not always faster than CPython. A common bottleneck of PyPy is when you use a C extension, because to support C extensions PyPy has to emulate the CPython object memory layout: you need two views of the data, the optimized view of PyPy and the way the data is represented for the C extension. They also have to emulate reference counting and many other complex tricks, and because of that, running C extensions on PyPy is slower; PyPy was written from scratch, so it doesn't have the huge C API.

Another fact is that CPython remains the reference implementation for new features. For example, if you compare Python 2.7 to the future Python 3.6, there is a wide range of new features: maybe 10 or 20 new modules, but also changes in the syntax, like the await and async keywords, and also the new f-strings in Python 3.6. Python is moving, and it's moving first in CPython.

Sadly, many libraries and applications rely on CPython "implementation details". I have to put that in quotes because they are not really details; it's a little more complex, but applications continue to rely on them. Implementation details of CPython are, for example, the C API, as I said, but another good example is the garbage collector, because in CPython garbage collection is based on reference counting. It means that when you release the last reference to an object, it's destroyed immediately, but in PyPy they decided to use a more efficient garbage collector, and the consequence is that your object may be destroyed later; you don't know exactly when. So if you write your code like this, for example, if you open a file for writing, you put data in the file and you forget to close it explicitly, the data may not be on disk, depending on when the destructor is called. A good practice is to call the close() method or to use a context manager, but there is still a lot of code in the wild which is not written correctly. For your information, in Python we now have a ResourceWarning to detect that issue.

To simplify, the goal of the FAT Python project is to replace a call to the len() function computing the length of the string "abc" directly with the result, the number 3. The goal looks quite simple, but I will explain why it's not as simple as you might expect. The first blocking point is that everything in Python is mutable. When I say everything, it's just everything in the language. To give you some examples: built-in functions like the len() function can be replaced at runtime; you can even modify the bytecode of a function at runtime; and obviously the value of a global variable can change at any time. There is no such thing as a constant in the language: you cannot rely on the value of a global variable, because it can change at any time, so you have to reload the value each time. To give you an example with a built-in function, you can replace the len() function at runtime, and when you call it, instead of getting the length of the string, you get the string "mock".
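As a rough illustration of that last point (not from the slides, and the fake_len name is just an example), this is how a built-in can be swapped out at runtime:

    import builtins

    def fake_len(obj):
        # a replacement that ignores its argument entirely
        return "mock"

    # replace the built-in len() for the whole process
    builtins.len = fake_len

    print(len("abc"))  # prints "mock" instead of 3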
This example is maybe not very useful in itself, but it's very common in unit tests to use the mock module to reduce the complexity of a unit test and only test one specific function.

FAT Python is not my first attempt to optimize CPython. In the past I wrote astoptimizer, which is a simple AST optimizer. I also wrote registervm, which is a new implementation of the loop evaluating the bytecode: instead of using a stack, it uses virtual registers, not CPU registers. Both projects implemented optimizations like replacing len("abc") with 3. Because of that, I got bad feedback on my projects, because they changed the Python semantics, and people explained to me at length that they really want Python to remain dynamic: they chose this language because it's dynamic and because you are able to replace everything at runtime. Even if it looks ugly at first glance, in some specific cases it's very useful to be able to modify anything.

If you would like to write a new optimizer, you have to respect some rules, some constraints. The first one is to not change the Python semantics; that's something really important for the Python community. Obviously, you should not break applications: if you run the code with your optimizer, it should continue to work as it did without the optimizer. A good property would be to not have to modify the source code, because I don't want to write something like Numba, which requires putting decorators on functions or doing special stuff in the code. The idea is to be able to optimize any kind of application, because I want to have the fastest language in the community; I hope that if Python becomes faster, more people will use it.

Now I will present some ideas to work around these limitations, ideas which respect the Python semantics but still allow us to optimize the code. The first thing is this: to implement efficient optimizations which provide a visible speed-up on real applications, and not only on a tiny micro-benchmark, you have to make assumptions about the code. The tool for making assumptions is called guards. Guards are basically checks made at runtime; for example, a guard can check whether the built-in len() function was replaced or not at runtime.

A very important feature of Python is namespaces. Namespaces are used more or less everywhere to store data: in a module, the global variables are in a namespace; in a function, the local variables are stored in a namespace; in a class, a namespace is used for class variables but also methods; for an instance, it's used for the attributes of the object, et cetera. Technically, a namespace in Python is a dictionary. The technical challenge of writing a guard on a namespace is to have a check which is faster than a dictionary lookup, because, you may not know it, but a dictionary lookup in Python is very fast; if you would like to avoid the lookup, the check must be even faster. I propose a solution for that: a new PEP to add a version to dictionaries. I will detail the PEP later.

A second tool to optimize the code is to specialize the code. The idea is to make some assumptions about the code and enable optimizations under those assumptions: it's called code specialization. To be able to use the specialized code, you have to check guards at runtime to decide whether to use the specialized code or the regular code.
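To make the combination of a guard and specialized code concrete, here is a rough sketch in plain Python of what FAT Python does in C; the _specialized and call_with_guard names are only illustrative:

    import builtins

    # remember the built-in len() at specialization time
    _original_len = builtins.len

    def _specialized():
        # specialized code: len("abc") was precomputed to 3
        return 3

    def call_with_guard():
        # guard: check that the built-in len() was not replaced at runtime
        # (the real guard avoids even this lookup by using the dictionary version)
        if builtins.len is _original_len:
            return _specialized()
        # something changed: fall back to the regular code
        return len("abc")

    print(call_with_guard())  # 3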
One example of specialization: if you have a function with two parameters, x and y, and the two parameters are usually integers, you can specialize the function to be optimized for integers, because if you know that they are integers, you can enable a lot of optimizations which are not possible in the common case, when you don't know the types. Then the code to call the function becomes: first you check the guards, and you pass the function parameters, to implement guards on the types of the parameters. If the guards say everything is fine, nothing changed, you use the specialized code; if something changed, you just fall back to the regular bytecode.

Python already has an optimizer, called the peephole optimizer. It's an optimizer working on the bytecode, and it implements only simple optimizations like constant folding, dead code elimination, and optimizations on jumps. The annoying point is that it's written in C. Because it's written in C, it's not easy to extend it to implement new optimizations. Moreover, it has a very narrow view of the bytecode: it only sees a few instructions, maybe one or two instructions in advance, so it only has a very tiny knowledge of the code. For example, it doesn't know the whole function and it doesn't know the whole module, so you are very limited in the kind of optimizations you can do.

But Python provides something more interesting, called the AST, the abstract syntax tree. When Python compiles a file to bytecode, there are in fact intermediate steps. The first one is tokenization, which takes letters and groups them into tokens, and the tokens are compiled to an AST. The AST is a high-level representation of the code: it contains all the information, but as a tree, which is very convenient to analyze and to process, and it also has types on the nodes, so it's even easier to analyze. The AST is then compiled to bytecode. At the bottom, I show an example of the AST for the call len("abc"). You can see that the call has the type Call, so you know directly that it's a call. It has two parts, the function and the arguments: the function, in this case, is "load the name len" from the globals or from the built-ins, and there is one argument, which is a string, so you get the type string and its content. To write the most simple AST optimizer, which just replaces the call with its result, you can use the ast module, which is part of the standard library. The module has visit methods, and depending on the name of the method, you visit one kind of node; in this case, we replace the call with its result.

Optimizations. So we have guards, we have specialization; what we can do with that is implement some optimizations. The following optimizations are already implemented in the FAT Python project. For example, when you call a built-in function, you can replace the call with the value: instead of having to call it each time, you directly get the constant, so you don't have to compute the result every time. You can also simplify iterables, for example replace a call to the range() function with a tuple, because later, if you combine multiple optimizations, it becomes much more interesting to have a constant as the iterable. And yes, when you optimize a built-in function, you need a guard on the replaced built-in function; for range(), you also need a guard on the range() function.
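Going back to the simple AST optimizer mentioned a moment ago, here is a minimal sketch with ast.NodeTransformer, assuming Python 3.8 or later where literals are ast.Constant nodes; it replaces len("abc") with 3, and it is shown without any guard, so on its own it would not preserve the Python semantics:

    import ast

    class LenOptimizer(ast.NodeTransformer):
        # replace len(<string literal>) with the precomputed length
        def visit_Call(self, node):
            self.generic_visit(node)
            if (isinstance(node.func, ast.Name) and node.func.id == "len"
                    and not node.keywords and len(node.args) == 1
                    and isinstance(node.args[0], ast.Constant)
                    and isinstance(node.args[0].value, str)):
                return ast.copy_location(ast.Constant(len(node.args[0].value)), node)
            return node

    tree = ast.parse("print(len('abc'))")
    tree = LenOptimizer().visit(tree)
    ast.fix_missing_locations(tree)
    exec(compile(tree, "<example>", "exec"))  # prints 3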
Another interesting optimization is loop unrolling. The idea is that instead of paying the cost of the for keyword, which has to create an iterator object, take the first item, take the second item, and continue until you get an exception, you duplicate the loop body once per iteration and generate an assignment for each one, for example x = 1, x = 2, x = 3. This optimization alone is not really interesting, but it enables even more optimizations that we will see later.

For example, a simple follow-up optimization is constant propagation: here you assign the value 1 to the variable x, so you have to store the value, and just after that you have to reload it, because Python is a stack-based VM, so you have to push and pop values all the time. To avoid the reload from the variable, you can just copy the value of the variable directly into the call: instead of print(x), you just call print(1).

Constant folding is a set of operations on constant values, so integers, strings, or tuples of integers. To give you some examples: if you ask for the positive value of 5, +5, it's just the number 5. If you would like to check whether one element is in a list, instead of creating the list at runtime, you can convert it to a tuple, which is only built once. You also have operations on strings, operations on substrings, et cetera. The last one is interesting, because it's not a constant, it's a list, but even if it's a mutable list, you know that the result will always be the same, so you can replace the operation directly with the value.

Something else is that you can avoid a LOAD_GLOBAL instruction, because when you call a built-in function like len(), each time you have to reload the function: you have to check in the globals, and after that you have to check in the built-ins, because the function is in fact in the built-ins, so it requires two lookups. This instruction, LOAD_GLOBAL, can be replaced with LOAD_CONST: you inject the built-in function into the constants of the function, and if you do that, you avoid the two lookups.

Another simple change is to remove dead code. For example, if you have a test with an if block and an else block, but the if block is empty, you can just invert the condition and remove the if block; it's useful to avoid jumps at the bytecode level. If you have a test and the test is known to always be false, you can just remove the whole test. And if you have a final instruction like return, raise, or something else which ends the control flow, you can just remove whatever comes after the final instruction.

Now we can talk about the implementation. The good news is that I already got some changes merged into CPython. The first one is a new type of AST node, which is called Constant. It simplifies the optimizer, because instead of having to check each time whether the type is, for example, NameConstant, Num, Str or Bytes, you have a single type, so it makes the check easier. Moreover, if you have a tuple of constant objects, or a frozenset of constant objects, you can replace it once, and after that, in the optimizer, you only need one test.
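A tiny sketch of what that single check looks like with the ast module, on a recent Python (3.8 or later) where the parser itself already produces Constant nodes:

    import ast

    node = ast.parse("3", mode="eval").body

    # before the Constant node, an optimizer had to test several classes:
    # ast.Num, ast.Str, ast.Bytes, ast.NameConstant, ...
    # with the Constant node, a single check is enough:
    print(isinstance(node, ast.Constant))  # True
    print(node.value)                      # 3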
Another change which was merged in Python 3.6 is support for negative line number deltas. In Python, we don't store the line number directly for each instruction, because it would cost too much memory and it's not efficient; instead we store a compressed table mapping instruction offsets to line numbers. When you implement an optimization like loop unrolling, the line numbers sometimes go backwards, and because we don't store the line number directly but a delta, my change simply allows storing negative deltas, so line numbers can go backwards. The latest change is to support tuple and frozenset constants directly in the compiler: this optimization already exists in the peephole optimizer, but it is implemented on the bytecode and not on the AST, and I would like to implement the same optimization at the AST level, so with my change you can directly emit a constant tuple or frozenset.

Okay, now I will present the three PEPs which I wrote to merge my work into CPython. The first one, PEP 509, adds a new version to dictionaries. The field is private: it's not visible at the Python level, only at the C level. The properties are that the version is increased at every change, and that the version is unique across all dictionaries. The second property, unique across all dictionaries, means that not only do you know whether something changed, you also know whether you are still using the same dictionary, because technically, in some cases, you can replace the namespace of a module, of a class, or something else, and you would like to make sure that the namespace is still the same one. Using the version, you can implement a guard on a namespace, because in the common case, if nothing changed, you just have to compare the versions and you avoid the lookup.

To give you an example of a guard: you get the version of the dictionary; if the version is exactly the same, you avoid the lookup, you are done. Otherwise you look up your key: if the value is still the same, it means something else changed, which doesn't matter for our use case, so you store the new version and you are done; otherwise, it means that the value changed. But in Python, if you look at built-in functions or class methods, it's very, very rare for something in those namespaces to be modified, so the hope, the expectation, is that you almost always take the fast path.
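The real guard lives in C and uses the private version field from PEP 509, but the algorithm just described can be modeled in plain Python; VersionedDict and Guard below are purely illustrative stand-ins, since the real version is not visible as a Python attribute:

    class VersionedDict(dict):
        """Toy stand-in for PEP 509: CPython keeps the version in a private C field."""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0
        def __setitem__(self, key, value):
            super().__setitem__(key, value)
            self.version += 1
        def __delitem__(self, key):
            super().__delitem__(key)
            self.version += 1

    class Guard:
        """Guard on one key of a versioned namespace."""
        def __init__(self, namespace, key):
            self.namespace = namespace
            self.key = key
            self.version = namespace.version
            self.value = namespace[key]
        def check(self):
            version = self.namespace.version
            if version == self.version:
                return True             # fast path: no lookup at all
            value = self.namespace.get(self.key)
            if value is self.value:
                self.version = version  # another key changed: guard still holds
                return True
            return False                # the watched value was replaced

    ns = VersionedDict(len=len)
    guard = Guard(ns, "len")
    print(guard.check())          # True: nothing changed
    ns["other"] = 1
    print(guard.check())          # True: version changed, but not the watched value
    ns["len"] = lambda obj: "mock"
    print(guard.check())          # False: the watched value was replaced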
The second PEP, PEP 510, is a PEP to specialize functions. It adds a new C function to the C API, called PyFunction_Specialize; you can use it to register specialized code together with guards, meaning that if the guards pass, you call the specialized code. I modified the ceval.c file, which contains the most important loop in Python, the loop which evaluates the bytecode, so that it checks the guards and, depending on the result, chooses which code should be executed. And not only can you register bytecode, you can also register any kind of callable, and you can generate the specialized code with any tool: in my case I'm using fatoptimizer, which works on the AST, but you can also imagine using Cython to generate machine code, using Pythran to generate very optimized C++ code, or maybe using the PEP in Numba to specialize code while keeping the Python semantics. To give you an example of specialization: instead of calling the chr() built-in function to generate a character, you can just replace the call with the value, and when you specialize the function, you register a guard on the built-in function.

The last PEP, PEP 511, is a PEP for code transformers. This PEP adds a new command line option, -o, and a new function called sys.set_code_transformers(). A code transformer can work at the bytecode level, but it can also work at the AST level. For example, with my PEP, the peephole optimizer becomes a code transformer, so it becomes part of the same process, and you can even disable the peephole optimizer if you want, or use your own optimizer, which may implement more changes, more optimizations.

The question is whether this will happen for Python 3.6. First, I got good feedback on my three PEPs and on the project in general, but the blocking point is that people are asking me to show concrete speed-ups on applications, not only on micro-benchmarks, and sadly, to be honest, today it's only faster on micro-benchmarks, because I spent a lot of time just implementing guards, implementing specialization, modifying the compiler to support the AST optimizer, and fixing bugs. So I did not have much time to implement amazing optimizations; it's more the foundation of the project, and in my opinion I need at least three months to implement something visible on applications.

What's coming next? I said that we can implement more optimizations, so here are just some ideas. When you unroll a loop, you get code which looks inefficient, because you assign a variable, x = 1, x = 2, x = 3, but after constant propagation the x variable is no longer used, so in this case you can just remove the assignments to x, because the variable is no longer used. Another example is to copy globals to constants: if you know, or rather if you check with a guard, that a global will not change, then instead of having to load the global each time in the function, you can just copy it into the function and implement more optimizations like constant folding on top; as usual, you need a guard on the key, the global. Another important optimization is function inlining, because in Python a function call has a significant cost, so instead of calling the function, the idea is to copy the function body to the place where the function is called, and if you combine this with other optimizations, you can produce much more efficient code.
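A rough before/after sketch of what inlining combined with the earlier optimizations could produce; this is hand-written for illustration, not actual output of fatoptimizer:

    # before: f() pays the cost of a call to add_one() every time
    def add_one(x):
        return x + 1

    def f():
        return add_one(2)

    # after inlining the call (guarded by a check that add_one was not replaced),
    # then constant folding on the inlined body:
    def f_specialized():
        return 3

    print(f(), f_specialized())  # 3 3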
And obviously you need a guard on the inlined function, because if the function is modified somehow, you still have to call the original, now modified, function.

Another, larger project is to implement profiling. This step is usually done at runtime in a JIT compiler: the JIT compiler first profiles the code while the code is running, and depending on some thresholds and some triggers it emits machine code. But I don't feel able to implement such a thing at runtime, because it's really complex; the PyPy guys took many years to implement something efficient. So my idea is to run the profiler first, on a known workload, for example when you run the tests.

So you ask me to stop, but I heard that I have 45 minutes, do you know? Okay. Just to finish quickly: I have a new perf module, which is a module to run benchmarks, and the idea is to spawn multiple processes and compute the average, because if you run a single process you only get one specific performance figure, but if you run it multiple times you get a more realistic value, and it's very effective at producing more stable benchmarks. You can also store all the data as JSON, and thanks to that you can display, compare and analyze the data afterwards. It's a library, and I already modified the CPython benchmark suite to use it, so we will get much more stable benchmarks. Okay, here I am. Do you have any questions?

We've got time for maybe two questions.

Q: Hi, thank you for the talk. What I wanted to ask is: you said that you modified ceval to check whether the guards are valid or not, but at what point are they checked? Because it's not only when you enter the function that you have to check the guards: you have to check every time you call other functions, every time you call eval or other things, because built-ins could have changed values, and a ton of other things.

A: With FAT Python you get guards which are checked at the entry point, but when you specialize the code, you can inject your own guards inside the code, so you are free to generate guards inside the function body, to decide inside the function body whether to take a fast path for one line or fall back to the regular code.

Q: Then I have a follow-up question: why don't you check the guards inside the specialized code and then bail out of it? It's just a decision, right?

A: Yes. For technical reasons it's simply easier to do it this way.

I think that's it. Thank you very much, Victor. That was excellent.