Hello. This is a talk by Serge Guelton. He's the developer of Pythran, and the talk is "Surviving in an open source niche: the Pythran case". Thanks. So we are back to a more intimate area, human scale, not like the one before. I'm going to spend the next 50 minutes speaking about a project I developed. It's nothing at the scale of Python 3 or whatever. There may be fewer users of this project than the number of people in this room, but that may change in the future. Anyway, I think the ideas I want to share are beneficial even if you don't use the software. A few words about me: I'm basically a research engineer in a security firm where I do compilation stuff. I'm also an associate researcher in a French university, the developer of the project I'm going to present to you, and an LLVM committer. So what is Pythran? Pythran is a compiler, a static compiler: no just-in-time compilation, ahead-of-time compilation just like you would use GCC for C or C++. You can use Pythran for Python code, but not for any Python code. It's for scientific kernels, only for the scientific world. And the input is not plain Python. Well, it is, but it's a strict subset. There is no extension or whatever: every code that compiles with Pythran is also valid Python code, but not the other way around. So basically you write your scientific kernels, and you add a few comments. As they are comments, they can be ignored and the Python code still runs, but the Pythran compiler understands these comments and uses them to generate native code. The comments are not meant to be intrusive. There are only a few comments at the function declaration level, nothing at the variable declaration level, no extension or whatever: you just state which functions you want to export, and the module will be compiled as a native module in which only those functions are available, for those particular signatures.
So here I have a Rosenbrock function. I don't know what it is, but I can write it with NumPy, and I can export it, stating that it accepts an array of integers or an array of double precision floating point numbers. So why is it a niche? It's a niche because scientific computing is a niche. Well, there is a quite big community in scientific computing and Python, mostly thanks to the NumPy, SciPy, Jupyter and Matplotlib stack, and when Python performance is not enough there is also a wide range of choices to optimize your code, ranging from Cython, which will be presented in the next talk, to numexpr, Numba and a few others. And there is also a lot happening, or trying to happen, in the world of compilation and Python. There are some very long-term projects like PyPy, Cython or Jython, some recent projects like Numba or HOPE, and a lot of dead projects, which is a hint about the difficulty of the task: Copperhead to generate GPU code, Pyston from Dropbox, Unladen Swallow, Parakeet. All those projects are now dead, but they tried some ideas to compile some kind of Python code. The ideas in the air when you want to compile Python code say: okay, it's easy, I just add types to every function declaration so I don't have any lazy binding or dynamic dispatch, everything will be static, and then I translate that to C++, or whatever, choose your poison, it doesn't matter as long as it's a statically compiled language. And I make a lot of assumptions about the imported modules: I don't care how Python actually imports modules, I use import just like I would use #include or things like that, and I stick to a subset of the language which is easy to compile. And it's relatively easy to make that work. But you can also do some more advanced stuff, because that's just translation, and translation is not compilation. Translation is what cat does, or sed does. Compilation transforms the code.
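A kernel with its export comment looks roughly like this. This is a sketch following the Pythran documentation's directive style; the exact signatures shown on the slide may differ, and as plain Python the comment is simply ignored:

```python
import numpy as np

# The comment below is invisible to the Python interpreter but tells
# Pythran which signatures to compile: here, int and float64 arrays.
# pythran export rosen(int[]), rosen(float64[])
def rosen(x):
    t0 = 100.0 * (x[1:] - x[:-1] ** 2) ** 2
    t1 = (1.0 - x[:-1]) ** 2
    return np.sum(t0 + t1)

print(rosen(np.array([1.0, 1.0, 1.0])))  # 0.0, the global minimum
```

The same file runs unmodified under CPython, which is the whole point: annotations never break the pure-Python path.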
So you can take all the compiler knowledge that has existed since the fifties and put it into compiling Python. There are some specificities in the language that make it an interesting target. You can use just-in-time compilation instead of ahead-of-time compilation, and you can try to support a larger subset of the language: support generators, support classes, metaclasses, whatever you want. And if you're really, really bold, you can try to be compatible with whatever Python code exists, not only a subset but any Python code, supporting the import mechanism, lazy binding, all this kind of stuff; that's what PyPy is doing. And you can also try to be compatible with, or to optimize, native extensions. Optimizing native extensions is a big deal. It's very important, especially in the scientific world, because people tend to use Python as glue, write kernels in native code, and then they want to optimize across native codes. Pythran is somewhere in the advanced area. It's definitely not in the expert area, but most of the basic stuff has been done. So it's open source software, BSD-licensed, started six years ago. You can find it on PyPI, the cheese shop, on Conda or on GitHub, depending on the way you want to use it. There are some Python dependencies: networkx for some graph algorithms in the compiler, NumPy for all the scientific stuff, ply, which is lex and yacc in Python, for the small language we use in the export line, and gast, which I'm going to speak about. Victor told us that moving from Python 2 to Python 3 is a matter of syntax. But when your input is Python, moving from Python 2 to Python 3 means changing the input of your program, like changing an SQL schema or things like that. So it was actually a very difficult thing for us to move from Python 2 to Python 3. It was not just adding parentheses around prints. It was changing the input: when you meet the range built-in as an input, does it create a list or does it create a generator? It's different stuff. You need to compile them differently.
So that was a difficult step, and we introduced a thin layer to abstract the Python AST: that's gast, a generic AST. You also need, and that used to be a very difficult requirement, a C++ compiler, a C++11 compiler, and back in the day that was not common. It's actually still difficult to find a compiler that handles every C++11 construct. Even on Linux, with Clang or GCC, there are still some corners that are not well handled, and when you switch to Windows, things start to be very difficult to support. But it's getting better, because Ubuntu now ships a decent version of GCC, so most people have a decent compiler and that's no longer a source of issues on GitHub. The community is quite small, so the bus factor is one: me. But I do receive a few contributions, sometimes from students of mine, so it's not really their choice, but they do enjoy contributing, and sometimes from strangers, which is something I really appreciate, because I'm not trying to grab a lot of attention, more to have fun technically and provide things that can be interesting for people as users. But sometimes users do submit a pull request; sometimes it's just to fix a typo in the readme, and sometimes they implement stuff. Behind this compiler, there are three ideas that make it quite different from other Python compilers. The first idea is that there is no mixed mode where Python code, or C code calling the Python C API, lives within pure C code and everything works together. That's the Cython approach. Our approach is: either you can compile to pure native code, with no call to the Python C API apart from at the frontier, or you can't compile at all. That may look like a very harsh choice, but it buys a lot of nice things. First, as there is no longer any Python C API call, you can release the GIL. So you can make multithreaded calls to Pythran-generated modules, release the GIL, and it works. That's a nice point.
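The calling pattern looks like this. This is a sketch: the `kernel` function here is a plain-Python stand-in, since a real Pythran-built module (hypothetically compiled with `pythran kernel.py`) would release the GIL inside the call and let these threads actually run in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a Pythran-compiled kernel. With the real compiled module,
# no Python C API is touched inside the call, so the GIL is released
# and the ThreadPoolExecutor below gives genuine multi-core speedup.
def kernel(n):
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(kernel, [10, 100, 1000, 10000]))

print(results[0])  # 285
```

With pure CPython the GIL serializes these calls; the point of the "no Python C API" rule is precisely that the compiled version does not have to hold it.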
The other point is that as you generate Python-free code, you could generate a native module that can be imported from Julia or from Rust. That's not implemented, but it's something that would work: you could prototype your code in Python and then import it from Julia or Rust. That's something I want to try this year. The other point is typing. Writing a good type inference algorithm is difficult, so either you have to read a lot of bibliography, which is in itself difficult, or you try to reinvent the wheel, and then realize that there is a lot of literature, and that it's not there for nothing. Or you say: okay, duck typing is really similar to C++ template instantiation. In fact, C++ templates use static polymorphism and duck typing provides dynamic polymorphism, but in a static world there is a match between the two. So basically we generate C++, and it happens to be correctly typed. We don't try to infer the types; we just generate C++ metaprograms, we instantiate them for several types, and native code gets generated. That sounds magic, but it actually works. So the typing code in Pythran is not brain-damaging, and that's cool for me, because that's not my area of expertise. The next idea is that we don't generate low-level code. If there is no loop in the original code, we don't try to generate a loop in the final code. We basically rewrote the Python built-ins, part of NumPy, the random module and itertools in C++, in a fully templated and generic way, and we generate calls to this high-level C++ library. The good thing is that if you want to increase the library support, you just have to write a bunch of decent C++ code, for some definition of decent, but still, it's feasible. Another good thing is that as we keep the source-to-source translation at a high level, if you put OpenMP annotations on your Python code, and we had this feature, then they can be translated to C++ and they still have the correct meaning.
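The duck-typing side of that correspondence can be seen in plain Python: one generic kernel, and each call site "instantiates" it for a different element type, which is exactly what Pythran mirrors by emitting a single C++ template and instantiating it once per exported signature. A small illustration (not Pythran's actual machinery):

```python
# One duck-typed kernel: any pair of iterables whose elements support
# + and * will work. Pythran would emit one C++ template for this and
# instantiate it per exported signature (int array, float array, ...).
def dot(xs, ys):
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

print(dot([1, 2, 3], [4, 5, 6]))    # 32, the "int instantiation"
print(dot((1.5, 2.5), (2.0, 2.0)))  # 8.0, the "float instantiation"
```

The compiler never has to infer the type of `acc`; the C++ compiler checks each instantiation, just as CPython would have checked each call at runtime.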
The semantics are respected, and we actually support most of the OpenMP 3 language from Python, so you get your multithreaded code that runs on multiple cores. So that sounds like a lot of stuff, but there's not that much code. Some commits, obviously. Pythonic, the C++ runtime layer, is 40,000 or 50,000 lines of C++ code, and that's most of the code. There are a lot of tests too, because this is a hobby, and I don't want to spend time debugging because I have better things to do. If I have a lot of tests, I can just write my code and launch the test suite. It takes three hours on Travis to run the whole suite, but then I can go cooking, shopping or whatever, come back, see that it works, and move on to the next task. So that's not my work, it's a hobby, and having a lot of tests is a good way to have a hobby that can still be useful. And the Python code is actually quite small: there are fewer than 20,000 lines of Python code in this compiler. Maybe that's because part of the job is moved to the C++ side, but also because Python is a high-level language, so writing a compiler in Python does not prove to be that difficult. So how does it work? You write your Python code as you would normally do. You add the pythran export lines, so you learn the syntax of that single line. Then you call Pythran. It generates C++ code that can be compiled with any decent C++ compiler into a native library, and that library can be imported just like a regular module, through the regular Python module import mechanism. But wait, it's not a translator. It looks like a translator from Python to C++ to native code, but it's not, because, recall, C++ is just a convenient backend. It's a very convenient backend, but it's just a backend. Pythran is a compiler. You can view it as a source-to-source toolbox.
You take your Python code, move to the abstract syntax tree, refine the syntax tree, optimize it for scientific computing, and then dump the result, either as Python code, which makes debugging easier, or as C++ code. So there are three important kinds of pieces in Pythran: analyses, which gather information about the syntax tree, and transformations and optimizations, which both transform the code, either to make it easier to analyze or to generate more optimized code. Here are just a bunch of keywords about the kind of compiler analyses we do. Use-def chains. Computing the effects of a function on memory, on its arguments, on global memory: when you call a random function, there is a side effect on a state, so your function is not pure, and there are things you can do with such a function and things you can't. We do pattern matching on the AST to replace an expression by a simpler one, or by a constant expression, which is basically equivalent to the constexpr keyword in C++, but done without any keyword at the Python level. Laziness analysis: computing that a list comprehension can be transformed into a generator expression, which avoids generating the whole list. Computing whether an expression is pure, whether it has side effects or not. If it doesn't have side effects, then we can move it around. If it has no side effects and only constant parameters, then we can fold it at compile time. So you can write fibonacci(20), and Pythran will realize that fibonacci is a pure function, compute it at compile time, and replace the call by the result of evaluating the function. You can compute ranges for some values, stating that this variable is going to be between 0 and 20, and then maybe perform some optimization based on that. One kind of optimization we do is a generalization of loop unrolling, but for any iterable.
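The source-to-source idea can be illustrated with a toy constant-folding pass written against Python's own ast module. This is a sketch, not Pythran's actual implementation: Pythran works on gast trees and folds interprocedurally, using the purity analysis described above:

```python
import ast

class ConstantFolder(ast.NodeTransformer):
    """Fold binary operations whose operands are literal constants."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first, bottom-up
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            try:
                value = eval(compile(ast.Expression(body=node), "<fold>", "eval"))
            except Exception:
                return node  # e.g. division by zero: leave it to runtime
            return ast.copy_location(ast.Constant(value), node)
        return node

tree = ast.parse("x = 2 * 3 + 4")
tree = ast.fix_missing_locations(ConstantFolder().visit(tree))
print(ast.unparse(tree))  # x = 10
```

Dumping the transformed tree back as Python source, as done here with `ast.unparse`, is exactly what makes the intermediate steps debuggable.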
So you can unroll a loop over a set, over a list, over a tuple, and that works. Constant folding, but interprocedural, thanks to the analyses I mentioned. Removing some modulo operations, which prove to be very costly at the assembly level when done on induction variables, something people tend to do when doing image processing. Forward substitution to avoid temporaries. Instruction combining to make some patterns appear. Simplifying the code based on the range analysis, removing dead code: all this kind of stuff can be done at the Python level. Then you say: okay, this is basically reimplementing GCC at the Python level, what's the use? But GCC does not understand the semantics of Python code. It doesn't know that numpy.ones has no side effects. So we do this kind of stuff at the Python level, because once you are at, say, the LLVM level, at the bytecode level, you don't have this information anymore, so it's the right place. There are several layers of abstraction, and at the Python layer you can do some optimizations; I don't do register allocation at that level because that's not my job. But wait, there are other compilers, and Numba, for example, expects your code to look like this: explicit loops, variable declarations, and then it can compile that to efficient code. But you could also write Fortran code that looks like that. We are in the 21st century. You can expect to write higher-level code that still performs the same operation. Actually, this code can be rewritten in NumPy like this, and that's higher level. There are some temporaries that get generated, because of the call to sum or because of the array expression at the end, but it's easier to maintain. Scientific people tend to write this kind of code, and then they say, okay, I want performance, and I remember I used to write C, and they go back to the low-level implementation. Even in Fortran, you can write this kind of high-level code.
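The modulo-on-induction-variables optimization mentioned above can be sketched in plain Python. This is an illustration of the idea, not Pythran's generated code: the division hidden inside `%` is replaced by a counter with a conditional reset, which is much cheaper per iteration:

```python
def wrap_indices_modulo(m, n):
    """Naive version: one modulo (a hardware division) per iteration."""
    return [i % n for i in range(m)]

def wrap_indices_reduced(m, n):
    """Strength-reduced version: the modulo on the induction variable
    becomes an increment with a conditional reset."""
    out = []
    j = 0
    for _ in range(m):
        out.append(j)
        j += 1
        if j == n:  # compare-and-reset instead of dividing every time
            j = 0
    return out

print(wrap_indices_reduced(7, 3))  # [0, 1, 2, 0, 1, 2, 0]
```

This rewrite is only valid because the analysis knows `i` is an induction variable stepping by one, which is exactly the range information computed earlier.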
So there is no reason why you should stick to the low-level style. Pythran tries to compile this kind of code, which proves to be more difficult in a way, but that's why you need a compiler. One of the most challenging kernels I had to handle was this one. It was on Stack Overflow: the guy just posted his code and said, okay, this is slow, how can I make it faster? And basically the answer was: okay, that's high-level code, write it at a lower level, make all the loops explicit (there are still plenty of loops), and then you can call Cython or Numba on it and it will get faster. And that's very pragmatic. I'm not saying that Cython or Numba are bad compilers or whatever. They are very pragmatic, and they work, which is something very useful. But still, we can dream a bit, and when it's not your work, you are allowed to dream. So Pythran tries to optimize this kind of code, and we can reach performance similar to Numba or Cython on this code, while not writing all the loops. So that's what we try to do, but I'm relatively alone. There are a few power users: some people used Pythran to compile code that went into a small robot in the Baltic Sea, so there is Pythran code that runs under the sea, which is very cool. There is a firm in Grenoble, in France, that uses Pythran for their daily tasks. Some academic works have been published using Pythran as an engine, not just as a subject of interest. And Martin introduced a way to use Pythran in PX. And while there are not a lot of developers, there are a lot of bug reporters, which is cool, because when your work is useful to someone, you're happy and you're more motivated to spend a few more hours during the night improving your stuff. And some user suggestions are very nice. And as it's a small community, you get to know people and exchange not only about code but about any subject, and I do appreciate that part of open source life. But the road is very long. Supporting NumPy is a tremendous task.
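The contrast between the two styles can be sketched with a hypothetical kernel (pairwise Euclidean distances, chosen here for illustration; it is not the Stack Overflow kernel from the slide). Both versions compute the same thing:

```python
import numpy as np

def pairwise_loops(X):
    """Low-level style: all loops explicit, what one is told to write
    before handing the code to Cython or Numba."""
    n, d = X.shape
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                s += diff * diff
            D[i, j] = s ** 0.5
    return D

def pairwise_numpy(X):
    """High-level style: one broadcast expression, the kind of input
    Pythran aims to compile without asking for explicit loops."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff * diff).sum(axis=-1))

X = np.array([[0.0, 0.0], [3.0, 4.0]])
print(pairwise_numpy(X)[0, 1])  # 5.0
```

Under plain NumPy the second version allocates several temporaries; the compiler's job is to fuse them away so that the readable version is also the fast one.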
We're not supporting the whole NumPy API, even if we keep improving on that. Moving from supporting only Python 2 to supporting both Python 2 and Python 3 was very difficult. Supporting OS X was okay. Supporting Windows is getting okay as of next week, not last week, but it's still difficult, and it's only for Python 3, because Visual Studio for Python 2 is stuck at Visual Studio 2008, which does not support C++11 at all. So where do I find the motivation to keep going, because it's not an easy road? In my opinion, when it's not your job, you have to find an interest. Either it's for fame, but that's not exactly the case here, or because it's interesting as a technical challenge. And one thing that serves Pythran and that I also find fun: you have to be good at optimization, low-level optimization, and understand assembly when you try to debug vectorized SSE instructions. You learn a lot about the Python language, because you're manipulating the AST, so you discover constructs that exist in the syntax and say, oh, I didn't know about that, and then you dive into the language. And you write a lot of modern C++, or C++ that tries to be modern, and there's a lot to learn there. So for me it's interesting to write this code because I learn a lot. But it's also interesting because I get to meet other people, and the scientific community in Python is very friendly. The first time I went to SciPy... I used to go to academic conferences, and it's not saying anything bad that people at academic conferences are less friendly than people at SciPy, it's just not the same kind of people. So I really love speaking with these kinds of people, sharing ideas; you learn tricks about optimization and benchmarking, and you grow in knowledge and friendship with people you would never have met otherwise. Even if your project is in a niche, there are still interesting people to meet, and that's very cool.
The kind of thing you discover: okay, Jupyter. Five years ago I didn't know about notebooks, and then I saw this at a conference and I tried it. Okay, Cython does it, so we should be able to have a Jupyter magic too, and now there is a Pythran magic. You write your code, you call Pythran with your compiler flags, just the same as GCC flags, and it generates a native module imported into the kernel, and you can go on. You learn about capsules. A Python capsule is an opaque object around a native function or native data; it provides a minimal interface embedding a pointer and a name, and it is used to pass data from the native world to the Python world and back to the native world. Pythran can generate functions like this one, which accepts a pointer to a matrix, and what it generates is not a Python function that can be called at the Python level, but a capsule that embeds the native C++ function generated by Pythran. Then you can pass this to SciPy as an optimization routine and it works: there is no Python glue anywhere, and SciPy's native code calls Pythran-generated native code without any overhead. I didn't know about that, but Martin told me about it and said, oh, that would be cool, and because we generate pure native code, that's easy to do in Pythran; implementing it took me two days, thanks to the original design. Well, that's just things I didn't know about in Python. You also discover that there is a standard to represent floating point numbers, IEEE 754, but NumPy doesn't care about it, which means that when you do complex number operations and the imaginary part is not a number and you multiply this by infinity, what happens? Who cares?
I don't, but the standard does, and the NumPy people don't care either. I discovered that because my native code was running slower than the NumPy code and I couldn't understand why. I looked at the code: okay, that's a complex multiplication. Then I looked at the binary from NumPy, because most of NumPy is written in C, and said: okay, that's not the same complex operation. And then you discover things, because Pythran can generate vectorized code, not vectorized as in vectorized array operations, but vectorized as in using the AVX or SSE instructions that are available in modern processors. I developed some more skills on that, mostly based on Boost.SIMD, but you also learn to debug that. So these are just technical skills that happen to be fun. But wait, I also have a family. I'm the proud father of two lovely girls and I want to spend time with them, so there are two options: either I teach them Python, and I tried that but it was not a success, or I don't spend that much time on my laptop and spend time with my children. So how do you find a balance between your regular work, which has nothing to do with Pythran or optimization, your family, your health if you want to do sport or whatever, and open source? Either you sacrifice an element, which is a possibility, but on my side I try to make sure that when I do open source, there's a benefit for the other items too. When you gather more technical knowledge, you're better at work; for instance, tomorrow I will present something related to my work, and as a side effect they pay for my travel to come here today, so that's cool. You can try to meet friends or family: I have a sister that lives in Brussels, and I'm going to meet her tonight, that's cool. You can try to raise money, because when your wife is playing piano, if you develop, she's not very happy, but if you develop and you get money, then she's happy. So if you find a way to fund your work, it's more legitimate to spend your time on your laptop. You may be an idealist and think that sharing knowledge is a
good thing, and then just speaking in front of people, or teaching scientific Python to researchers in France, is something you enjoy doing, and then you can do that. So I'm trying to mix everything, and Pythran is just a stone, maybe the keystone, of that: it enables a lot of things. And to my mind it's a pet project, it's not work. At one time I was working with one of my former students, and we wanted to be very good engineers, so there was a very harsh review process for every pull request, and we tried to write the best code ever. After six months, motivation was gone, because when you wait three weeks, refactor your code, and in the end it gets in, but one month later, it's not fun at all. You do that at work, and you get paid for it. So there is a balance with the previous point: I wanted to be correct, but not too correct. I don't care if I'm not supporting this sub-feature, because maybe someone will notice there is an issue and report it, and then I will develop it. It's okay to do that, because it's your free time. So we switched from this very harsh review to a lighter review, still a review, and now it's healthier. And I don't have a Twitter account and I'm not trying to advertise a lot, just going at my own pace, and some people use it even if I'm not advertising that much. As I'm not making a living from it, that's okay, and I'm probably happier like that. Happier, but I still want more. I have funding to make Pythran a piece of SciPy or a piece of Sage, which is a good thing for the project, because bigger kernels mean meeting the limits of Pythran, and I also get to meet a new community. The SciPy community was very welcoming, and I really appreciated that, but they also had a lot of requirements. They said: okay, why Pythran? Well, why not, but first you have to support OS X, Windows and Linux, and then, what is the size of the binaries you generate? Because we have requirements on that. Okay,
C++? No, the size is going to be huge. And are there a lot of contributors? Okay, no, that's not the case. But by now the binaries generated by Pythran are very slim, because when you never focused on an aspect and then people tell you, please optimize that, there is a lot of low-hanging fruit. I have a blog post on that, you can click on it later, which explains how I made my binaries 10 to 20 times smaller, just by using the compiler the right way and using C++ the right way. Windows support, as long as you stick to Python 3, is not that difficult. Project maturity is going to be difficult. Just to showcase, or to express my feelings: six years ago I was hired by a company, and they wanted to start compiling Python projects, but I was leaving. Okay, you're leaving; well, instead of doing that on our own, you can do it as open source software and we will pay you for it, because we already have a grant, and the job will get done. That was a very good idea, very nice from them. Then there is a European grant, from OpenDreamKit, for improving Pythran and improving its use. I regularly give lessons about numeric Python in the groupe Calcul in France, and I appreciate their help. I try to be active in the French community on LinuxFr; they are very friendly, and I just enjoy writing: it helps me make my mind clearer, and some people enjoy reading it, so that's okay. And sometimes very strange things happen. I received a mail, that was two years ago, saying: hi, you don't know me, I read your PhD thesis. When someone tells you, I read your PhD thesis, that's strange, very strange. But they had investigated my work and they said, the thing you've been doing in Pythran is nice, we think you have ideas that can be helpful to us, and then we worked together for one year. That's cool. I mean, that was completely unexpected. Just saying thank you would have been nice, but going that far was very nice. And last December, there was someone
from Google Brain who just sent me an email saying: okay, we are using gast, your Python 2 / Python 3 compatibility layer, and it works, it just works, we are happy, thank you. And wow, that's great, that's the best thing you can expect. So that's my little story. Use Pythran if you like, contribute to open source because it's fun, and if you have any questions, we have 10 minutes or so. Hello, very nice talk by the way, I really enjoyed it. I would like to know, since you are using NumPy, do you rely on a BLAS or LAPACK implementation, or is it a pure C++ implementation? No, we are not that mad. For all the dot operations we fall back to BLAS. What we can do that NumPy can't do is that the BLAS API is not only about matrix multiplication, so we match patterns and say, okay, this pattern is implemented in BLAS, so we generate the right call. We can use BLAS or OpenBLAS as a backend; I never tried with MKL, but I assume it should work. That was my next question, thank you. Thanks for the great talk. I was wondering, how does Pythran compare to Numba and Cython in terms of performance? So, the question was about performance. I didn't show any benchmarks here because I want to make friends. Basically, it depends. First, Numba is a JIT, so there is JIT compilation time; it's cached, but it depends on the usage. So I will speak about Pythran, Cython and Numba. There are a lot of kernels that Pythran supports that are not supported as-is by Numba or Cython, because you have to expand the loops, and in those cases we generally match the Numba performance. Cython is generally the goal: we try to be as fast as Cython-generated code while keeping a higher-level input and without any annotations. Sometimes Numba is faster, sometimes Cython is faster, sometimes Pythran is faster. One place where we shine is, for instance, as it was at the beginning, I can go back to that, this function, the Rosenbrock function: we are especially fast on that, because we know how to generate AVX
instructions for this kernel, and Cython relies on GCC or Clang to generate these instructions, but at that point it's too late because of aliasing, so it's not as efficient as manually generated patterns. From a compiler's point of view it's a difficult task; less difficult now, but still difficult, to generate good vectorized code. I know that Numba has a decorator that builds ufuncs, but I'm pretty sure they don't match Pythran's performance for that kind of kernel. Then in some situations, for instance for the Gray-Scott kernel, Cython is still slightly faster than Pythran, but there is a lot of work to match the lower-level interface with that higher-level interface. Numba supports classes, Cython supports classes, Pythran does not, because Pythran is like Fortran, and there are no classes in Fortran, so that's legitimate and it's not a big deal. But some kernels I can't compare, because we don't support the same input. Basically, that's the idea: we all go roughly the same way. The one thing Pythran has that the others don't have is native vector instruction support, and in some situations, compiler optimizations. I didn't speak about that too much, but I have a blog post on it: the modulo operation is not optimized by GCC or Clang, and if it's not optimized by Clang, it's not optimized by LLVM in fact, and so neither Cython nor Numba take advantage of the optimization we developed in Pythran. So in that specific case we are faster, because we have Pythran-specific optimizations. Thank you. I have a question about dependencies of the generated C++ code: what does the GCC- or Clang-compiled code depend on? You mentioned some algebra library, BLAS; I suppose the standard library; is there something else you need to be able to compile? Currently we depend on the standard library, for all the random stuff for instance, or for the vector implementation, which we did not reimplement; we depend on BLAS for the linear algebra, and that's all, plus we depend on Boost.SIMD as an abstraction layer for vector
instructions. It's shipped with Pythran, so you don't need to install it, if you're concerned about dependencies when installing my code; you just need a modern C++ infrastructure and that's okay, plus BLAS, but you probably have it because you have NumPy installed, though I'm not 100% sure. And that's it. I'd like to ask about SIMD instructions inside the C++ code: do you use intrinsics, compiler intrinsics? Not intrinsics directly, because of Boost.SIMD, which provides a target-independent abstraction layer: you manipulate vector data and it generates the right intrinsics depending on the current platform. Not for old platforms, but x86 is okay, and ARM for example is probably okay, though not the latest instructions, I'm not 100% sure. Thank you, very interesting. I have a question about array operations and NumPy: one of the performance killers when it comes to array operations is the automatic creation of temporaries when you just write arithmetic on arrays. So if I understand correctly, the question is about this kind of expression: a big array expression, like the last assignment here. In NumPy, when you evaluate that, a temporary array is created for each node in the expression, and that's a double source of slowdown. The first source is memory locality: it's not good, because you're allocating new memory and writing to it, so the data are not in the cache, and the temporaries are created just to hold intermediate values. It's also bad from a loop overhead point of view, because there is a loop for this operation, and a loop for this one, and a loop for this one, so there are a lot of loops and poor memory locality. The usual way to catch this pattern in C++ is to use expression templates, so that you basically delay the evaluation of the expression until its assignment, and that's what we do: we have expression templates. But I think this kind of expression is relatively easy; the Rosenbrock one would be more tricky. What?
The Rosenbrock example would be more tricky? The Rosenbrock one was relatively okay; this one is difficult, because lowercase u and v are views on uppercase U and V, so you have an expression that creates a view, then you use this view right here, and you update the contents of capital U through the view lowercase u. Handling that with expression templates is tricky, but it's also a great way to learn that C++ has move semantics for member functions too; there are a lot of things to learn when you want to optimize this. I'm not saying it's the most beautiful code I ever wrote, but it turns out to work, which is already a good thing. But that's the difficult part. Thanks, see you again, and have a nice weekend.