Okay, hello everyone, welcome to the session. My name is Batuhan, I'm a high school student from Turkey, and in this talk I'm going to talk about LLVM and Python. The talk is specific to Python and the approaches taken by Python's JIT compilers that use LLVM, but it contains some points that apply to compiling any dynamic language to LLVM IR.

We are going to start with CPython's workflow and pipeline. You may wonder why this is relevant: most of these projects tap into stages of this pipeline to generate LLVM IR and JIT compile. Then we are going to see the dynamic nature of CPython: why you shouldn't trust user code, and why we can't easily convert Python to LLVM IR or other representations.

Then we are going to look at some projects that are no longer under development and are dead for various reasons. The first one is Unladen Swallow, a Google-sponsored fork of CPython which tried to bring LLVM into Python's core, but then Google and Google's clients lost interest, so that was that. Then there is py2llvm, an old project that was later picked up by a new developer and revived as PyLLVM; both are dead right now. It was a JIT compiler for Python 2. Pyston is a relatively new project. It was sponsored by Dropbox, but then Dropbox moved its performance-critical code to other languages such as Go, and it is also dead. There are some current projects like Numba. Numba is a JIT compiler for NumPy-oriented, mathematical Python code which uses llvmlite at the back end. llvmlite is a Python binding for LLVM which offers a great API, including a way to access LLVM's JIT execution engine. In the last part, we are going to see how we can implement a JIT compiler for Python in two ways: one converting Python's AST into LLVM IR, and one converting Python's bytecode into LLVM IR.

Let's start with CPython's workflow. CPython takes your source and generates tokens from it. Tokens are individual parts with no relations between them. Then CPython's parser takes the tokens and constructs a concrete syntax tree, which has relationships but is hard to work on. Then it strips all of the irrelevant information from that concrete syntax tree and constructs an AST. From the AST, it builds a control-flow-graph-like structure: there are no real control flow graphs in CPython, but there is a CFG-like structure it uses while generating bytecode from the AST. Then it runs that bytecode on Python's virtual machine. The AST and the bytecode are suitable for code generation, but the other stages aren't. Let's see them in detail.

Tokenization starts with reading your Python file; it can be a .py file or standard input, that doesn't matter. It reads the file's contents in the encoding you have specified, or UTF-8 by default, and tokenizes it with patterns that depend on the context. It knows how to split a.b, where a has an individual meaning of its own and b has an individual meaning of its own, but it doesn't split 1.1, because 1.1 is a float. You can't use tokens to generate machine code, because there are no relationships, no blocks, nothing; it's a pretty rough format. For 1 + 1, for example, you get tokens like NUMBER for the 1 and OP for the plus, but you can't generate code from this because there are no relationships between them.
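The token slide itself doesn't survive in the transcript, but you can reproduce it with the standard tokenize module; a minimal sketch:

    import io
    import tokenize

    # "a.b" is split into NAME, OP, NAME tokens, but "1.1" stays one NUMBER
    # token because it is a float literal.
    source = "a.b + 1.1"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))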
Then there is the parser. The parser is a program that takes the generated tokens and uses grammar rules to construct a tree-like structure, the concrete syntax tree. It's pretty complex because it carries irrelevant information such as whitespace or parentheses. There are relationships, but the nodes aren't suitable for direct code generation, and CPython's parser module interface is deprecated and will be removed; Python advocates working at the AST level instead.

The AST is the abstract version of that concrete syntax tree. Its nodes are defined in the Zephyr ASDL format, and they are easy to work on: if you parse a function, it tells you this is the return annotation, this is the function's name, these are the function's arguments, this is its body, et cetera. Python says you can manipulate the AST or generate your code from the AST, and gives you some interfaces and helper functions, but it tells you not to use the parser because it's so raw. The AST keeps only information relevant for generating code: it doesn't keep any whitespace or comments, but it keeps lineno information so syntax errors can highlight where they happened. Python ships a NodeVisitor interface: whenever it sees a Constant node, it calls the visit_Constant handler; whenever it sees a Name node, it calls the visit_Name handler, so you can generate intermediate representation in those handlers.

After the AST step, Python constructs code objects that contain bytecode and some metadata. A code object executes on CPython's virtual machine, which is a stack-based virtual machine: instructions push and pop values. Instructions are formatted as an opcode plus an oparg, the operation argument. There are some extra details on top of that; for example, if you use a LOAD_CONST opcode, you specify the argument as an index into the co_consts data structure, which makes Python faster. Unfortunately, there are no interfaces to assemble bytecode directly in Python itself; the interface Python offers assembles from the AST. But there are interfaces for disassembling code objects, which we can use when JIT-compiling a function's body.

Take an example function that takes x as input: if x is greater than four, it returns x to the power of two, otherwise x to the power of three. When we disassemble it, we see instructions. The first one is a LOAD_FAST: it loads a local variable from the arguments. They are all formatted like this, and one of them is a jump target. Python's stack-based virtual machine operates like this: it checks whether x is greater than four; if not, it jumps directly to the jump target; if it is true, it runs the fall-through part. So the jump target is where the return-power-of-three branch starts.

Yeah, that was the workflow of CPython. Now let's see why we can't directly trust user code: we aren't given the types of the arguments or of any other name. Everything is dynamic, and that makes our job harder. The user can even change builtins: they can replace the len function, or they can even put handlers inside class definition. When you define a class, Python loads something called __build_class__, which is a helper function, and if you replace __build_class__ in the builtins module, your function gets called whenever anyone defines a class. This makes things even harder. Compared to other languages, in Python everything is mutable unless you say otherwise.
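To make that concrete, here is a small self-contained example (my own illustration, not from the slides) that swaps out len and hooks __build_class__:

    import builtins

    # Nothing stops user code from swapping builtins out at runtime:
    builtins.len = lambda obj: 42            # every len() call now returns 42

    original_build_class = builtins.__build_class__

    def traced_build_class(func, name, *bases, **kwargs):
        print("defining class:", name)       # runs on every class definition
        return original_build_class(func, name, *bases, **kwargs)

    builtins.__build_class__ = traced_build_class

    class Foo:                               # prints "defining class: Foo"
        pass

Any ahead-of-time assumption a compiler makes about len or class creation can be invalidated by code like this, which is exactly why the projects below need guards and runtime data.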
Let's look at some attempts from the past that tried to integrate LLVM with Python itself.

Unladen Swallow is a Google-sponsored fork of CPython that tried to build a JIT compiler for CPython itself. Its name comes from the Monty Python show. It tried to make Python faster: it started as a fork of CPython, and the plan was to eventually merge it into CPython's core, but some side effects prevented that. It was sponsored by Google, and then Google lost interest: it was a fork, so deployment was hard, Google didn't have much performance-critical code in Python, and Google's clients didn't want to allocate money and time for it.

Its features: it made Python 2.5 times faster on single-threaded code. Single-threaded, because they didn't solve the global interpreter lock problem. In Python, CPU-bound code runs one thread at a time; it doesn't matter if you have four threads, only one runs at a time because of reference counting issues. Unladen Swallow didn't solve that, but it made single-threaded Python code 2.5 times faster. Their aim was five times, but there were problems in reaching that aim. They didn't break any Python code, because they were based on CPython itself, so every C extension ran as usual. They analyzed Python code as it ran, so they knew whether you had changed the len function or not, and they made assumptions based on data collected at runtime. For example, if a function always received integers, they would say, okay, generate an intermediate representation based on that fact; but in case it ever received a string, there was a guard protecting against the assumption turning out false. I will sketch this guard idea in code below. It operated at the bytecode level instead of the AST; that was their choice.

There is a hot threshold in Unladen Swallow: compiling a function with LLVM has a cost, so only when a function has been called many times, and they know a lot about it, do they compile it with LLVM, and before that compilation they collect data about the function. With the help of that data, if they know a branch never executed, they just cut it out, and if they know all arguments are integers, they optimize for that. To ensure everything still works the same even when the arguments change, they insert guards, which have a cost when mispredictions happen. And they work by linking LLVM into the Python binary.

There are some cons besides this. One is memory usage: it increased by up to eight times. It's not always eight times, but it can get close. This is one of the costs of LLVM and the other libraries: they increase memory usage, and if you keep different versions of the same code in memory, it increases by design. They keep the native code, they keep the bytecode, they keep the AST, so memory grows. There are also some optimizer structures: they try to optimize Python bytecode before sending it to LLVM, so they build structures that carry CPython's internal knowledge, and those have a cost too. There are some cons about startup time: static linking, again, and when you want to compile a Python function you need to initialize LLVM, and C++ initialization takes some time, plus there is some overhead in the runtime routines. Binary size increased, as usual, because you link against LLVM.
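To make the guard idea concrete, here is a hypothetical Python sketch of type specialization with a guarded fallback; it illustrates the concept only and is not Unladen Swallow's actual machinery:

    def specialize(generic_fn, expected_type, fast_fn):
        # Run the specialized version while the runtime assumption
        # ("the argument is always an int") holds; otherwise fall back
        # to the generic path, the way a guard deoptimizes.
        def guarded(x):
            if type(x) is expected_type:     # the guard: a cheap type check
                return fast_fn(x)            # specialized (think: JIT-compiled) path
            return generic_fn(x)             # generic interpreter path
        return guarded

    square = specialize(lambda x: x ** 2, int, lambda x: x * x)
    print(square(3), square(2.5))            # 9 via fast path, 6.25 via fallback

The guard check itself is the cost the talk mentions: it runs on every call, and a misprediction means paying both for the check and for the slow path.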
Why did they reach 2.5 times but not five times? This project ran from 2009, and back then LLVM wasn't that good at just-in-time compilation, so they had some issues with it. Also, LLVM's JIT-compiled code didn't work well with GDB: if you are developing a C extension for Python, you probably want to debug it with GDB and profile it, and LLVM's JIT-compiled code didn't work well with those tools, so they allocated time to fix this, but lost some performance gains in exchange. The project lasted one year, and then Google lost interest. If the project had continued, in my opinion they could have reached the five-times goal easily on some code.

Some LLVM points about this: they initially selected LLVM because they didn't want to write assembly code generation for every target machine. LLVM made that easy, and its optimizations and code generation were great, but in 2009 it wasn't directly suitable for JITs: the cost of compilation was heavy, and without collecting runtime data it is hard to optimize. They started collecting runtime data after learning this, so they spent extra time on something they had first tried to do without.

The next project is PyLLVM, which is a compiler for a subset of Python that specializes in machine learning algorithms and computations. Mathematical code in Python executes slowly because there is a lot of dynamic overhead; if you compile it to LLVM you get a big gain, and that is what they tried to do. Their features: in some cases it is as fast as C code compiled with Clang. It only works for a subset of Python: if your function doesn't fit that subset, they directly return your Python function; if it fits, they return a wrapper around the JIT-compiled code. They don't collect runtime data, but they do static type inference and symbol table generation. Python has a symbol table module that comes with the standard library, but it doesn't give you anything about types, so you need to create your own symbol table that records and infers types, which is pretty cool. They used something called llvm-py, a wrapper library around LLVM's C++ API. They worked on Python's AST instead of bytecode, which is more suitable in my opinion. They try to convert the function body, and if they see something that can't be converted they raise an error; if they catch that error, they return the normal function, and if the error never appears, they return the JIT-compiled function. And they infer the types.

Let's see Pyston. Pyston is a JIT compiler and runtime for Python 2. Its runtime is mostly based on CPython's, so they gained something from that. They use something called tracing JITs, and they specialize in web applications, especially Dropbox's own; they managed to run Django's full test suite, as far as I remember. Their features: they were 95% faster than CPython in the latest release, in 2017. They have C API support because of the CPython-based runtime, so on a big codebase you can run Pyston with only small changes. They have more than three executors: LLVM is the highest tier, and there is a baseline JIT and an AST interpreter. If your function is called rarely, they interpret its AST; if it is called more than 25 times, they JIT it with the baseline JIT; and if it is called more than 1,000 times, they compile it with LLVM, as sketched below. They have AST-like control flow graphs, the AST interpreter, tiered JITs with tracing, something called inline caching with rewriting of the function's body, and the CPython-based runtime.
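Here is a small conceptual sketch of that tiered dispatch, using the call-count thresholds quoted above; the class and the tier callables are my stand-ins, not Pyston's actual implementation:

    class TieredFunction:
        # Illustrative tiered execution: interpret cold code, use a cheap
        # JIT for warm code, and the expensive LLVM tier only for hot code.
        def __init__(self, interp, baseline_jit, llvm_jit):
            self.calls = 0
            self.interp = interp
            self.baseline_jit = baseline_jit
            self.llvm_jit = llvm_jit

        def __call__(self, *args):
            self.calls += 1
            if self.calls > 1000:            # hot: highest tier (LLVM)
                return self.llvm_jit(*args)
            if self.calls > 25:              # warm: baseline JIT
                return self.baseline_jit(*args)
            return self.interp(*args)        # cold: AST interpreter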
So, those were the dead projects; there are also some from the present. Numba is an up-to-date project that uses LLVM to JIT: it is a JIT compiler specialized in NumPy and mathematical Python functions, and it knows how to compile these down to the assembly level to get the best performance. It works on a subset, and it works with type inference: if it knows which types a function receives, it JITs directly; if it doesn't know, it records runtime data. It is used with a simple Python decorator: you just add something called @jit and Numba automatically handles everything else. It supports NVIDIA CUDA, and AMD ROCm experimentally. It is almost as fast as C in some cases, which is pretty good. It works like this: it takes the Python source, uses the bytecode to generate IR, converts that bytecode into its own specialized intermediate representation, makes some assumptions based on type inference, and then generates LLVM IR and JITs afterwards. There are different modes in Numba; one is nopython mode, which means the function never receives Python objects, only integers and other primitive types, so if you use it you get more performance. The LLVM IR is generated through llvmlite, which provides the interface to LLVM, and you can optimize the IR it builds for the functions it finds.

Now for the last part: let's see how we can write our own LLVM compiler. This is our target function: it takes a and b, and it uses Python's annotation feature. We won't handle type inference; you need to specify the types. We are going to use the ast module, llvmlite (the same llvmlite that Numba uses), and the annotations. We will have a compile class method: if we can convert the function, we convert it; if we can't, we return the normal function, so it works on a subset of Python. It tries to get the function's source code; if it can't, it returns the normal function. If it can, it parses the source into an AST and tries to traverse that AST using Python's ast.NodeVisitor class. If any error happens during that traversal, it returns the normal function: whenever something doesn't fit our subset, we raise a compiler error, and if the compile method catches that error, it returns the plain function.

This is the AST of that function. It starts with a FunctionDef; there is an arguments part, a body part, a decorator list, and the return annotation. A Module is the highest level of any AST, and we are going to generate an LLVM module from Python's Module node and assign it to self.module. generic_visit is a function that traverses the body of a given node, so we give it the Module node and it visits every node that lives under the module.

This is the function visitor. First of all, we remove the decorator that is used to JIT our code, so we don't end up in an infinite loop. Then we build an argument spec from the annotations; there is a cast-to-LLVM helper which just checks an annotation and converts it to an LLVM IR type. Then we get the return type, so we can build an LLVM function type from it: it takes the return type and the types of the arguments. Then we use that function type to create the LLVM function, which lives under the module we created; it takes the function type and the name from the node. Then we append a basic block, which we use to get a builder positioned in that block, and we traverse the node's body one by one, using that builder to build our expressions.

The first expressions are constants. LLVM's types don't directly map to Python's, so we need to convert them: a Python integer doesn't have a specific bit length, so we use 32 bits as the default, and we use float, not double, for our float type.
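The slides' code doesn't survive in the transcript, so here is a condensed sketch of such a visitor using llvmlite's IR layer; JITVisitor, cast_llvm, and CompilerError are my stand-in names for the helpers described here, and the sketch also includes the name, binary-operation, and return visitors discussed next:

    import ast
    import inspect
    from llvmlite import ir

    class CompilerError(Exception):
        pass

    def cast_llvm(annotation):
        # Stand-in for the cast-to-LLVM helper: map annotation names to IR
        # types; 32-bit int by default, float (not double) for floats.
        types = {"int": ir.IntType(32), "float": ir.FloatType()}
        if annotation not in types:
            raise CompilerError("unsupported type: " + annotation)
        return types[annotation]

    class JITVisitor(ast.NodeVisitor):
        def visit_Module(self, node):
            self.module = ir.Module(name="jit")
            self.generic_visit(node)              # descend into the module body

        def visit_FunctionDef(self, node):
            node.decorator_list.clear()           # drop @jit so we don't recurse
            arg_types = [cast_llvm(a.annotation.id) for a in node.args.args]
            ret_type = cast_llvm(node.returns.id)
            fn_type = ir.FunctionType(ret_type, arg_types)
            func = ir.Function(self.module, fn_type, name=node.name)
            self.builder = ir.IRBuilder(func.append_basic_block(name="entry"))
            self.args = {a.arg: v for a, v in zip(node.args.args, func.args)}
            for stmt in node.body:
                self.visit(stmt)

        def visit_Return(self, node):
            self.builder.ret(self.visit(node.value))

        def visit_BinOp(self, node):
            ops = {ast.Add: self.builder.add, ast.Sub: self.builder.sub,
                   ast.Mult: self.builder.mul}
            if type(node.op) not in ops:
                raise CompilerError("unsupported operator")
            return ops[type(node.op)](self.visit(node.left), self.visit(node.right))

        def visit_Name(self, node):
            if node.id not in self.args:          # something global or unknown
                raise CompilerError("unknown name: " + node.id)
            return self.args[node.id]

        def visit_Constant(self, node):
            return ir.Constant(ir.IntType(32), node.value)

    # Usage sketch:
    #   tree = ast.parse(inspect.getsource(some_annotated_function))
    #   visitor = JITVisitor(); visitor.visit(tree); print(visitor.module)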
This is the cast-to-LLVM function: if the given type is integer, we return LLVM's int type; if it is float, we return the float type; and if it is a type we don't know, we raise the compiler error. Then there is the part that constructs the intermediate representation for constants: an LLVM constant built from the IR type and the node's value.

Then there is argument access. We visit Name nodes, and if we can fetch the name from the given function's argument spec, we fetch it and return the value; if we can't, we raise the compiler error, because it's something global or something we don't know. Then there are binary operations: we try to get the operator from the builder object; if we can't, we raise the compiler error because it's an operator we don't know; otherwise we emit the left side and the right side under that operator. The return operation is pretty simple: we just call the ret method of the builder with the given value, which is going to be a binary operation.

So we wrote a basic compiler: a compile class method, a compiler error for code that doesn't fit the subset, and visitors for module, function, return, binary operation, name, and constant.

The fun part is creating an execution engine. We take the default target triple that LLVM is configured for and create a target machine from it; when creating the target machine, you can specify the optimization level or anything else you want. Then we create the execution engine, which takes a backing module to bootstrap (we don't have anything to bootstrap, so we parse an empty assembly string) and the target machine, and we return the engine. This is the cheating part: we take the engine, parse our module's IR, and verify whether it is a valid intermediate representation or not. Then we add that module to our engine, finalize it, and run the static constructors. Then we get the function's address as a C function pointer and cast it to a ctypes CFUNCTYPE: Python offers a way to call C functions through ctypes, and CFUNCTYPE takes a type spec, which we give it here, so we cast our function pointer with it.
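Again, the slide code isn't in the transcript; a minimal sketch of that engine setup with llvmlite's binding layer and ctypes could look like this (the jit helper name and the int32-only signature are my assumptions):

    import ctypes
    import llvmlite.binding as llvm

    # One-time LLVM initialization for the host target.
    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    def create_execution_engine():
        # Default triple LLVM was configured for -> a target machine; the
        # optimization level etc. could be passed to create_target_machine().
        target = llvm.Target.from_default_triple()
        target_machine = target.create_target_machine()
        backing_module = llvm.parse_assembly("")   # nothing to bootstrap
        return llvm.create_mcjit_compiler(backing_module, target_machine)

    def jit(engine, llvm_ir, name, restype, *argtypes):
        mod = llvm.parse_assembly(llvm_ir)         # textual IR -> module
        mod.verify()                               # check the IR is valid
        engine.add_module(mod)
        engine.finalize_object()
        engine.run_static_constructors()
        address = engine.get_function_address(name)
        # Cast the raw function pointer to a ctypes callable.
        return ctypes.CFUNCTYPE(restype, *argtypes)(address)

With the visitor sketched earlier, something like jit(create_execution_engine(), str(visitor.module), "add", ctypes.c_int32, ctypes.c_int32, ctypes.c_int32) would hand back a directly callable native function, assuming the compiled function is named add.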
We did this with the AST; can we do the same thing with bytecode? Yes. We have a similar decorator for our function, and the same compile part. We keep a stack, because Python's virtual machine is a stack-based bytecode machine, and we walk the function's opcodes one by one. If an opcode is something we know, we operate on the stack; if it isn't, we break out of the loop and just return the normal function. Loading names is simple: if we see a LOAD_FAST opcode, we push the corresponding argument onto the stack, and if we see a LOAD_CONST, we cast it to an LLVM constant and push it. If we see a binary operation, we pop two things from the stack and build an operation from them, and if we see a RETURN_VALUE, we return. That is the whole code to construct the IR.

This is it. Thank you for listening. Are there any questions?

You are asking what is different in my JIT compilers compared to the projects I presented. Okay, let me go back to the outline to list the other projects. There were four projects. Unladen Swallow used bytecode, the same bytecode I am using in the second approach, so Unladen Swallow's approach and mine are almost the same; PyLLVM used the AST, like my first approach. Pyston and Numba also use bytecode. So mine isn't different from the others, but I tried to simplify the process in my examples, in both ways.

Audience: There is a big difference compared to the other approaches: you are not taking care of type inference, which makes things easier. But when you annotate a variable with int, it means a Python int, which is actually a big int, and you are casting it to 32 bits in the IR, which is a totally different thing.

I mentioned that LLVM's types don't directly fit Python's. For the sake of simplicity I cast to a 32-bit integer, but in a real-world application it would be nonsense to cast to a 32-bit int. Python's typing doesn't allow you to specify integer widths; there are Literal types, but there is nothing in type annotations to specify the bit length of an integer, so it would be hard to express. My compiler gets around a 20% speedup; I don't do any optimizations or anything extra, I just convert the Python directly to LLVM IR, and it gets a 20% speedup. For the sake of the presentation I wanted to keep it simple.

You are asking why I am using JITs when the types are already defined by annotations, since I could just compile the functions ahead of time. In the approach taken by projects like Unladen Swallow, they record runtime data. I wanted to keep that in my slides, but it didn't fit; I have a project called swinging head which does the same thing as Unladen Swallow and records runtime data. So you are right, it would make sense not to run a JIT here, but if you are going to record runtime data or do other things, you need a JIT. To me, the AST is much simpler to work on, because you don't need the extra bootstrapping code and Python offers an interface to access it; but I wanted to show how Unladen Swallow and Pyston worked, so the bytecode version is there as an example.

Audience: You were using the AST of the functions. That's true for the standard library, but it's also true for third-party packages, and it presents a very strong barrier to the optimizer. Did you take that into account, or when you see a call to a native function, do you just call it the Python way and get back the result?

You are asking if I optimize native functions by bootstrapping some AST. Numba does that, but my JIT compiler doesn't.

Thank you.