Next presenter is Anna Herlihy. She's an engineer at MongoDB. Thank you. Can you hear me? So my name is Anna Herlihy, and I'm going to talk about PyLLVM, which is a small compiler from a subset of Python to LLVM IR that was built as part of a larger project while I was a student. So I'm not going to get up here and tell you that this is the best little compiler you've ever seen and that you should all be using it in production. But I will tell you that it is a super cool project that was built in a true open source capacity: basically, I found someone's code that had been abandoned about five years earlier and breathed new life into it. So if there's a moral to this talk beyond the technical content, it is that you have to be careful with your open source projects. You may think they are dead, but someone like me will find them in the Google Code archive and you will once again have to answer emails. A little bit about me: I am a software engineer for MongoDB in Stockholm, Sweden. I work on a couple of different open source Python and MongoDB libraries, but this is actually work I did before I was employed by MongoDB. So first, a bit about what you're in for. I'm going to talk about the motivation behind this project and the very specific requirements we needed to meet. I'm going to go through the features, talk a little about related work, mostly Numba, which is a specializing just-in-time compiler from Continuum Analytics, and then do some analysis and benchmarking before concluding. So how did I get involved in this project? Well, Tupleware is a distributed analytical framework built at Brown University. It's for users who want to run analysis on large data sets and have their algorithms automatically deployed on a distributed cluster. I'm not going to go too deep into the details of how Tupleware actually works, because that is an entirely different project done by entirely different people.
But what I worked on was integrating Python with the Tupleware front end. There are plenty of distributed analytical frameworks out there, but what makes Tupleware unique is that it aims to be both language and platform independent. So what does that mean? Language and platform independence means we want you to be able to write your algorithms in any language you want. Say you have an engineer who is convinced they want to write their algorithms in Fortran: we want to support that. And if they have weird CPUs in their clusters, we also want to support that. So how do we do it? We do it with LLVM. LLVM is a lot of different things. Officially it is a collection of compiler and toolchain technologies, but what we care about is LLVM IR, which is an intermediate representation, which basically means it is a way of representing code in between the source language and the machine-dependent assembly code. So as you can see in my diagram, we have all these different languages that are all really popular, and they can all be compiled down into LLVM IR (the dragon is the LLVM logo), and that in turn gets compiled down for a bunch of different CPUs. So a huge benefit for Tupleware is that there are already a ton of existing compilers from popular languages to LLVM IR. The goal of this project was basically to prove that you could tightly integrate a high-level language such as Python with Tupleware, which is written almost entirely in C++. Usually I would go into why Python, but since we are all here, I'm going to assume we are all properly indoctrinated. So how does it fit in with the Tupleware project itself? You have the blue boxes, which are what the user provides. You need to provide an algorithm that you want to run on your data, and you need to provide a workflow, so map, reduce, combine, et cetera.
The yellow boxes are what I worked on. First, we needed to be able to call the backend operators from Python, so we used Boost for that, which was pretty straightforward since we were already using Boost for a bunch of other things in Tupleware. Then we needed to compile our UDFs, which stands for user-defined functions, into LLVM IR. This is, in my opinion, the more interesting part and also the more difficult part. I wanted to give you an idea of what sample Tupleware usage would look like. You have a linear regression algorithm here. Don't pay too much attention, there's probably a typo in there. But the idea is that if you take a look at the actual code that's being compiled, you can get a pretty good sense of what subset of Python you would need to support. You basically don't need super fancy Python functionality like decorators, list comprehensions, et cetera. What Tupleware is used for is machine learning algorithms, which tend to be very easily optimizable mathematical functions. So I chose to focus first on primarily statically typed code. And then in the red box, you have how you would use the Tupleware package, pretty much the simplest workflow you could come up with: first you load your data, then you map your algorithm, and then you call execute. One level down, if we look at the Tupleware library itself, the first thing you do is try to compile the UDF using PyLLVM. If that fails, we fall back to a backup method that basically leverages the Python C API to produce an executable. The idea is that we don't want the user to have to worry about whether or not the code in their UDF is technically covered by PyLLVM. We just want them to feel like it's all covered, but most of the time we want to be able to use our primary compiler. If the backup method also fails, then you have bad Python in your UDF, and so we raise an error. So, PyLLVM itself.
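In code, that try-the-primary-compiler-then-fall-back flow might look something like this. This is a minimal sketch: the function names, the error type, and the toy "lambdas are unsupported" rule are all stand-ins, not Tupleware's actual API.

```python
import ast

class PyLLVMError(Exception):
    """Raised when a UDF uses syntax the primary compiler doesn't cover."""

def pyllvm_compile(source):
    # Stand-in for the PyLLVM front end: parse, then reject anything
    # outside a toy subset (here we pretend lambdas are unsupported).
    try:
        tree = ast.parse(source)
    except SyntaxError:
        raise PyLLVMError("not parseable")
    if any(isinstance(n, ast.Lambda) for n in ast.walk(tree)):
        raise PyLLVMError("unsupported syntax: lambda")
    return ("llvm-ir", source)

def cpython_fallback(source):
    # Stand-in for the Python C API backup path. A genuine syntax
    # error surfaces here, so truly bad Python still fails loudly.
    compile(source, "<udf>", "exec")
    return ("cpython", source)

def compile_udf(source):
    try:
        return pyllvm_compile(source)       # primary compiler
    except PyLLVMError:
        return cpython_fallback(source)     # backup method
```

The point of the structure is exactly what the talk describes: the user never sees which path ran unless their UDF is not valid Python at all.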
When I first found this project, like any good engineer, I first googled to see if hopefully somebody had already done the work for me. What I found was a project called py2llvm, which was an unfinished proof of concept that basically showed you could take Python syntax and convert it into LLVM IR using Python. py2llvm uses llvmpy. It can get confusing because there are only so many ways you can combine "Python" and "LLVM", but llvmpy is a wrapper around the C++ IR builder tools. It's basically just a way to call the LLVM tools from Python. What's cool about this project for me is that when I found it, which was about two and a half years ago, it was pretty much the only thing out there. And when I was writing these slides not so long ago, I did some more research, and it turns out that a bunch of other people have since forked the same repo and taken it in totally different directions. As an engineer, it's very satisfying to see how other people have solved the problems that I also tried to solve. One second. So like I said before, we really wanted to be able to anticipate the common requirements of a Tupleware user, so we decided to focus on easily optimized mathematical functions. As I go through each feature of PyLLVM, I'm going to talk about the design decisions we made in the interest of simplicity, because someone was going to take over this project after I left and needed to be able to read what I had written; feasibility, because we had deadlines, mostly that I needed to graduate; and usability, which I think I might have already said, but basically we want our user to be able to write Python comfortably and not have to worry too much about the syntax. So this is an overview of what the compiler's design looks like. This is where the compiler experts in the audience might get a little bored, because this is all very standard.
So if you want to compile code, your first step is to parse it and build an abstract syntax tree, which is basically a tree representation of what the code is doing. The parsing is handled by Python's compiler package, which was very nice for me. Syntactic errors are caught in this parsing and AST-building stage. So, for example, say you were coding and your cat walked across your keyboard. That would be a syntactic error. Unless you have an especially adept cat. The next step is semantic analysis. This is where we actually dig into what the code is doing. We use a visitor to visit all the nodes of the AST, and we call llvmpy for the actual code generation. This is where semantic errors are caught. Something really important about LLVM IR and other intermediate representations is that they are in SSA form, which stands for static single assignment. Basically, it means that each register can be assigned to exactly once, and after that it is frozen. So if we wanted to implement our Python based on registers alone, we would have to ask our users to write their Python in SSA form, which is totally not reasonable. So what do we do? We use a symbol table. Symbols are basically variables: they contain the memory location where the data is stored, the name, the type, and some other metadata. We keep track of our symbols using a symbol table, which we implement as a stack to make scoping much easier for us. This part was implemented before I started contributing. LLVM IR provides numerical types as well as five derived types, and I'm going to talk about how I represented the various Python types as LLVM IR types. Since LLVM IR is a statically typed representation and Python is not, it is necessary to be able to infer the types of expressions before we actually have to evaluate them.
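To make that concrete, here is a minimal sketch of what such an inference pass can look like. The talk's compiler was built on Python 2's compiler package; this sketch uses the modern stdlib ast module instead, and the type names, the flat symbol-table dict, and the intrinsics list are all illustrative, not PyLLVM's real data structures.

```python
import ast

# Intrinsic functions built into the language (illustrative subset).
INTRINSICS = {"sqrt": "double", "abs": "double"}

def infer_type(node, symbols):
    """Recursively descend an expression node until something
    reveals its type."""
    if isinstance(node, ast.Constant):      # leaf: a literal constant
        return {bool: "bool", int: "int", float: "double"}[type(node.value)]
    if isinstance(node, ast.Name):          # variable: look it up
        return symbols[node.id]
    if isinstance(node, ast.Call):          # call: intrinsic or symbol table
        return INTRINSICS.get(node.func.id) or symbols[node.func.id]
    if isinstance(node, ast.BinOp):         # e.g. x + 1.0: widen to double
        left = infer_type(node.left, symbols)
        right = infer_type(node.right, symbols)
        return "double" if "double" in (left, right) else left
    raise TypeError("cannot infer type of %r" % node)

def infer_expr(src, symbols):
    return infer_type(ast.parse(src, mode="eval").body, symbols)
```

The shape matches what the talk describes next: constants are immediate, variables and calls go through the symbol table, and intrinsics are checked against a built-in list.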
So to do this, I used the infer_type function, which was partially implemented before I started contributing. The idea is that, given a node in the AST, you recursively descend until you find something that tells you what the type of the expression will be. Say you reach a leaf node and it's a constant: that's pretty straightforward, you already know the type. If you have a variable, you have to look it up in the symbol table. If you have a function call, you also look it up in the symbol table, or check it against the list of intrinsic functions that are built into the language. Numerical values are pretty straightforward: integers are integers. I added booleans. Nothing too exciting there. Next up is vectors. We provide support for a four-element immutable floating-point vector type, which was implemented before I started and was also a huge part of why I chose to build my project on this one. These vectors, which are based on the LLVM IR vector type and are passed by value, are super common in machine learning algorithms. They come up time and time again, so having this implemented for me was a huge benefit. Next up is lists. Lists are a bit of a work in progress, because implementing efficient dynamically resizable lists is an open problem. Another issue with lists, as anyone who has ever programmed C before knows: if you create a list on the stack and then return it from a function, the memory it points to immediately goes out of scope. So how do we deal with this? Well, I chose to just move it onto the heap and then be very careful about when it gets freed. But garbage collection and reference counting are way beyond the scope of this compiler, though they could potentially be added one day. This is one of those places where we went deeper and deeper into the problem before stepping back and realizing this was not a hill we wanted to die on. Next up is strings.
So strings, luckily enough, are basically lists of characters, and characters are basically integers, so we got this one for free. Next up is functions. Unfortunately, Python functions don't usually indicate what type they return, if anything. But we have to be able to determine the type of a function definition before we actually generate the code. So this is one of the places where we do two passes: one pass to determine the return type of the function, and then one pass to actually generate the code once we know what the function signature will be. Function arguments have the same issue, except it's actually much harder to infer the types of function arguments given only a function definition. So this is one of the places where I did what is, in my opinion, kind of a dirty hack: I took advantage of the default-argument syntax to let users indicate what types they are passing in as parameters. The chance that somebody writing an ML algorithm doesn't know what type their parameters are, or needs a dynamic type, is pretty low. If type hints had been a thing when I wrote this, maybe I would have considered trying to use those, but they were not. Next up is intrinsic functions. A huge convenience when writing algorithms is not having to import separate math libraries. If you can call square root, absolute value, et cetera, without having to do any importing, that's really nice. The LLVM IR builder actually provides functions for square root and other things like that, but unfortunately, because they are newer features, and llvmpy, the wrapper around the C++ tools, is an older project that is no longer maintained, they're not accessible. So we could either rewrite the whole library, or we could just try to get around it.
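Going back to that default-argument trick for a second, a minimal version of it might look like this. This is a sketch, not PyLLVM's actual implementation: the helper name and type names are hypothetical, and it simply reads each parameter's declared type off its default value.

```python
import ast

TYPE_NAMES = {bool: "bool", int: "int", float: "double"}

def argument_types(src):
    """Map each parameter name to a type, taken from its default value.
    E.g. `def f(x=0.0, n=0)` declares x as double and n as int."""
    fn = ast.parse(src).body[0]
    args = fn.args.args
    defaults = fn.args.defaults
    if len(defaults) != len(args):
        # Every parameter needs a default to declare its type.
        raise TypeError("parameter is missing a type-declaring default")
    return {a.arg: TYPE_NAMES[type(d.value)]
            for a, d in zip(args, defaults)}
```

So a user writes ordinary-looking Python, and the compiler gets a complete, static signature before code generation starts.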
So what I did to get around this problem was basically to declare, in a header, the calls to these C++ backend functions and add them to the symbol table; at link time those calls get resolved, so we can call functions we don't directly have access to. This works for the math functions, and it works the same way for print. Getting print working was probably one of the better feelings I have ever had, since I had already spent so much time working on this compiler, and I could tell whether my executables were segfaulting, but I had no idea whether they were doing what I expected. So when my mystery executables started printing stuff to standard out, stuff that indicated they were working, it felt really, really good. Next, branching and loops. LLVM IR has compare and jump instructions as well as relatively straightforward ways to define blocks of code. There are some idiosyncrasies; if you're curious, I recommend you look up what phi nodes are, but I'm not going to go into detail because they can be tricky. So, related work. The obvious comparison is with Numba, which is a just-in-time compiler from Continuum Analytics that also takes code from Python to LLVM IR. The thing about Numba is that its primary purpose is not to generate LLVM IR; it is to run your Python code really, really fast. Numba actually moves the Python code through six different stages, and LLVM IR is only one of those stages, so you have to do a bit of gymnastics to pull it out at that point. Another big difference is that Numba is a JIT, so it's lazy: it doesn't actually compile code unless you run the code, which makes benchmarking and comparing PyLLVM with Numba quite tricky, because you can't run code with PyLLVM, you can only compile it. So the bottom line is that Numba is a much, much more mature and comprehensive project. Like I said, I would not recommend you use PyLLVM in production, but I would totally recommend you use Numba.
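To make that header trick a bit more concrete, here is a rough illustration of the shape of the IR involved, built as strings in Python. The IR text below is hand-written for this example, not actual PyLLVM output; the wrapper name is made up, and the extra level of indirection it introduces is the one that matters for the K-Means result later.

```python
def declare_intrinsic(name, ret="double", args=("double",)):
    """Emit a `declare` for a function defined elsewhere (e.g. in the
    C++ backend); the linker resolves it later."""
    return "declare {} @{}({})".format(ret, name, ", ".join(args))

def call_via_wrapper(value):
    # PyLLVM-style: call a header wrapper, which in turn calls sqrt.
    return [
        declare_intrinsic("my_sqrt"),   # wrapper declared in the header
        "%r = call double @my_sqrt(double {})".format(value),
    ]

def call_direct(value):
    # Clang-style: call the sqrt intrinsic directly, one level less
    # indirection.
    return [
        declare_intrinsic("llvm.sqrt.f64"),
        "%r = call double @llvm.sqrt.f64(double {})".format(value),
    ]
```

The hope was that the optimizer would collapse the wrapper call down to the direct form; as the benchmarks later show, that did not always happen.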
Another huge benefit of writing our own compiler is that it was built in-house. So basically, I needed something to work on, and I had always wanted to write an LLVM compiler. For analysis: since, before I started, the project was entirely C++ and the user had to write their UDFs in C++, what I wanted to make sure of was that by adding this Python front end, I didn't make things harder on the user. So I picked four sample algorithms and looked at what the resulting usage would look like in C++ versus Python. As you can see, it's not very different, which is actually a good thing. The huge benefit of using Python is not that you would expect your algorithms to be shorter; it's that you're freed from worrying about memory management and pointers and things like that. For the actual benchmarking, we wanted to compare PyLLVM and Numba, but unfortunately, because of the JIT thing, it's complicated. Instead of going into the data, since I would have to do a lot of explaining, I'm just going to tell you that Numba is slightly faster and PyLLVM is slightly slower, but only by two or three times; it's not an order of magnitude. That's important because, compared to the cost of analyzing the workflow, deploying to the distributed cluster, and actually doing the analysis, the compilation from Python to LLVM IR happens once, so it's very, very small relative to everything else. As long as we're not an order of magnitude slower, we're fine. The other thing I wanted to compare was the actual generated LLVM IR from Clang, which is what we were using before, against PyLLVM's. Something else about the way that LLVM works is that you generate unoptimized LLVM IR before passing it into the backend, and those incredibly powerful optimization passes happen before it's converted into machine code.
So the idea is that no matter who generates the LLVM IR, the surface-level differences should pretty much be wiped out by those optimization passes. So we used those four sample algorithms, compared the runtimes, and the results are here. The blue is PyLLVM, the red is Clang, and the y-axis is time. As you can see, pretty much across the board, PyLLVM is a bit slower than Clang, which is actually pretty good considering Clang is a huge project that is very mature and has a lot of resources. The actual results range from quite slow to not that bad, but K-Means has a particularly large spike. So I wanted to figure out why K-Means was so much slower. It turns out that my ingenious header hack probably cost us the performance: I had assumed that calling these header functions would get optimized away, but it looks like the optimization passes weren't actually able to get rid of that one level of indirection. So every time you call square root, instead of just calling square root like you would in Clang, you call a function which then calls square root. That's a problem, because we can't rewrite llvmpy, and we can't contribute to it since the project is long dead. So there's not much to do there except maybe pull out llvmpy entirely, which is a big undertaking, and since I don't actively work on this project anymore, if one of you wants to do it, that's awesome and I totally encourage you. So lastly, to conclude: overall, we were pretty successful in proving that you could tightly integrate Python with this distributed analytical framework. There is still a ton of future work to be done, and like I mentioned, I'm not doing it right now; I'm currently working on other MongoDB stuff. But if anybody is interested in contributing, I really encourage it. Also, if anybody is curious about a relatively simple compiler, I recommend you look at the source code, because I find it pretty easy to follow.
I have a lot of people to thank. Thank you to Tim, who was my advisor, and to Alex, Andrew, and Kayhan, the grad students who wrote Tupleware. Thank you to the people who wrote py2llvm; I hope you don't mind that I've been mucking around in your code. And thank you to all my mentors at MongoDB for encouraging me to go out and give talks. So I have some time for questions, but I'm also going to make a brief and somewhat entirely unrelated announcement. I'm currently working on a library for reading data directly from BSON, which is the MongoDB data format, into NumPy. So if you use NumPy and MongoDB, or one of them, or neither of them, or you just want to tell me about something, I would really appreciate any feedback. So please come talk to me or tweet at me. Thank you.

Hi, so thanks for the talk. I just wanted to mention that in Python 2.7 you do have type hinting, although in docstrings. Cool. I totally missed that two and a half years ago when I started on this project, but maybe it's not too late to add it. Yeah.

You said strings were lists of characters. Does that make them mutable in your LLVM implementation? Pretty much, yeah. The only time a real distinction comes up between strings and lists is in the print function, so we know whether to print characters or integers.

Hey, thanks for the talk. I was very curious, did you try benchmarking against the scikit-learn implementations of the same models? No, I didn't. I looked into it, but I was immediately overwhelmed by just how large the project is, and pulling out the unoptimized LLVM IR from its path seemed very hard for me, so I eventually just decided to focus on Numba instead.

Did you consider using the llvmlite library the Numba folks have written for generating IR code?
Yes, that's on the list of things we want to replace llvmpy with, but when I picked up this project it was already using llvmpy and not llvmlite, so I figured that it would take a significant amount of time to swap out the existing IR builder bindings, and I decided my time would be better spent adding features. Okay. More questions? Okay, thank you very much. Thank you.