Stefan and I, together with the rest of the team, started creating Julia in 2009. It's great to be here, thank you all for coming, and thanks also to Shashi for suggesting the topic for my talk. I'm going to talk about why Julia is fast. People talk about performance in relation to Julia all the time, so I'm going to try to delve into that a bit.

So ultimately, how does performance work? I think the speed of everything is limited by the laws of physics. There's a light-speed limit, right? For your car there's some theoretical maximum speed dictated by physics, and likewise for your CPU there's a theoretical maximum rate at which it can do operations. So when talking about performance, I think it's much easier to ask, instead of "why is it fast?", "why would it be slow?" Why aren't we just getting the maximum speed we can get? Why doesn't everything go at light speed? How come I can't drive 100 miles an hour all the time? I think that's an easier way to approach the question.

So what causes this kind of slowness? My answer is going to be: various forms of uncertainty. Not knowing what's going to happen, having to respond to many possibilities rather than just doing the one thing you want to do, having to deal with many irrelevant contingencies, that's what really slows you down. As an example, here's something that's slow. This is really, really slow if you're sitting in it. And why? Because if you're sitting in one of those cars, you don't know what's going to happen in front of you. The person in front of you could just stop, someone could run into the street, someone could turn. All of these things could happen at any time, so you have to proceed very carefully and check for all of them constantly. That's why this is slow. And I think the exact same thing happens in programming languages: instead of just doing the one thing you want to do, systems often end up having to check for lots of possibilities all the time.

The extreme case of that in programming is running in an interpreter. An interpreter is a program that takes a program as its input, along with the input to that program, and runs the program as it reads it, feeding it the input data as it goes. In this kind of execution, you're constantly re-checking the program. As it goes through a loop and repeats it each time, it re-reads the code every time: maybe this time around the loop something different will happen, maybe the program will have changed. With i equals 10? No, it's still the same program. With i equals 11, maybe it's going to be different? No, still the same program. But it has to waste its time re-interpreting and re-figuring out what the code should do. This is the maximum dynamism you could get: you could change anything about the program at any time. But really, how often do you need that? You don't change your program all that often. You write your program and then you run it, and while it's running, it's not changing. This does have some uses; for certain kinds of interaction or debugging you might want something like this.
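To make that concrete, here is a minimal, hypothetical sketch of a tree-walking interpreter written in Julia (the names and structure are my illustration, not any real implementation): every single call re-walks the expression tree, even though the program never changes between iterations.

    # A toy tree-walking interpreter: each evaluation re-inspects the code,
    # even though the program itself never changes.
    evaluate(x::Number, env) = x
    evaluate(s::Symbol, env) = env[s]
    function evaluate(e::Expr, env)
        e.head === :call || error("unsupported expression")
        f = evaluate(e.args[1], env)                     # re-look-up the function each time
        args = [evaluate(a, env) for a in e.args[2:end]]
        return f(args...)                                # re-dispatch on the values each time
    end

    env = Dict{Symbol,Any}(:x => 10, :+ => +)
    evaluate(:(x + 1), env)    # 11, after walking the tree all over again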
But in general you don't need your program to be changing constantly; at some point it's fixed and you just want to run it. So that's the most extreme example of unnecessary uncertainty about what's going to happen at runtime. There are also more fine-grained sources of uncertainty in a program as it runs. A variable might start out with a value of one type, and then based on some condition you might change its value to something else. The value is certainly going to change, but in a dynamic language like Python or Julia, even the type of the variable can change: it could be an integer, and then under some condition I change it to a string. And if I call a function, the system might not know what that function does, so it doesn't know what it could return. Maybe now it returns an integer, but maybe the next time I call it, it will return something else. So there's uncertainty from that too.

A lot of uncertainty also comes from manipulating data structures. If I'm looking at a data structure, well, maybe somebody else is also looking at it. I look at it once, but while I'm doing something else, they might change it. So the next time I look at it, I have to check it again and reload information from it, even though it might not have actually changed; you have to allow for the possibility that it could have. That's a source of overhead. So for all these kinds of variations and updates, the language, and the compiler in particular, has to reason about when they can happen, to try to cut out this kind of overhead.

But here's what's interesting. There are a couple of what I consider important informal results in the world of programming about what actually happens. A lot of people have studied dynamic-language programs, in Ruby and Python for example; there are a lot of papers studying their behavior, and the general finding is that these programs are not as dynamic as people think. There's a lot of regularity. Even though the system is designed to allow all kinds of runtime variation, that variation doesn't actually happen; in reality, people process the same types of things over and over again. Even when people use the more dynamic features, those features tend to occur in the code but don't necessarily execute often at runtime. And if you think about an inner loop, a performance-critical loop, it tends to be repetitive. It's hard to even say that in a way that doesn't sound like a tautology: of course it's repetitive, it's the inner loop. A performance-critical, high-iteration-count loop is almost certainly doing something very similar on each iteration. That's just a very common pattern. So there really isn't all that much variation at runtime in running programs, and we can exploit that. There are now a lot of systems that run dynamic-language programs very, very fast; JavaScript JITs are an especially good example, and they basically exploit these findings. There's a lot of spurious uncertainty: in general you'd think you can't predict what's going to happen, but in reality you really can.
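Here is what that fine-grained type uncertainty looks like in Julia, as a minimal sketch of my own: the same variable holds different types depending on a runtime condition, so everything downstream has to allow for both.

    # x starts as an Int but may become a String at runtime, so nothing
    # downstream can assume one fixed type for it.
    function unstable(flag)
        x = 1              # here x is an Int
        if flag
            x = "one"      # same variable, now a String
        end
        return x           # inferred result: Union{Int, String}
    end

    # @code_warntype unstable(true)   # highlights the Union-typed result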
For example, one thing I learned about yesterday: when you're driving down the street, there really isn't any reason to slow down, because the pedestrian is going to get out of the way. If you value your life, you get out of the way, right? So why would I slow down? Of course they're going to move. That's how it works in traffic, and the same thing can happen in programs. A lot of these systems say: okay, in general I might not know what's going to happen in this loop, but let's just assume the best case, assume it's going to be really easy and repetitive, generate code for that, and run it, and then back out and slam on the brakes at the last minute if that assumption turns out to be wrong.

So that's the first category, and you can get really big speedups from doing that; I'll show an example. But it can only take you so far. To get to the next level and get even more performance, you have to try to remove some more uncertainty. This has played out recently in JavaScript execution engines, and JavaScript is a really dramatic example. It's a very, very dynamic language. It has this very flexible object model where every object can have any fields, and you can change the fields in any way you want at any time. It doesn't even have classes; you can just stuff anything into any object at any point. It's super dynamic, so it's a very hard case for this. But nevertheless, look at the times on this graph. The blue and red bars are the state-of-the-art JavaScript JITs, which use these optimistic, speculative optimizations, and indeed you get really good performance from them. For comparison, the little yellow-orange bar is the time for highly optimized C code to run these benchmarks. JavaScript with these optimizations gets within a factor of 5 to 20 of more or less the fastest code you can get, which is actually really good performance considering what's happening. But that's not quite enough, right? We don't want to be 5x slower, we want to be 1x slower. So there's been this project that takes it to the next level, this asm.js thing; that's the green bar. They sprinkled some extra magic dust and got it down to within a factor of 2 of C. And once you get within a factor of 2, I think a lot of other factors come into play; at that point, details of instruction selection and instruction scheduling alone can account for a factor of 2, so there might not be anything left for the compiler to do except routine optimizations.

So how did they get from 5-20x slower down to just 2x slower? What they did was write "| 0" on lots of expressions in the JavaScript program. That's a very clever hack that exploits the fact that the bitwise-or operator is an integer operator that only produces 32-bit integers in JavaScript, so or-ing with zero is basically a way to force every result to be an int32. In a small example it doesn't look too bad, but in a bigger piece of code there's just "| 0" on everything; you have to put it everywhere. But when you do that, and you write your JavaScript engine to recognize and exploit it correctly, you can get very fast register-sized operations out of all of these. So they do that, and they also add typed arrays.
Typed arrays are just raw arrays of bytes that act like memory. And once you have those two things, you basically have what I would call a C machine: effectively a processor that can run a language like C, where you have an array that's the memory, you have 32-bit registers, you have arithmetic, and you can do whatever you want. And indeed, once they had that in JavaScript, they compiled huge programs to it, even 3D games, and they run with great performance. They have these amazing demos of 3D games running in the browser using this.

So this is kind of surprising. If you take a step back, you kind of go: what is going on here? We had this very high-level, object-oriented language, and suddenly we seem to have thrown it all away; now we just have 32-bit integers and byte arrays, and everybody loves it, because now we can have 3D games and all this crazy stuff in the browser. How is it a revelation to have 32-bit integers? Why does the whole world go crazy? What's going on? I think what this really means is that it's important to have efficient abstractions. You have to have something very efficient in the language, and this is a dramatic example of that: just adding the few efficient building blocks you need changes everything. Once you have those efficient pieces, you can build the higher-level parts on top of them, rather than trying to start with something very high-level.

So in Julia, essentially, our approach is to have a much more general version of that "| 0" thing. We just generalize it. Historically it didn't happen in that order, and we weren't directly influenced by this, but that's kind of what's happening. So what does the more general version look like? First of all, instead of just "| 0", which implicitly means int32, we want a whole vocabulary of data types available, types in the sense of what size and kind of number you have: a general, combinatorial vocabulary of them. Once you have that, you can have assertions and conversions: you can declare that something has to be of a certain type, or convert something to a certain type. Where they have "| 0", we would just say convert to Int32, and clearly any other type could go where the Int32 goes. The other thing you can do is have typed storage locations: a location in an array or an object that has a type attached to it, so that everything stored into it gets converted to that type, and any time you load something out of it, you know what type it's going to be. That makes programs much more predictable. And the last ingredient is to move a lot of program behavior into type-based dispatch. Julia is based on multiple dispatch; everything is a generic function. When you write your libraries, you write multiple definitions for every function, and you say what all the argument types have to be for a certain definition to apply. That's very flexible, because you can talk about the types of all the arguments, and the types have this combinatorial, nested structure that's very expressive. So it lets you move a fair amount of program behavior into a regular, type-based system that's very easy to statically analyze.
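Here is a small Julia sketch of those three ingredients together (my own illustration; the Pixel type and the brightness function are made-up names):

    # 1. Conversions and assertions: the general version of asm.js's "| 0".
    x = convert(Int32, 7)          # like 7 | 0, but any type can go here
    y = (x + Int32(1))::Int32      # a type assertion: errors if the type is wrong

    # 2. A typed storage location: stores convert, loads have a known type.
    mutable struct Pixel
        intensity::Int16
    end
    p = Pixel(0)
    p.intensity = 200              # converted to Int16 on store
    p.intensity                    # always loads as an Int16

    # 3. Behavior expressed as type-based dispatch instead of explicit branches.
    brightness(v::Int16)   = v / typemax(Int16)
    brightness(v::Float64) = v
    brightness(p.intensity)        # dispatch picks the Int16 method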
Doing things that way, instead of with a lot of explicit logic and branches, basically makes program behavior more predictable to a compiler. One of the Julia contributors, Oscar Blumberg, summed this up nicely once: it's designed to make it easy to write programs that are easy to statically analyze. That's really what we're about. Julia right now does not actually have a lot of the really fancy just-in-time compiler optimizations that a JavaScript engine has. If you write JavaScript-style code in Julia, we will run it much slower than the V8 JavaScript engine does. Our answer is just to say: put a few type declarations in a couple of places and you get all the performance, so you might as well do that. But the really neat thing is that, using type inference, we don't require you to put types everywhere. Essentially, somebody writes a type somewhere once, and it propagates through the system. I can start by saying, give me an Int32, and then everywhere I pass that value, whether to the plus operator or to some function that makes arrays, the type flows to all of those places. Now the compiler knows all of that code is working on Int32s; it propagates automatically, so you don't need to write the type again everywhere. The compiler can do that for you. But at some point someone has to write the type; that's the key.

As a bit more detail on how this plays out: there have traditionally been two ways to represent data structures. There's the dynamic-language way, which is what was done in Lisp and in core Python (not NumPy, of course), where everything is a pointer. That gives you the most dynamism possible, since every pointer can point to anything at any time; you have this ultimate flexibility. Then a language like C has much more sophisticated control over memory layout. It's more complicated in a way, because you have to talk about where all the bits and bytes are in great detail, but all of that is mostly handled at compile time. So you have a more complicated type system and more work at compile time, but a faster runtime. For a long time this looked like an inherent dichotomy, exactly part of the two-language dichotomy that Stefan talked about in his talk: depending on whether you're in a systems language or a scripting language, you pick which one to use.
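To make the inference point above concrete, here is a minimal sketch in which the Int32 annotation on the argument is the only type ever written, and it flows through the entire computation:

    # The annotation on n is the only explicit type; inference propagates it
    # to s, to the loop counter i, and to the return type.
    function sumto(n::Int32)
        s = zero(n)            # inferred Int32
        for i in one(n):n      # inferred Int32 counter
            s += i             # Int32 + Int32 stays Int32
        end
        return s               # inferred return type: Int32
    end

    Base.return_types(sumto, (Int32,))   # [Int32]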
But I think an interesting piece of wisdom has accumulated gradually that reveals that the dynamic-language everything-is-a-pointer approach is really not right; it's just not optimal. There's a really nice recent paper about this, on storage strategies for collections in dynamically typed languages, that basically showed you really don't want the everything-is-a-pointer style of representing things, even in Python. What they did was modify the PyPy JIT to add strongly typed representations for all the data structures under the hood: the ability to have a 32-bit int array, a float array, and so on underneath, not exposed in the language. The language was still just Python; they would represent things that way within the system and put a wrapper on top that gave you the Python interface. That involves a lot of extra complexity. There's extra dispatching to select code for the different kinds of arrays, and sometimes you actually have to switch the representation of something: an array starts as an int array, then someone stores a string into it, and the representation has to change. So there's overhead and it's more complicated, but amazingly enough, it turns out to be worth it. Specializing the storage of data structures is such a big deal for performance that it ends up being worth all the extra complexity at the end of the day. Doing everything as a pointer is just wrong; it's not the right thing to do. Julia, I think, was designed to exploit this from the beginning, but this paper provides some very nice substantiating evidence.

Here's an example, mechanically, of how this works. This is loosely what happens inside that storage-strategy system, and it's what happens inside Julia. In the middle we have some memory location, one of our typed locations. It holds 16 bits, and somewhere there's a tag that says these are 16 bits; the tag isn't necessarily stored next to the data, and the fact that it's a 16-bit location might exist only at compile time, just somehow attached to the location. Then we have operations on this location, stores and loads, on the left and right, and for each there are cases where we know the types involved and cases where we don't. If we're generating code for the case where we know the types of everything, of course it's very fast: we have a 16-bit value, we know we're storing it to a location that holds 16 bits, so we can just store it directly. The same goes on the loading side. If instead we're starting with data we don't know much about, we'll probably have some sort of boxed object, a heap-allocated thing with a tag on it. We check the tag and see that it matches (yes, this is an Int16, good), then load the data out of it and put it in the location. So there are a couple of extra steps; it's not too horribly slow, just a little extra indirection, a little extra work. And finally, if we load something out of this location but we're putting the data in a place where we don't know the type, then we actually have to box it, because the receiver isn't prepared to accept exactly an Int16. In that situation we have to heap-allocate a box for it, and that is slow.
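Here is a rough Julia rendering of those cases (illustrative only; in reality the compiler, not the programmer, picks the fast and slow paths):

    loc = Ref{Int16}(0)                    # a typed, 16-bit storage location

    # Fast paths: types known on both sides, direct store and load.
    store(loc, x::Int16) = (loc[] = x)     # direct 16-bit store
    load(loc)            = loc[]           # direct 16-bit load, type known

    # Store from an untyped source: check the tag, then unbox and store.
    store_any(loc, x) = (loc[] = x::Int16) # assertion plays the tag check; fails otherwise

    # Load into an untyped destination: the Int16 must be boxed on the heap.
    dest = Any[0]
    dest[1] = loc[]                        # boxes the Int16 into a Vector{Any} slot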
That boxing case is sort of the one bad case. The good thing about it is that the box we allocate is probably very ephemeral, and ephemeral garbage objects are actually pretty cheap, so if the box becomes unreferenced very quickly, this isn't that bad. But it's definitely the slow case. What you see here is one case that's kind of a wash, two cases that are very, very fast, and just one case that's bad. So overall, I think this turns out to be worth it. That's basically what happens.

And we use this exact same scheme for specializing code as for data. In the case of data, there's a function from a type to a memory layout: starting from this Int16 thing, you can determine that it takes two bytes, needs two-byte alignment, and so on, and for more complicated types there'd be more information about the sizes and offsets of everything inside. Similarly for code: starting from types, you can generate machine code. In that case the function is a lot more complicated, since you have to do type inference and code generation, but the basic idea is the same: the types imply what the code has to be, just as the type implies what the data layout has to be. In the data case, where we have full type information, we can do the direct store and load; for code, if we know exactly what's happening, we can do a direct call, or even inline the code being called. It's the same pattern on the other side too: in the data case where we don't have good type information, we have to box or unbox values; in the code case, we have to do a dynamic dispatch, since I don't know what I have and there has to be some runtime lookup. So there's a very parallel structure here.

Overall, in a system like this, you get a pattern where there are specialized things, and then occasionally there's this more dynamic dispatch going on around them: a layer of dynamic glue around efficient data structures, or around fast compute kernels. If you take a step back, that's really very similar to the design of something like NumPy or MATLAB, where you have a lot of pre-written fast compute kernels and you do dynamic dispatch on top of them. It's a very similar pattern. But the key difference is that since this is part of our language from the beginning, you don't have to tediously, manually separate out which part is going to be in a kernel. In systems like NumPy, people have to decide: this function is going to be written in C, and this part is written in Python. You have to manually decide each time which thing goes where, and every time you do that, you have to figure out what the interface is going to be, how the high-level program is going to call the fast thing underneath; that gets rethought every time. In our case, the interface is just the types: you can dispatch to the fast thing that handles each case. And when you do it the manual way, you can actually end up slower even in the dynamic case, because you sometimes need multiple layers of dispatch. This happens in NumPy, for example: when you make certain NumPy calls, first there's a Python method invocation, which is one dynamic dispatch, and then NumPy has its own dispatch tables to pick among its various compute kernels.
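Here is a minimal sketch of that code-side parallel (my own illustration): with a concrete element type, the inner call is direct and typically inlined, while with Any, every iteration pays a runtime lookup.

    half(x) = x / 2

    function total(v::Vector{Float64})
        s = 0.0
        for x in v
            s += half(x)       # element type known: direct call, likely inlined
        end
        return s
    end

    function total_any(v::Vector{Any})
        s = 0.0
        for x in v
            s += half(x)       # element type unknown: dynamic dispatch each time
        end
        return s
    end

    # @code_llvm total([1.0, 2.0])        # compact, specialized loop
    # @code_llvm total_any(Any[1.0, 2])   # generic calls and boxing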
In Julia we can do this with just one dispatch, because the one dispatch system we have is good enough to handle all of it. And this pattern of dynamic dispatch over fast code shows up in a bunch of domains. Numerical computing is a big one: you have a lot of important compute kernels, like matrix multiply, where the computation inside the kernel is very, very regular, but selecting which kernel to run is a very irregular process. I think it also happens in database systems, where you often have dynamic schemas: you don't know the schema until runtime, but once you know it, you could potentially do something very efficient.

Let me show a really quick example of that. Let's see, is it showing everything? I'll zoom in a bit more. All right. To simulate this kind of dynamic situation, I created an IOBuffer, which just simulates some kind of data source, and I serialized a couple of types into it, to pretend we had something stored on disk that tells you what the types are. Now, at runtime, my program can read what those types are and make a tuple type from them, and then immediately do things with it. Here I constructed this type from the data: you can see it becomes a tuple type, it takes up 8 bytes, and I can right away start doing things with it, like converting some other kind of tuple to this type. (I'll sketch roughly what this demo looks like below.) This type object gives you a common currency: once you have it, you can spend it anywhere, you can do anything in the system with it. In fact, just to show what happens, we can call code_llvm on that convert operation, and it shows the LLVM code we generated. You can see we actually get specialized code here: it returns an i8 and an i32, just as we expect, and it takes an i64 and a double, since that's the argument I gave it. So this code gets generated internally, and you gain access to it automatically just by having this type. When you construct the type and call convert, the code might be generated right then and there, on demand, or it might already exist, in which case it's just found and run; everything possible is cached. You can see it generates a check for overflow, but otherwise this is very specialized, quite compact code. You can take this a bit farther... that one's really big, I won't bother with it.

Okay, so I've been a little bit loose with the terminology. When I talk about types, you typically think of data types, numeric types like Int8 or Int32, but you can actually put a lot more information into them. You can put almost arbitrary values into what we call our types in Julia, so you can specialize code on many, many things. Intuitively, you have to generate specialized code for different data types because, for example, the CPU has different instructions for integer operations and floating-point operations, so you need different code in that case. But there are cases where you need different code for all kinds of other situations, for different integer values, for instance. Here's a function specialized on the integer 2: a function that applies another function to the numbers 1 and 2. It's specialized for the case where n equals 2; for n equals 3, we'd need to do f(1), f(2), f(3).
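Here is a hedged reconstruction of that demo (the exact values shown on screen aren't in the transcript, so these are my guesses, but the 8-byte tuple type and the i8/i32 result match what's described):

    using Serialization

    # Simulate a data source that reveals the element types only at run time.
    io = IOBuffer()
    serialize(io, Int8)
    serialize(io, Int32)
    seekstart(io)

    T = Tuple{deserialize(io), deserialize(io)}  # Tuple{Int8, Int32}, built at run time
    sizeof(T)                                    # 8 bytes, including alignment padding

    x = convert(T, (1, 2.0))                     # (Int8(1), Int32(2))
    # @code_llvm convert(T, (1, 2.0))            # specialized: returns an i8 and an i32,
                                                 # takes an i64 and a double

And the value-specialized function just described could be written with Julia's Val mechanism, roughly like this (again a sketch, with a made-up name):

    # Val{N} lifts the integer N into the type domain, so each N gets its own
    # compiled, fully unrolled method.
    applyn(f, ::Val{2}) = (f(1), f(2))
    applyn(f, ::Val{3}) = (f(1), f(2), f(3))

    applyn(sqrt, Val(2))   # (1.0, 1.4142135623730951)
    applyn(sqrt, Val(3))   # a separate specialization, compiled on demand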
This kind of situation actually happens pretty often in mathematical computing, where you sometimes need to generate quite different code based on an integer parameter, and people have used this very productively in the Julia world. So this gives you an interface to specialized code generation, and it's pretty general. All right. But despite all that, some things in Julia are still slow. Right now, higher-order functions, that is, passing functions to other functions and using anonymous functions, actually don't perform very well. I'm working on