All right. Hi, thanks for coming, thanks for hanging out here. My name is Andy Wingo, and I co-maintain Guile along with Ludovic Courtès and Mark Weaver. I've been mostly working on compiler and runtime stuff, and I'm the worst maintainer in terms of bugs. Don't ask me about bugs.

So this talk is going to be about Guile 3. It's an upcoming major release of Guile. Right now we're at Guile 2.2, and this is the next incremental step. It's essentially source-compatible: your programs will run in the same way, only faster. And then we're going to talk a little bit more about how we got here and where we're going next.

But let me skip to the end of the talk already, with some results. If you're summing a 10-million-element vector of packed floats or something, it runs 2.7 times as fast as Guile 2.2. If you take a task which is less computational in terms of tight loops, but still very general-purpose, like running the macro expander on the SSAX source file, it's about 1.5 times as fast. And then I was going to give it a try with Guix, and I think I got it working, but I don't really understand what I'm testing, so I don't know how fast it is there. But the thing is, it's only going to get faster from here on. So that's really the deliverable of Guile 3, so to speak: it's just the same thing, but faster.

And how do we do that, and how did we get here? I didn't study computer science; I just had Guile programs that were running slow. This was around 2006. I had audio synthesizers and different stuff, and I looked into it. And I tried everything: I would cache results, be very lazy about computing stuff, drop things out to see if I actually needed to do them. I built a statistical profiler. In the end, it turned out that the problem with my programs was that Guile simply ran Scheme code too slowly, and that Guile should run Scheme code faster. And that's how I picked up some compiler work that was sitting around, unmerged, and ended up maintaining the compiler and runtime in Guile.

So to give you a sense, and I know there's some small text here: in the Guile of 2006, which is about Guile 1.8, or 1.6 at that time, you have your Scheme source code at the top. And what would happen is that at runtime, you always start from the source code; you would never cache any kind of compiled analysis of your source code. So you'd go through an expansion phase in which all your macros were expanded out, and then you would have this kind of primitive Scheme form; that's the second box there. And at runtime, you just keep interpreting these primitive Scheme forms. And we were very proud of our interpreter then. We felt like we had a fast interpreter, and we felt like we were at some kind of local maximum. And I guess we were, but it was very local. And not very maximum, you know?

So what we ended up adding was a separation into compile time and runtime. And I know this is incredibly basic here. But we would do the expansion down to the primitive Scheme form at compile time, so you wouldn't have to at runtime. Also, in old Guile, you had to write your macros with performance in mind. You had to write these programs that run on your programs knowing that they would run every time you run the program. Whereas in actuality, a macro is a program that runs on your program: it's a function of your program, not of when you run it. So by adding a compilation phase, we were able to run macro expansion and analysis and optimization at compile time, so that at runtime you would have a byte code, which would then be interpreted.
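To make that concrete, here's a sketch (my example, not from the slides): a macro like my-or is a function of your program, and with a compilation phase it runs exactly once, at compile time.

```scheme
;; A macro is a function of your program, not of the moment you run
;; it. With Guile's compilation phase, my-or is expanded once, at
;; compile time; the runtime only ever sees the expanded code.
(define-syntax my-or
  (syntax-rules ()
    ((_) #f)
    ((_ a) a)
    ((_ a b ...) (let ((t a)) (if t t (my-or b ...))))))

;; (my-or #f 3) compiles to code equivalent to:
;; (let ((t #f)) (if t t 3))
```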
So, I know we have the red-green accessibility issues here, but the green things would be the expansion phase, the optimization phase, and the code generation down to byte code. All of this happens at compile time. And then at runtime, you just interpret that byte code. And I say interpret: there are levels of interpretation in your system. So it was a byte code interpreter interpreting your program. And I know you've all seen those diagrams of idealized Turing machines, with the machine and the strip of instructions. So here, the machine is the virtual machine in Guile, and the strip of instructions is your byte code. And that machine was implemented by vm.c interpreting your program. And sometimes that machine is called a virtual machine, because the instructions it operates on are not machine instructions; they're somehow at a higher level. And you'll see that as Guile develops, we just keep dropping these levels down and down. It's still all virtual. And even on your CPU, the x86 instructions aren't really what your CPU runs, right? Those are expanded out as well, in a similar fashion.

So the thing is that when Guile becomes faster, you can write more things in Guile. And I got kind of interested in this: the expanse of the set of programs that Guile could deal with. I don't think Guix, a program of half a million lines of code, would be able to start up as fast as it does right now on the old kind of Guile. And then the other thing that happened as I worked on this is that I got hooked. So, happily for me, language implementation is what I do now professionally; currently, I'm working on SpiderMonkey in Firefox.

Right. So Guile kept on evolving, and a couple of years ago we released Guile 2.2, which is the one that's out, the one that folks are using, especially in Guix. And we added just one more phase in here. It turned out that the primitive Scheme language that we did optimizations on in the past wasn't actually a great language for doing optimizations. So we have a continuation-passing-style intermediate representation, which we call CPS soup; it's kind of like SSA, for people who work on compilers. And so you do that level of optimization there. We still bottom out in byte code, but it's a lower-level byte code. So it's a different kind of byte code in Guile 2.2 versus 2.0, but otherwise it's similar. You can see that the tower is getting taller, and that's kind of where we're going here.

And if you think about Guile 2.2: where do we need to go? What do we need to do in the language? What's our goal, what's our direction, what's our purpose? I think that on one side, the language itself needs to do a bit of evolving. We haven't really changed the language that Guile implements in a long time, and so we need to update a bit. And I think, probably, for me, we need to approach Racket; we need to be closer to Racket somehow. That's all front-end work, mostly. I've been working mostly on the back end: Guile itself could be faster, and I think more kinds of programs could be written in Guile if Guile were faster. And that's what I've been working on.
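As an aside, you can see each level of that tower yourself from the REPL; a sketch, assuming the language names (tree-il, cps, bytecode) that I believe Guile 2.2 uses:

```scheme
;; Stopping Guile's compiler tower at its intermediate languages.
;; The #:to language names are, to my knowledge, those of Guile 2.2.
(use-modules (system base compile))

;; Tree-IL: the "primitive Scheme" you get after macro expansion.
(compile '(lambda (v) (vector-ref v 0)) #:to 'tree-il)

;; CPS soup: the SSA-like intermediate representation.
(compile '(lambda (v) (vector-ref v 0)) #:to 'cps)

;; And the byte code that the tower bottoms out in.
(compile '(lambda (v) (vector-ref v 0)) #:to 'bytecode)
```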
Because of Guile's compiler: many of you experience Guile in the form of Guix. How many of you use Guix here? Some of you? And when you run guix pull and it has to compile all those damn things and you're waiting for a long time, many of you have had this experience, I think. Specifically, when Guix is compiling the set of packages that it has in its library, which are implemented in Scheme, that's running Guile's compiler, which is written in Guile itself. So speeding up Guile's compiler will speed up all instances of compilation. Users of Guile right now sometimes have a feeling of slow compilation, so speeding up Guile will make that experience better. And then otherwise, I just keep working on it because I'm kind of a junkie.

So, Guile in 2019, which is Guile 3: we released a pre-release a couple of months ago, and Guile 3 will come out at some point. It's just another level. Instead of stopping at byte code, we stop at a lower-level byte code and then emit corresponding machine code; we're simply adding a JIT to the tower. But getting this to work in a maintainable way involved a number of compromises, so I want to explain them, because they will affect how you work with Guile. And I want to emphasize again that it's just an incremental step: on a language level, you won't perceive essentially any change.

OK. So, you want to stop interpreting virtual instructions and instead emit native instructions, to have the CPU interpret those instructions, because the CPU is interpreting in the end. That's challenging in a small project like ours. You don't want to have a lot of code duplication in the compiler. You don't want to add a lot of complexity; you want to keep things simple. At the same time, Guile is a very cross-platform project. People use it on really weird machines, and I want to keep that; I don't want to force those users away. And also, you don't want to generate too much native code. Many of you remember the Python Unladen Swallow project from back in the day. In the end, as far as I understand, it failed because of complexity and because of code bloat. This is a thing that can happen to language implementations.

So in order to meet these goals, we had two steps: first, to lower to a lower-level byte code than we had in Guile 2.2; and second, to actually generate the corresponding native code. The first part took much more time; the second part was quite easy.

So as an example, this is at the Guile REPL. I know it's a little bit hard to read; I think the height of it is going to be the salient fact, though. At the prompt above, I disassemble a function that just references the first element of a vector. In Guile 2.2, we assert that we have the right number of arguments coming in, we do the vector-ref, we handle any interrupts if needed (which is like the stack check in JavaScript VMs, for example), and then we return the value. So, pretty straightforward. In Guile 3, it's horrible. Or good, depending on your perspective. It's not really understandable, but it's taller. What it means is that each of these instructions does less. The set of instructions is more orthogonal, so that the native code emission, the JIT compiler, can be smaller, because it has to do less per instruction. Additionally, it exposes some control flow that wasn't there before. And it's all at a much lower level, so you have instructions which are closer to machine code. It's closer to a low-level virtual machine, if you will. LLVM. You have more instructions for a given program, and in that byte code you have more control flow. The compiler can do more with it, though. For example, that vector-ref instruction has to do a number of things: it has to check that the vector is a heap object, that it has the right vector type tag, that the index is actually an integer, and that the index is within bounds for the vector, all these sorts of things.
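Conceptually, the work hidden inside that one fat vector-ref looks something like this; a sketch in plain Scheme, where heap-object? and vector-tagged? are invented stand-ins of mine, not Guile's actual instruction names:

```scheme
;; Crude Scheme models of the low-level checks; in the real VM these
;; are separate byte-code instructions operating on tagged words.
(define (heap-object? x)   (not (exact-integer? x)))  ; toy approximation
(define (vector-tagged? x) (vector? x))               ; toy approximation

;; What one Guile 2.2 vector-ref instruction bundles, and what Guile 3
;; byte code splits into separate instructions with explicit branches:
(define (checked-vector-ref v i)
  (cond
   ((not (heap-object? v))   (error "not a heap object" v))
   ((not (vector-tagged? v)) (error "not a vector" v))
   ((not (exact-integer? i)) (error "index not an integer" i))
   ((not (and (<= 0 i) (< i (vector-length v))))
    (error "index out of range" i))
   (else (vector-ref v i))))  ; only now is the raw load safe

;; (checked-vector-ref (vector 'a 'b) 1) => b
```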
In Guile 3.0 byte code, all of this is separate. And so it means that if you have a hot part of your code, the compiler can omit some of these checks; you don't have to repeat them all. And so this prevents code bloat. But it does mean that the optimizer has a bigger program to work on, so there's more work for the optimizer to do. So, on the downside, compile time could be longer.

Did we succeed? Maybe not. Because more instructions in the program, a bigger, lower-level intermediate representation, means that it's more work for the compiler. And the run time could be longer also. Because if you think of a virtual machine where you're interpreting byte code, every byte code you execute has a bit of overhead for the interpretation, for the dispatch. And so if you add more of them, you might be slowing your program down, even though each one of them does less.

However, it's easy to generate native code for this. So for example, this is two implementations of the same thing. The top is the byte code interpreter case for loading a small constant, and the bottom is emitting machine code for the same operation. I put a couple of things in bold that are hard to see. Basically, in the interpreter, you have an incoming instruction, encoded as a 32-bit word, and you have to parse out which constant you're going to load and where you're going to put it. Whereas at JIT compilation time, you know exactly which constant you're going to materialize into the native machine instruction sequence, and you know exactly where you're going to put it. Likewise, in the interpreter you have to dispatch to the next instruction; with native code, you don't. You just fall through to the next thing. So although the optimizer and the compiler have to do more work, the underlying engine is going to run a lot faster once JIT code is generated. And so the bet is that it's going to pay off, always. And in my tests, that's almost always true. I'll get to that in just a minute.
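To recap why that dispatch costs something, here's a toy byte-code interpreter in plain Scheme; Guile's real one is in C, in vm.c, but the shape of the overhead is the same:

```scheme
;; A toy stack-machine interpreter. Per instruction it pays to decode
;; (pulling operands out of the encoded instruction) and to dispatch
;; (finding the next instruction). JIT-emitted native code bakes the
;; operands into the instruction stream and simply falls through.
(define (interpret program)
  (let loop ((pc 0) (stack '()))
    (if (= pc (vector-length program))
        (car stack)                         ; result on top of stack
        (let ((insn (vector-ref program pc)))
          (case (car insn)                  ; dispatch overhead
            ((load-const)                   ; decode overhead
             (loop (+ pc 1) (cons (cadr insn) stack)))
            ((add)
             (loop (+ pc 1) (cons (+ (car stack) (cadr stack))
                                  (cddr stack))))
            (else (error "unknown instruction" insn)))))))

(interpret #((load-const 1) (load-const 2) (add)))  ; => 3
```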
We use GNU Lightning. GNU Lightning is a project which exposes an API: when you call jit_movi, it emits the corresponding machine code to load an immediate into a register. And it has backends for every architecture that's in use today, so on that side, it's really good. The native code that Guile emits right now does the corresponding operations on the Guile stack that the interpreter would do. And I don't have time to go into how the Guile stack is represented, but it means that every instruction, if it takes operands, will load them from memory, and if it produces results, it will write them back to the corresponding slots. Currently, there's no register allocation; that's a next step, and a necessary one. However, because there is this correspondence between the interpretation of every instruction and the JIT-emitted native code for that instruction, it means you can switch between the two at any time you want. So at any time that you determine that a function is hot and you need to emit native code, you can do so and then jump into the corresponding place in the emitted machine code. And if at any time you determine that, actually, you need to do some debugging, you need to set a breakpoint, you need to do whatever, you can jump down from machine code into the corresponding byte code.

So on that side, we preserve a bit of simplicity on the implementation side. And the JIT itself is only 5,000 lines of code, and that's physical lines, not even significant source lines. And we did pretty well in terms of the number of reserved registers: there's only one that really needs to be preserved. And there's a stack register, which is the sort of base pointer for writing values; that can be reloaded, but it's usually always there as well.

So the thing is, when you generate native code, when do you do it? You have lots of choices. You can generate native code ahead of time, as with GCC, for example: you run your compile phase, and then at runtime there's no code generation. We could do this, and it's entirely possible. It's not yet implemented, but as I mentioned, the native code that we currently generate is a pure function of the byte code to which it corresponds. And so we can simply cache this emitted code in the ELF file that we already produce, in a separate section. Guile's object file format is ELF, and it's one of those formats where you can have a bunch of different sections, and that's fine. And the byte code that Guile emits is done in such a way that this is not really a linking hazard: you don't have to do a lot of relocations when you load the code at runtime. But as I mentioned, it's not yet implemented.

Currently, what we have is just-in-time code emission, JIT code emission, meaning that at some point we determine that it would be a good idea to emit native code for a function and its corresponding byte code, and we do that. And specifically, we need to avoid emitting JIT code for code that's only run once, for example. Because if we emit JIT code for everything in the system, it means that stuff that's not important will incur the cost of emitting the JIT code plus the cost of running the instructions, which is usually more than the cost of simply interpreting the instructions. And that leads to slower startup time, and I want to keep startup in the 10- to 15-millisecond range. And a lot of Guile is written in Guile itself, too; that's a fundamental aspect here that doesn't apply to a lot of other language implementations, JavaScript implementations, for example.

So what we have is a counter associated with each function; the function is the unit for which we emit native code. And this counter is incremented every time the function is called, and additionally at any target of a loop back edge. And so when this counter overflows some threshold, that function gets its corresponding native code emitted, and we jump into that native code. It's called tiering up when you move from the byte code interpreter into the corresponding native code, and tiering down otherwise. And currently, that tier-up threshold is configurable.
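In miniature, that tier-up bookkeeping amounts to something like this; a toy model in plain Scheme with an invented threshold value (the real counters live in Guile's C runtime, and my recollection is that the knob eventually surfaced as the GUILE_JIT_THRESHOLD environment variable, but treat that as an assumption, not something from this talk):

```scheme
;; Toy model of tier-up: each function carries a counter, incremented
;; on every call and at every loop back-edge target. Crossing the
;; threshold triggers a one-time native-code emission for the function.
(define *tier-up-threshold* 1000)  ; invented value, not Guile's default

(define (make-function-state) (vector 0 #f))  ; hotness count, jitted?

(define (record-hotness! state emit-native-code!)
  (vector-set! state 0 (+ 1 (vector-ref state 0)))
  (when (and (not (vector-ref state 1))
             (>= (vector-ref state 0) *tier-up-threshold*))
    (vector-set! state 1 #t)                  ; tier up exactly once
    (emit-native-code!)))

;; Called on function entry and at loop back-edge targets, e.g.:
;; (record-hotness! state (lambda () (jit-compile! f)))
;; where jit-compile! stands in for the real emission routine.
```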
So, status: where are we at? We have some impedance problems with GNU Lightning, unfortunately. GNU Lightning, when I remembered it, when I thought about what it was, was this project written almost entirely in C preprocessor macros: it would just emit code into a buffer as you ran the thing. But it turns out that GNU Lightning had a major version change in which the API was kept mostly the same, but instead of emitting code directly, it builds up a graph of nodes, which it then proceeds to optimize and do register allocation on, in order to optimize calls especially. And unfortunately, that's just not what we need. I don't need this thing to do register allocation. I don't want this complexity lying underneath what I'm working on. And it crashes, and I don't understand why, and I've spent a lot of time on this. And it takes time. So I need to abandon Lightning 2. I would go back to Lightning 1, except it doesn't have a lot of platform support. So currently, unfortunately, I'm looking at writing another stupid JIT library. And if anyone knows of good, appropriate libraries, talk to me afterwards; I'd be interested in hearing your experiences.

And so, next: the quality of the code that we emit is not great. Partly this is because of Lightning, but it's mostly because of the lack of good register allocation. And so that's a definite next step, and it's a totally well-trodden path for VMs: doing register allocation over function-sized units of byte code. And I want to get us to a point at which we have consistently comparable performance to Chez. We beat them only a couple of times on a couple of the standard benchmarks now, but usually we're four to ten times slower than Chez. And so I want us to be consistently within a small factor of Chez Scheme in terms of speed. And given that the compiler has a lower-level byte code, a lower-level intermediate representation, it works with your programs at a lower level.

Another obvious thing: we don't always run programs these days in the form of compiled C code, for example. Most of the programs many of us run are written in JavaScript and deployed via a web browser. And it seems pretty obvious to target the WebAssembly standard, which probably most of you know about. There's a proposal, whose status I have to check on, called the GC proposal, which introduces typed heap objects to WebAssembly, something WebAssembly doesn't have yet. And I would like to depend on that, if it manages to progress.

Obviously, I think we need to evolve the language. Well, I don't know if it's obvious; it is to me, anyway. I would like to evolve Guile a little bit more, and that means moving closer to Racket in some way. And what that means and how to do it is a long-term project, but I want to do it.

And then, otherwise: on this project, I'm kind of a solitary creature. And I have had some support from work on this, in a kind of 20% or 30% time way. But it's a project that I enjoy doing for myself. And I perceive communication as emotional labor; because emotional labor is anything you don't want to do, and self-care is the things you do want to do. So I need to figure out how this can scale beyond just some dude hacking on the thing. But I'm really gratified, especially about the Guix community, that you're doing so many amazing things with Guile without my being there at all.

Right. So yeah, any questions? Check it out. We're in #guile on freenode; I am there, and you can ping me as wingo. Otherwise, we'll be trying to get out some pre-releases. I said in the talk summary "in spring"; I don't know, I think it might be fall. But yeah, that's it. So I'll take any questions. Thank you.

[Question from the back:] Regarding the switch from the original Guile garbage collector to BDW-GC: how do you evaluate that decision now?

So the question was: in the switch from Guile 1.8 to 2.0, among the things that changed was that we adopted the Boehm collector, which, for anyone who doesn't know it, is a conservative garbage collector. I think I am satisfied with it right now. I'm unsatisfied with it on a peak-performance level and on a pause-time level; this is my performance junkie talking here. But I think these things can be fixed in the future.
As we get less C code, change becomes more possible. I think it was the right choice then. But that's a personal opinion, and I'd be happy to talk about it later. Yeah. Yes.

[Question:] You mentioned that you want to move Guile closer to Racket on the front end, in terms of the language itself. You also compared what you did to Chez Scheme. And it so happens that Racket itself is moving to Chez Scheme under the hood. So why not do that?

The question is: given that we're interested in moving closer to Racket, which I think Chris will have many thoughts on, and that Racket itself is re-basing its implementation on top of Chez, why not re-base on top of Chez? For me, I enjoy the language implementation work. And I want to beat Chez. I think I can, but we'll see. But the back end is not the incredibly interesting thing. I mean, from a user perspective, what matters is the language you implement. So I want to make the language itself the higher priority; what's running underneath is kind of an implementation detail, in some way. And whether it's a good or bad implementation-detail choice, well, that's arguable. But it's not essential, I don't think. One more, yeah.

[Question from the audience, not picked up by the microphone, about calling conventions.] No, no. I mean, we have a bit of an idea in terms of the calling convention between Scheme code. Obviously, you need the native calling convention when you call out to generic runtime routines. But the calling convention kind of forces a bit how you think about it. Yeah, right; there was a related presentation on this earlier.

[Question from the audience, not picked up by the microphone, about register allocation strategy.] I don't have huge ideas, no. I keep getting convinced by various people saying: oh, I like linear scan; oh, I like graph coloring; oh, I like iterative whatever. So I don't really know.

OK. Thank you very much.