All right. Good afternoon. Thank you for coming. I'm Eugene, and I will talk about a project that I have been working on for some time. So, what is the problem? We have interpreters and compilers, right? We know these methods and use them frequently, and both are good in their own ways. The main strength of interpreters is that they stay very close to the program semantics. If we are implementing a programming language, it has some rules for evaluation, and those rules are very cleanly expressed as an interpreter. That leads to very clean code: we can explore the interpreter, we can debug it very easily, we can test it. But on the other hand, it has some downsides. By its nature, it has to wrap every instruction in some glue code, and that leads to overhead. So what do we do? We write compilers instead, and compilers are awesome. They basically operate in two stages: they split execution into the compile stage and the execution stage. The compile stage takes only the source program, and it produces some kind of target code that takes the rest of the arguments to this program. And that's all good, but unfortunately compilers are also harder to develop and maintain, because they require a specific style of coding, where we don't write the operations themselves, we write code that generates these operations. They also require the use of some complex libraries and programming APIs. And no matter how easy a compiler infrastructure makes it to write compilers, it will always be harder than writing an interpreter. So we can't pick and choose the advantages of these approaches: we either pick an interpreter and get clean code but slow execution, or we pick a compiler and we're stuck with code that is harder to grasp.
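As a small illustration of both points (my own example, not from the slides): a tree-walking evaluator in C maps the evaluation rules almost one-to-one onto code, which is the clean-code advantage, but every node pays for a recursive call and a switch dispatch, which is exactly the glue-code overhead mentioned above.

```c
#include <stddef.h>

/* A tiny tree-walking interpreter for arithmetic expressions.  Each case
 * of the switch is one evaluation rule of the language; the recursion and
 * the dispatch on e->op are the per-node "glue" that a compiler avoids. */
typedef enum { LIT, ADD, MUL } Op;

typedef struct Expr {
    Op op;
    int value;                      /* used when op == LIT */
    const struct Expr *lhs, *rhs;   /* used when op == ADD or MUL */
} Expr;

int eval(const Expr *e) {
    switch (e->op) {                /* dispatch overhead on every node */
    case LIT: return e->value;
    case ADD: return eval(e->lhs) + eval(e->rhs);
    case MUL: return eval(e->lhs) * eval(e->rhs);
    }
    return 0;
}
```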
But what if we could pick and choose? What if we could keep the direct translation of the semantics, which gives us code that is easy to explore, to debug, to extend, but also get multi-stage execution with the corresponding performance properties? Luckily, this has existed for quite some time, since the 1970s, and it's called partial evaluation. So, what is a partial evaluator? It is a program that takes another program and some of the arguments to that program, and it produces a representation of yet another program that takes the rest of the arguments. So the arguments are split into two groups: the first group is passed to the partial evaluator, and the second group is passed to the program produced by the partial evaluator. And we can already see that if we apply it to our interpreter, with the source program as one group and the arguments as the other group, what we get is pretty close to the multi-stage execution that we used a compiler for. In the compiler, all of the analysis was done in the first stage, which takes only the source program; the produced code doesn't have access to the program and doesn't need any analysis and transformation passes. Here we can achieve the same result: we can embed all of the analyses and transformations in the first stage, and the generated program will be free of these extra overhead layers. Well, first a remark: this requires all the arguments to be separated. If we want to apply this method to an interpreter, we need to specify which arguments go to which stage. Here we say that the source program is in stage zero, or static, as it's traditionally called, and the arguments are in stage one, or dynamic. We have to make some adjustments to that. First of all, we don't really want this setup, plain partial evaluation, because the source program here is just another argument to our interpreter.
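To make the definition concrete, here is the textbook example in C (my sketch, not from the talk): a two-argument function, and next to it the residual program a partial evaluator could produce when one argument is known statically. The loop controlled by the static argument disappears into straight-line code.

```c
/* A two-argument function: x is dynamic, n is static. */
int power(int x, int n) {
    int r = 1;
    while (n-- > 0)     /* the loop is controlled only by static data */
        r *= x;
    return r;
}

/* What a partial evaluator could produce for n = 3: the static loop is
 * unrolled away, leaving a residual program in the dynamic argument x. */
int power_3(int x) {
    return x * x * x;
}
```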
And since the main application of this is JITs, the program will be bound at runtime; we don't know it in advance. But in the picture above, we have to partially evaluate the interpreter for every program, and that is its own layer of interpretation with its own interpretation overhead, which is exactly what we want to avoid. So instead we're going to use a specializer generator, in the picture below. This generator produces a two-stage program: we can run it at compile time and get a two-stage specializer that takes the source program in one stage and the arguments in the second stage, and produces the result. Yeah, and also, why not generalize to multiple stages? For example, we could have a system that processes queries to a database, and it can have multiple levels of parameters. In one stage we can have prepared statements, that is, a broad description or plan of a query or procedure that we want to interpret, and we can already specialize our interpreter with respect to this prepared statement. Then, for each query, given the parameters that we get from some other component, we can emit code pretty fast, and the resulting code will be specialized to our query. So what we want is for our generator to emit a multi-stage program. Okay, so let's build it, for LLVM. It's kind of complicated here, but this is LLVM. So, why LLVM? First of all, there are some existing partial evaluators for some languages, like C and the ML family, but they are all essentially forced to implement the same algorithm. And what better place to implement this algorithm than a language-independent optimizer? So let's put it there, and that will enable lots of languages.
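The chain of stages can be simulated in plain C with structs standing in for closures (all names here, `Plan`, `Stage1`, `prepare`, are mine for illustration, not the project's API): stage zero captures the prepared statement and hands back the "next stage", which takes the runtime parameter.

```c
/* Hand-written stand-in for what a multi-stage specializer returns:
 * each stage captures its argument and yields a "next stage" closure. */
typedef struct { int scale; } Plan;        /* stage 0: the prepared statement */

typedef struct {
    const Plan *plan;                      /* environment captured at stage 0 */
    int (*run)(const Plan *, int);         /* stage 1: takes the parameter */
} Stage1;

static int run_scaled(const Plan *p, int param) {
    return p->scale * param;               /* the residual computation */
}

/* Stage 0: "specialize" to a plan, returning the stage-1 closure. */
Stage1 prepare(const Plan *p) {
    Stage1 s = { p, run_scaled };
    return s;
}
```

In the real system, stage zero would emit code specialized to the plan rather than capture it in an environment; this only models the staging structure.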
So with little additions to the front ends of these languages, we can take advantage of this common functionality in the optimizer, and what we get is a way to build a multi-stage generator for each of these languages. So the project is an extension to LLVM. It has some attributes and intrinsics, and the main part is two passes: an analysis pass and a transformation pass. First of all, how do we represent annotations? As I said, we have to annotate the program in order to tell the system which arguments go to which stage. What we want is to split a function into, in this case, two functions: one will take Y and the other will take X. So we add some attributes to the parameters: we say that X is stage one, so it goes there; the result is also stage one; and we also have the attribute on the function itself. That may seem redundant, but it's not, because it is natural to want the last stage to not take any arguments at all. For example, we can have a function at the very last stage that doesn't take any arguments but instead accesses some memory, or calls some external functions for certain operations; for example, it can write to disk, or store to memory where we compose a string, something like that. Right, what is the interface here? The intrinsic takes a function pointer and the arguments, and it produces the program. And in Clang there is an extension: we have this attribute on the function declaration, and the result of this attribute is that Clang will generate a body for this function using the intrinsic, so we don't write a body in the source code. Right, and how do we use this? Well, say we have some interpreter in C.
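A sketch of what such an annotated C function might look like. The attribute spellings below are my invention (the talk doesn't show the exact syntax), so they are hidden behind macros that expand to nothing, which also keeps the sketch compilable with a stock compiler.

```c
/* Hypothetical staging annotations, modeled on the attributes described
 * in the talk.  The spellings are guesses; the macros expand to nothing,
 * so this compiles as ordinary C. */
#define STAGE(n)          /* parameter lives in stage n */
#define STAGED_RESULT(n)  /* function's result is produced in stage n */

/* y is static (stage 0); x and the result are dynamic (stage 1).
 * The goal is to split f into a stage-0 function of y that returns
 * a stage-1 function of x. */
STAGED_RESULT(1)
int f(STAGE(1) int x, STAGE(0) int y) {
    return x + 2 * y;
}
```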
So we write some interpreter for a language, or maybe something interpreter-like, say matching grammars: we have some grammar descriptions and a string. We add the attributes to the arguments to declare which arguments go to which stage, and then we add a function declaration with the mix attribute that references the function. Then we need some layer that translates the representation of the program into object code; we can just use the LLVM JIT for that. The function will create the IR representation of a specializer in some context, and we can manipulate this context: in our application we can create either a single context for everything, or multiple contexts for different situations. So it's very flexible in this case. Yeah, so now we'll talk about the implementation for a bit. What happens is that Clang creates this body using the intrinsic. Next, the intrinsic in the optimizer will be expanded into a function call. This function is actually the main workhorse: it takes the stage-zero arguments, and it returns another function that takes the stage-one arguments, and so on. So this is actually our specializer. At a high level, what we want is to split a function that takes arguments of various stages into a chain of functions: each function takes the arguments of one stage and returns the representation of the next function, which takes the arguments of the next stage, and so on. So we apply a certain transformation a number of times, and all this transformation does is peel off the arguments of the last stage: it removes them from the original function, and it replaces the result with a representation of a function that takes these peeled-off arguments and returns the old result. Let's go down to IR.
So here we have a function F with arguments X, Y, and Z, where Z is in the second stage, stage two. After one transformation we get this function G that doesn't have Z but instead returns a value pointer, which is actually a function pointer. This function G creates the new function that takes Z. You may also notice this context argument, which is used for accessing the current IR module, the context, and so on. After the next step we get this function H that takes the arguments of one stage less: it doesn't have Y anymore, it only has X. Okay, at the level of basic blocks: on the left we have an example of some control flow in our source program, and on the right is the corresponding basic block of the specializer. In the example we have three blocks, A, B, and C. Block A is in stage zero, and two instructions in basic block C are also in stage zero; everything else is in stage one, so dynamic. What we see in the generated code is that we have one block for A, because A is the only static block; only the two static instructions are copied from the original function, and the rest is replaced with calls to the LLVM C API to recreate this control flow graph. So we have these create-basic-block calls that create blocks B and C, and we have calls like the build-binary-operation here that recreates the add in block B. In general, all static basic blocks are transformed into basic blocks of the specializer; all instructions that are static, that is, of stage less than the current stage, are copied as is; and everything else is replaced with calls to the LLVM C API to recreate it in the next stage.
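The execute-or-emit split can be mimicked in a few lines of plain C (a toy model of mine; the real pass emits LLVM C API calls, not text, and handles arbitrary instructions): each instruction carries a stage tag, static ones are evaluated on the spot, and dynamic ones are re-emitted as residual code for the next stage.

```c
#include <stdio.h>
#include <string.h>

/* Toy instruction: stage tag, an operator character, two constant operands.
 * The toy only evaluates '+' statically; everything else is just emitted. */
typedef struct { int stage; char op; int lhs, rhs; } Inst;

/* Run static (stage-0) instructions now; append residual text for the
 * dynamic ones into `out`.  Returns the last statically computed value. */
int specialize(const Inst *prog, int n, char *out, size_t cap) {
    int acc = 0;
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        if (prog[i].stage == 0) {            /* static: evaluate immediately */
            acc = prog[i].lhs + prog[i].rhs;
        } else {                             /* dynamic: emit for next stage */
            char buf[64];
            snprintf(buf, sizeof buf, "t%d = %c(%d, %d)\n",
                     i, prog[i].op, prog[i].lhs, prog[i].rhs);
            strncat(out, buf, cap - strlen(out) - 1);
        }
    }
    return acc;
}
```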
Yeah, so, calls. We would like to support not just one function but a set of functions, because interpreters are typically written as a set of functions, and what we want is to stage all these functions at once. So first we gather the set of functions that form the interpreter, then we apply this transformation once, so we lower all the functions at once, and while doing that, we replace each call with a pair of calls. The first call goes to the function that is the specializer for this stage: here we have a call to f.mix, which is a generated function that specializes this function f, and it returns a function for the next stages; it only takes x here and returns a function that takes y. Then we build a call in the next stage that calls this returned function from the previous stage. So we have the pair of calls: the normal call, and a build-call here to recreate this call later. So what happens to the control flow graph? We have some control flow graph at the beginning. It is truncated, but you can see the idea. We have some basic blocks: green blocks are of stage zero here, red blocks of stage one, and white of stage two. After one transformation we eliminate all white blocks; we fold them into block A1 here, or C1 here. At the next stage we fold all the red blocks that are left into block A0, so block A0 will create these blocks. Actually, it won't recreate this graph exactly; it will unfold it: it will create the control flow graph from the basic blocks of the corresponding unfolded blocks. So here we have not the loop, but the unfolded loop with two iterations, and at the next stage we get a similar kind of graph. So now let's talk about binding-time analysis.
The previous section talked about how we separate one function into a chain of functions. Now I'm going to talk about how we know which instructions go into which stage; how do we actually know which instruction is built where and when, and similarly for basic blocks? This analysis provides just that information: it assigns an unsigned stage number to every instruction and basic block. The way it works is that, first of all, it assigns stages to parameters: if a parameter is declared with a stage attribute, as all parameters are, it assigns that number. Or if we have some external code that we know nothing about, we can assign it to the last stage. Then it runs a minimum fixed-point algorithm to propagate the stages to all of the other values in the function. It can reach a contradiction, that is, it can learn that the annotations are inconsistent, and it will report an error; or it can find an ambiguity while trying to resolve the stages, and it will report that too. So here's the example with the stages computed: each instruction gets a stage, and blocks also get a stage. And these are the rules that it tries to satisfy. I won't talk about all of them, but, for example, the rightmost rule at the top says that the stage of an instruction should be the same as or greater than the stage of its argument, which makes sense, because an instruction that depends on a value must be executed at the same time or later. But I want to stop at one rule in particular. This rule states that if we start from some stage number and from any basic block, and we trace the branches from this block, then we should reach exactly one terminator through terminators of greater stages, that is, through dynamic terminators. Yeah, so here on the left we have an example, with basic block A.
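The minimum fixed-point propagation can be sketched in a few lines (my toy model over a plain dependence matrix; the real analysis works on SSA values and has more rules than this one): repeatedly lift each value's stage to the maximum stage of the values it uses, until nothing changes.

```c
/* Toy binding-time analysis over N values.  deps[i][j] != 0 means value i
 * uses value j.  The initial stage[] holds the declared parameter stages;
 * propagation enforces the rule stage(use) >= stage(def), reaching the
 * minimum fixed point. */
#define N 4

void propagate(int stage[N], int deps[N][N]) {
    int changed = 1;
    while (changed) {                       /* iterate until a fixed point */
        changed = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (deps[i][j] && stage[i] < stage[j]) {
                    stage[i] = stage[j];    /* lift the use to the def's stage */
                    changed = 1;
                }
    }
}
```

With values 0 and 1 as parameters at stages 0 and 1, a value computed only from value 0 stays static, while anything touching value 1 becomes dynamic.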
And there is only one terminator that is reachable from block A through terminators of stage one. So if we traverse the graph starting from block A and follow all edges whose terminators are of stage one or greater, then we reach exactly one static terminator; it's in block C, and it becomes the terminator of the corresponding specializer block. If we had two, we couldn't build the specializer this way. Yeah, and we can prove termination: if the source program terminates, then the specializer also terminates. It's very easy to remove some of the rules and get into a situation where the specializer doesn't terminate even though the source program did, and that would be a problem. Okay, so the approach so far works great, until it doesn't, and that is because it is really strict about mixing static and dynamic computation. Here we have an example: a function eval that takes some pointer to a node, and you can think of it as an AST node, so we know everything about this node at compile time. But here we get an error, because the loop in this function goes through the nodes until eval returns a nonzero value. Since this condition is dynamic, the branch is also dynamic, and the phi nodes in the loop header are also dynamic; this follows from the binding-time rules. And it means that we have a contradiction here, because the parameter to eval is of stage zero, as declared, but the argument is of stage one. So how do we deal with that? Well, we added a built-in. The problem was that this call combined both the static and the dynamic parts: we evaluated both at once, in one block, controlled by the same condition. But we can separate it into two parts: we can have one loop that is fully static, that goes through all the nodes.
This loop only performs the call to the specializer, which returns the function pointer for the next stage, and we can store these function pointers in a plain array in C. Then, in the next loop, we can call these pointers instead of the eval function. This way, on this line, we specialize eval to every node, and here we just call the resulting function pointers. One caveat: this doesn't work with normal code generation, because it requires the specializer. This means that previously we could maintain the same code base for our interpreter and for the specializer, but now, in some cases, we have to diverge; we have to introduce another function or two just for the specializer, to deal with issues like this. This is the residual code: the first loop is completely unrolled, and the second is a normal loop. So now some examples. This is the best case, sort of: we have some expression tree of operations, and we specialize the evaluator to a particular expression. On the left we have a recursive function with a switch, and on the right we have a single basic block. And a similar example with convolutions: we have four nested loops, and we can replace the two innermost loops with one basic block that is specialized to our kernel; it only has about nine operations. String formatting is similar. In these cases we see some improvement, three or four times, or even bigger. Now, the example with the dynamic control problem. Here we have a bytecode interpreter that operates on a register machine: we have some registers, and we have operations to load a value into a register, to add registers, and so on. And all is good until we have this jump instruction, a conditional jump.
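Going back to the two-loop trick for a moment, it can be simulated with ordinary function pointers (my sketch; in the real system the first loop calls the generated specializer and receives freshly emitted code, whereas here it just selects a pre-written function per node):

```c
/* Loop 1 is fully static: it resolves each node to "specialized" code and
 * stores the function pointer in a plain C array.  Loop 2 is dynamic: it
 * only calls the stored pointers.  Assumes at most 16 nodes for brevity. */
enum { OP_ADD, OP_MUL };
typedef struct { int op; int operand; } Node;

static int do_add(int acc, int k) { return acc + k; }
static int do_mul(int acc, int k) { return acc * k; }

typedef int (*EvalFn)(int, int);

int run(const Node *nodes, int n, int acc) {
    EvalFn fns[16];
    for (int i = 0; i < n; i++)                      /* loop 1: static */
        fns[i] = nodes[i].op == OP_ADD ? do_add : do_mul;
    for (int i = 0; i < n; i++)                      /* loop 2: dynamic */
        acc = fns[i](acc, nodes[i].operand);
    return acc;
}
```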
And it exemplifies the dynamic control problem, because now this block branches either to the loop header or to this block with the jump operation, which jumps back to the header. So from this operation we effectively have two static terminators that jump to the loop header, and we can't have two, because we have to pick one to move into the specializer. So what happens is that none of this control flow gets statically unfolded, and we get only a modest improvement. But let's go back. How can we deal with that? Let's apply our built-in. First we split eval into two functions: the first function just returns the offset to the next instruction, and we still have our dynamically controlled loop here. Okay, so let's apply the trick with the built-in. We now have the array of function pointers, and we call this built-in, and on this line the loop is completely unrolled: we specialize the eval-instruction function with respect to every instruction in our sequence of instructions. And then we call our function pointers in a loop. So this is an improvement over not specializing at all, but it's no improvement over the original interpreter; it is actually slower. So let's go back again. Here is our test program, the Fibonacci program, and the first four instructions are shown on the right; this is the code specialized to each of these four instructions. You may notice that each of these instructions, aside from the last one, jumps to the next, and this is very inefficient, because each of these functions is very small, but we have lots of function calls between them, and they are indirect. So let's go back. We can change our function in C. We are staying in interpreter land, so everything here can still be run as an interpreter, aside from the driver function.
If we specialize that, we get this basic block, which I can genuinely call a basic block, because it is actually a translation of a basic block of our bytecode program. Yeah, and we got an improvement over the original: we got 36 seconds in our optimized version, versus 87 before. This is a toy example, but it shows some of the features of real interpreters that we have to deal with, and how we can deal with them in this way. Okay, so these are all toy examples; I want to apply this to some kind of real program, of some size. There's also one thing with annotations: we have to annotate every function that our interpreter calls and that we want to specialize. Instead, what if we had some interprocedural inference that we could apply to propagate these annotations across functions? And we have only one front end, for C. It's very easy to write, because it doesn't have to do much, just translate these attributes down to IR, but I only have one front end to show. And that must be it, yeah. Okay, do we have any questions? Any bounds on the increase in code size? Ah, code size. Yes, there is an increase; I didn't measure it, but yes, there is an increase. Is there anything limiting the increase? So the question is whether there is something limiting the increase in code size. No, currently there is nothing like that, so the code size can explode significantly. Yeah, this is a problem. Any other questions? Okay. Yeah, so I have a Game Boy emulator and a Brainfuck interpreter implemented exactly like your example. So theoretically, I could use your tool to specialize the interpreter and wind up with a much more specialized block for the eval routine?
Yeah, so the question is whether this can be applied to a Brainfuck interpreter written in this way, or even a Game Boy emulator or other kinds of emulators. I think so, yes, and I'm happy to help if there is any problem with that. I should note that this is all a prototype, so it doesn't work on all optimization levels, and it doesn't handle things like debug info, but yeah, this is a proof of concept of such a tool. Okay, any other question? You did say you plan to look at real-world examples next. What do you think is the real-world example that requires the least amount of work to go from where you are now, and what do you think the hardest real-world examples might be? Okay, the question is what potential real-world examples I can start from. Well, previously I worked on query compilation in databases, so this is kind of inspired by that domain. I think I will start from some kind of small database engine or query engine and try to specialize that, because they have a program, the SQL query, and an interpreter that interprets this program, so it's a good fit for this approach. So, for example, I think Postgres now has an LLVM-based code generator built in; does this make that better? Ah, well, I think it would be an alternative. (I wrote that.) Oh, okay. I think it would be an alternative, because there we manually generate the IR and do the specialization by hand, basically. Right, but there is an interpreter that you could unroll with the input program, in theory? Yeah, so there is a JIT compiler in the latest Postgres versions, for expressions and even for some parts of queries. And of course, this approach only works with interpreters, so we can take the earlier interpreter that was in Postgres, and maybe still is, and work with that.
And actually, one of the early versions of this tool was built specifically for Postgres, but it was built differently; it was a different tool. Any other question? Right: termination. What are the requirements on a program for your specializer to terminate? So the question is what prerequisites or conditions on the program are needed for termination. And the answer is: if it passes the binding-time analysis without errors, then the specializer will terminate. The analysis enforces a set of rules, and termination is a consequence of these rules; as long as these rules hold, the specializer will terminate. It is very possible to write annotations in such a way that they are inconsistent, for example, to pass a dynamic argument to a function that takes that argument in an earlier stage. So it is possible to write such code, but it won't pass the checks in the analysis; and as soon as it passes, it will terminate. Okay, any other question? Previous attempts? Yeah, so the question is how this relates to previous attempts at partial evaluation in LLVM, like LLPE. Yes, there is LLPE, which is built very differently: it works without the analysis part. The difference, as I would describe it, is that LLPE works in a single stage; it is an online partial evaluator, so there is no intermediate annotated representation to explore which parts will be executed in which stage. The second part is that LLPE doesn't have memory annotations, and it is hard to add them there, because that kind of thing requires compiler integration. What this project has is that you can annotate some types, or some fields in structs, or some variables, in such a way that we can partially evaluate loads and stores to these variables.
And in general, LLPE is more focused on partial evaluation with respect to inputs, some files or other input that a program takes and can be specialized to. This project has a different focus: it is aimed more at interpreters that are not standalone programs but parts of some other program, yeah. Yeah, okay. Have you done any analysis of how much performance you gain by doing the specialization in multiple stages rather than one combined stage? Well, so the question is what the benefits of having multiple specialization stages are, and it all depends on the application. Take the query compiler example: if we get multiple sets of parameters for one query, then it is profitable to specialize multiple times, and each time we would otherwise pay additional overhead if we didn't have the intermediate stage. Maybe I can jump back to the slide. It only makes sense to add, for example, stage P1 here if the code that is returned from this stage will be executed more than once; if it is executed a thousand times, then it is obviously profitable to add this stage. And likewise for stage P0: it is only profitable to add this stage if it will be executed many times. With query processing, we can have one query that is parameterized multiple times in some session, and then each query takes multiple parameters from the database, so it has exactly this kind of structure. It all depends on the application; in some applications it doesn't make sense to split these stages, because they are not split in real life: you have one block of arguments that you pass at one point in time, yeah. Time is up, but if there are small questions for the speaker, I'm sure you can find him afterwards. Okay. Thank you very much.