OK. Yeah, hello, everybody. My name is Kai. I'm a very casual contributor to LLVM. I haven't done anything with LLVM in the last year, but in the years before. I was also the maintainer of LDC, the LLVM-based D compiler, so I have some experience working with D. Today I will talk about my latest project, which is my own compiler.

Compiler construction is, for sure, a very big topic. The Dragon Book, for example, is over 1,000 pages long. In 30 minutes I don't have time to cover everything about compiler construction, so I'll concentrate on the topic which interests me most, and that is the generation of the intermediate representation for LLVM, because that's the main interface to LLVM.

The big question is this: if you look at the output of compilers, for example Clang or LDC or the Rust compiler, the generated intermediate representation always looks a bit different. Sometimes it looks elegant; sometimes there are constructs in it which seem redundant and could be optimized away. My question was: why does this happen? Why do some compilers generate a lovely, beautiful intermediate representation, while others generate some junk in it? By the way, it does not really matter, because LLVM optimizes it all away. But still, why does this happen?

To study this question, it's obviously good to have a compiler. It should be a compiler for a real language, but also not too complicated, because I'm the only person working on it, so it must be manageable by one person. That rules out C++, for example. I looked around a bit, also at programming languages I had experience with, and I came up with a very old language for my compiler: I chose Modula-2.

Modula-2 is indeed very old. The first implementation was in 1979, done by Niklaus Wirth.
I think everybody knows him as one of the main contributors to structured programming. It was developed in Switzerland, in Zurich, so it's a real European language. And it has some beautiful characteristics.

The first one is that it has a very carefully designed syntax. For example, if you have ever tried parsing C or C++, you remember the dangling-else problem: when you look at an else, you don't know to which if it belongs. That's not a problem with Modula-2; the syntax is designed so that it's always clear to which if an else belongs. There's also something revolutionary: Modula-2 has a module concept. Compare this with C++ today. It's a working module concept, and it dates back to an earlier language called Modula, from which it was taken. It's also a systems programming language, with low-level facilities integrated, and you have procedure types, which are basically pointers to procedures.

It was quite popular in the 80s and 90s. The first thing they did in Zurich was write the operating system for the Lilith workstation in this language, which proves that it's a real systems programming language. I have never seen it; I only know it from the literature. There were also big toolkits built with Modula-2. I know the compiler toolbox from the Gesellschaft für Mathematik und Datenverarbeitung in Germany, also called Cocktail. It was once open source; I think today they license it. Modula-2 was also quite popular for Windows programming, so if you search the internet, you'll find a lot of Windows programs written in it. So it's a real programming language, but it's also simple enough that I can create a compiler for it on my own.

I concentrate on the original language definition, which Wirth defined in a book called Programming in Modula-2; the fourth edition is also abbreviated PIM4. Later there was an extension.
ISO created a working group that worked on the standardization of Modula-2. They also defined some new language features, for example exception handling and an object-oriented facility with classes. But for now I'm only looking at the original definition.

For those of you who have never heard of or seen Modula-2, I cut an example from the book: a simple calculation of the greatest common divisor and the LCM. Nothing fancy, but it gives you a flavor of the language. We have modules, we can import from modules, and we have the standard things like if, while, and so on. It's a very nice language.

OK, so my first approach to the compiler. Hopefully everybody knows that a normal compiler has different phases or stages. I start with a source file in Modula-2. The first part is parsing, to generate an abstract syntax tree. For the parsing I currently use ANTLR, a compiler construction tool which is very easy to use. It's written in Java, but it can output code for Java, C++, Python, and a bunch of other languages. And for my purpose it has a very nice property: when you parse, you get a parse tree, and with ANTLR you can define rewriting rules so that you can also define your abstract syntax tree. For example, in parsing an if statement you see the keyword if, then comes an expression, and then comes the keyword then. The then is completely redundant in your in-memory representation, so in the abstract syntax tree you just get rid of it. This is very easily done with ANTLR; it's a nice tool for this.

Then I have a semantic phase where I do type resolution and other checks. That's currently very crappy: what I do is construct a hand-coded abstract syntax tree for the purpose of code generation, because my main goal is to get rid of the ANTLR tool. It's an additional dependency, and I do not want to have that dependency.
And here again, Modula-2 is a very, very easy programming language: you can code a recursive descent parser for it by hand, it's not very complicated. But just for speed, I currently use ANTLR, because I want to study the generation of the LLVM IR, and so the parts before that are currently very, very crappy. I haven't published the source for it yet. There's already a GitHub repository, and as soon as the code is in a state where I think someone can look at it, I will publish it under this URL. But for now it's very crappy; I'm not proud of this code. So be patient.

OK. So my main focus is the code generation. I have a decorated abstract syntax tree, decorated because I have already done type resolution and scope resolution, so I have everything in place. From this I now want to generate LLVM intermediate representation. The building blocks for this intermediate representation are the basic blocks. Every time I generate an instruction, for example an add or a return or a branch, it goes into a basic block. That's more or less the container for the instructions, but with a very sharp restriction on it: every basic block is single entry and single exit. It's just a linear flow of instructions. There can be several jumps into a basic block, and at the end there's a terminating instruction, some kind of branch or an invoke, and there the basic block is left. Inside the basic block there's just one single flow.

There's a simple example of how it looks; in this case it's taken from Clang output, with some tweaks in it. The basic block is a building block because most optimizations are done on basic blocks. If you look inside LLVM, a lot of the optimization work focuses on basic blocks.
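To make that concrete, here is a small hand-written sketch of what a basic block looks like in textual LLVM IR. This is illustrative only, not actual compiler output: a label, a straight-line run of instructions, and exactly one terminator at the end.

```llvm
if.then:                          ; preds = %entry
  %0 = load i32, ptr %a           ; linear flow of instructions,
  %1 = add nsw i32 %0, 1          ; no jumps in the middle
  store i32 %1, ptr %a
  br label %if.end                ; the single terminator: control leaves here
```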
A procedure or a function consists of several, or a lot of, basic blocks, and these blocks form a so-called control flow graph. If you have used the LLVM tools, there are command line switches to generate a graphical view of this control flow graph using the dot program. That's very nice when you do some debugging; it's worth looking at these graphs to better understand what's going on.

So now it's clear where the instructions have to go. I'm a very simple person, and I like to do things simply. So my first approach to generating the code: I create a basic block, I keep a pointer to this basic block, and I generate the instructions into it. How does that go? OK, some code. I have a node of my abstract syntax tree, here for a procedure; it's very, very simplistic. It has a name, it has the inner block, which is the list of statements, and it can accept a visitor. And what does the visitor do? It has to create the LLVM function type, then create the LLVM function. When I have this, I can create my basic block. It's always nice to use the IRBuilder, because it makes things a little bit easier. Then I have my basic block, and with the visitor I just generate the code into it.

As I already said, a basic block is always single entry, single exit. So this approach means that sometimes during generation I have to create new basic blocks and switch the pointer in the visitor. But that should be no real problem, hopefully.

And this approach works very well if I generate code for something like i + 1 in an assignment. But as soon as I start nesting structures, for example an if-then-else nested inside another one, I get into trouble. Why do I get into trouble? Let's have a look at the inner if. There's the condition b > c. OK, that's an expression.
I generate the code for it. Then comes the if, so I have to generate a compare, and then comes the branch. This branch ends my current basic block, and I have to generate more or less three new basic blocks: one for the then part, one for the else part, and one for everything which comes after the END. So the condition is one basic block, this statement is one basic block, that statement is one basic block, and then I have to continue after it, so I need a basic block there too. OK, I can do all this in the visitor. But when I do, I end up with basically empty blocks: for this END I generate a basic block; after the then part is finished, I have to jump to it; it is empty, and I have to put something into it, namely a jump past the outer END. So what I generate is a basic block with just a label and a jump to another label, which is useless, looks ugly, and is exactly the thing I do not want to generate. From my memory, Clang does not generate such code, but for LDC I know instances where we do. And that's my motivation: where does it come from? This is a very simple example of where you are forced to generate such code.

A bit more insight into why this happens. If you think about it, it becomes very clear. The abstract syntax tree is a representation of the syntax of the text file, not the text file itself. It's abstracted, because I have removed the keywords and separators, but it's still very close to the textual representation. On the other hand, the basic blocks form the control flow graph. Each basic block, as I explained, is a single flow of instructions, and then there are jumps, or, if you draw the graph, connectors to the next basic blocks. It does not look as easy or as simple as I have put it here; it's much more complicated.
Just think of a switch, for example. But it's very clear from the graphical representation that there's a mismatch between the two. If I start on the left side to generate the right side, it will not be as easy as I tried with my visitor.

So my next question is: can I fix this? Yes. The first option: LLVM is a very good tool, so one approach is to just configure the optimization pipeline in LLVM to do the jump resolution, run it even at -O0, and then it's gone. Let LLVM do the work. That's very simple, but my personal opinion is that it's still ugly. I do not want this.

So I came up with another solution. The main problem is that whenever I generate a jump, I need a target, and I somehow need to know what the target is. In the case of the nested if-then-end construct, I don't know the real target for my jump, and therefore I generate an empty basic block. So what I try to do is superimpose the control flow graph on my abstract syntax tree, which simply means that I add a new pointer to the AST nodes which represent a statement. I call it exitTo: it's the pointer to the statement which starts the next basic block. This can be constructed with a recursive visitor very quickly; that's OK for Modula-2. With this, I can solve my problem.

By the way, there's another construct, LOOP with EXIT, which is similar to a while and break construct in C++. There you need to do some scope resolution: when you see the EXIT, you have to know which LOOP it ends, and you need to be a bit careful about this. But basically I can construct this with a visitor. I then have an additional annotation in my abstract syntax tree, and with this information I can get rid of the empty basic blocks.

OK, that's a very restrictive implementation. What you can also do is go all the way to the full-blown control flow graph. It can be constructed from the abstract syntax tree.
There are algorithms for this. Then I have a control flow graph, and I can generate the LLVM intermediate representation from it. That's very, very costly if you do it only to get rid of some empty basic blocks and a few needlessly generated instructions. But nevertheless, it can be a good thing to do.

So the idea is: I have my abstract syntax tree, and I want to transform it into a kind of high-level flow graph. How? The first step is to get rid of some higher-level constructs. Every programming language has some syntactic sugar, as it is called, which just means a statement can be expressed in terms of lower-level statements. For example, the while loop can be replaced with a loop, which is an endless loop, and an exit. Here on the right side there's the while with the condition, and this can be replaced with a LOOP and an EXIT inside it, which is a lower-level construct. In Modula-2 I can do this for the FOR loop, the WHILE loop, and the REPEAT loop. Also, the Boolean operators AND and OR have short-circuit evaluation, which is defined in the language as being an if-then-else construct, so I can replace them with a nested if-then-else as well. Then I have a lower-level, but in this case still syntax-oriented, tree. Next I can go and replace all the implicit jumps with explicit labels and gotos, and when I do this, I have basically constructed a control flow graph.

There are two things that happen here. The first: if you take this route, think about your debug metadata. Because in the source there's a while statement, and then you do something on your syntax tree, and suddenly there's a loop statement. You still have to remember that it was a while and where the condition was, so this can be very tricky. And when you do this, what you have done is create more or less your own intermediate representation.
It's higher-level than the intermediate representation of LLVM, but a bit lower than your abstract syntax tree. I think it was about two years ago that there was a big announcement from the Rust compiler people. They made a big statement: they had created their own intermediate representation, called MIR. It was very big news, or at least I saw it as a news hype, and I was always interested in what they had really done. And this is what they did: they created a control flow graph with some lower-level constructs and called it their own intermediate representation. It's a very valid approach.

When do you want to do this? It's very helpful, for example, if you have to do some type checking or scope checking, so if you're still working on the semantics of the language. It can also be helpful if you generate a lot of synthetic code, for example cleanup landing pads for exception handling, or some other fancy stuff which might happen. In this case it's very good not to have to work on the syntax tree, because there you always have to use the constructs of the source language; with the control flow graph I can introduce my own instructions, for example the goto. This makes these things much, much easier. And for the Rust people, that was the reason to do it: they had to do some checking which was very, very difficult on the abstract syntax tree, but much easier on this control flow graph. Therefore they introduced it.

When I look at my problem, my Modula-2 compiler, this currently seems to be overkill. With my approach, with this additional pointer to the next basic block, I'm currently fine. This might change when I look at ISO Modula-2, which introduced exception handling, because then there's an additional outgoing arc of a basic block: you can jump to the exception handler. So this may change then. But currently, for the fourth edition of Programming in Modula-2, it's really not required.
And from my point of view, my recommendation: just be careful and check whether you really need to do this complicated stuff, because think of what LLVM does. I generate the intermediate representation and give it to LLVM, and the first thing LLVM does is construct a control flow graph from this representation, because all the optimizations work on this data structure. So it might also be worthwhile, if you think you need something like this, to instead add a new optimization pass to LLVM to do that work. That's also possible. For example, in LDC we do this: when we create closures, we have to allocate memory, and we try to eliminate these allocations or put them on the stack, and there's an extra LLVM pass to do this. It could also be done at a higher level, but it can be done within LLVM; just use the framework you have.

Yeah, that's it from me. So thanks for listening, and if there are questions, feel free to ask.

What about Modula-2 LLVM bindings, so you could make your compiler self-hosting?

So the question is whether there will be Modula-2 bindings for LLVM, and I say no, because that's a lot of work. What I currently do, obviously, is write it in C++; I haven't said it, but the source code is C++. That's really the easiest way, in my opinion, to work with LLVM. I had a look at the C interface, and if I were to do a Modula-2 binding, I obviously would use the C interface. But it's not that complete and not that easy to use. So I really prefer the C++ interface, and no, there will not be bindings for LLVM.

So is that why you didn't use D?

I don't use D in this case because I just wanted to have very few dependencies. My internal goal is to have only the LLVM dependency. That's also the reason I want to get rid of ANTLR. The tool is very cool, I have used it for several other projects, for example writing an SQL parser, and I can only recommend it.
But my goal is to reduce dependencies, and D would be another dependency: a new compiler, a new library. Therefore I currently stick to C++. And I also want to train my C++ coding a bit.

Do you have an opinion on whether to create allocas early on, or to have mem2reg or something generate those in LLVM? Because that's something I went back and forth on.

So the question is whether I have an opinion on where to generate alloca instructions in LLVM. I currently create them up front, but I don't have a strong opinion on it. My main concern in the generation is that I do not want to generate useless instructions. The allocas I have to do somewhere; you can debate where, but you have to do them. The empty basic blocks, though, that's the thing I have a very strong opinion about: I don't want them. That was my main focus. The rest is debatable, and I really have no opinion about it. Anything else?

Was there anything from the Modula-2 language that didn't map well onto LLVM IR?

I haven't found anything yet. For me, the most difficult thing was to understand the scoping. There really are scopes, because they really thought about scaling: if you create a language, you have to think about how it scales to large programs. They had these thoughts, and they created inner modules, which are basically like a namespace, with the module name attached. I had some problems with the scope resolution, but that has nothing to do with LLVM. What I have not yet tried, and I don't know how it fits because I really haven't looked at it: Modula-2 also defines a module for coroutines, and there are some new intrinsics for coroutines in LLVM. But I haven't looked at this; I'm still building the core of the compiler. It would be very interesting to see how this fits. Yes?
The question is whether there is undefined behavior in Modula-2 which can be exploited for optimizations. I'm not really sure. There are areas which are, in my opinion, not clearly defined. For example, what happens if you have nested loops, with while loops and an EXIT inside: what exactly is the scope resolution there? I'm not sure that can be exploited. There are also hints: when you have a FOR loop, like in C++, you have a loop variable, and the report states that after the loop the value of the loop variable is undefined. But in practice it clearly has the last value. So I'm thinking it's worth generating a warning about this, saying: careful, you're exploiting something which is undefined. But again, I'm not sure this is really that exploitable. The language itself is very cleanly defined; I really appreciate this.

OK, could you please repeat? OK. So, other candidates I considered: I had looked at some subset of C. From my experience, once upon a time there was a compiler called Small-C, which handles a very restricted set of C constructs, and I looked at that. But it's also an ancient language. I also thought about a subset of C++, but I ruled that out. C and C++ I ruled out because of the preprocessor: I wanted to have one compiler program and not have to think about an additional layer of the language. And that's basically all I looked at, because then I thought about Modula-2 and found it very nice: a simple language, but still useful. That was the main aspect for me.

How are you going to get rid of ANTLR?

The way it is done in Clang. The syntax of Modula-2 is basically LL(1), I think with one exception, and therefore I will write a recursive descent parser by hand. That's not very fancy stuff, I agree.
A parser construction toolkit is much more comfortable there, but it's also very easy to do by hand; the syntax is not as complicated as C++. So it will be a handwritten recursive descent parser.

How do you test that the output is what you are expecting?

Yeah, so the question is how I test that the outcome is what I expect. Currently there are no tests, but what I intend to do is use the LLVM tooling: there's the lit testing tool, and I'll try to use that. And obviously I will try to create some kind of unit test framework in Modula-2 to run things. But yes, it's a shame: there are currently no tests.

What about coroutines?

I haven't looked. As I said before, the coroutines module I haven't looked at yet; I'm still looking at the core of the language. But coroutines are very interesting, because there are additional LLVM intrinsics for coroutines, and for sure, when I come to that point, I'd like to exploit them. But currently I'm just looking at the core.

We're running out of time for this session. Thank you very much, Kai. OK, thank you.