Thank you. Can everybody still hear me? Yeah. Okay, so just a quick overview. Essentially, I'm going to give you probably the most detailed look into Hello World you've had since you first wrote Hello World. I'm effectively looking at how you would write a very, very basic compiler that compiles down to JVM bytecode and runs on the JVM.

So, as the slide says, I'm a senior consultant for Shine Technologies. We've got offices in Melbourne and Brisbane; we've just recently opened the Brisbane office. Me personally, I'm a hobbyist compiler enthusiast. I don't actually work with compilers in the field; it's just something that I do for kicks. My first real exposure to compilers was probably the CPython code base, where I was sort of responsible for implementing the try-except-finally syntax. For those of you who aren't too familiar with Python, up until Python 2.5 you could only do try-except or try-finally. You couldn't actually use both in the same try suite. It was just really nasty and it really irritated me, so that was heaps of fun. Then in 2.6 I worked on something that sounds really, really scary: the compilation of ASTs within Python. So essentially, an abstract syntax tree is like an in-memory logical representation of a program. You basically take source code, run it through a scanner and a parser, and you get an AST at the end that represents your program. I basically implemented code that could compile ASTs down into Python bytecode. So blah, blah, blah: essentially, all of my background, as far as compilers are concerned, is with Python, and yet I'm here talking about the JVM. The reason for that is essentially just for kicks. As I said, I'm an enthusiast; I enjoy playing around with these different things.

So yes, if you're going to write a compiler that targets the JVM, you sort of have to ask: why would I do that? I mean, isn't Java slow? Java's not really slow. Its startup time leaves a little bit to be desired.
That's probably my biggest issue with the JVM. But for long-running processes, it's pretty highly optimized. The stuff I'm going to show you today is probably not going to be a long-running process, but still, it should give you some idea of how you'd go about doing this stuff. I've mentioned memory management here: it's really, really nice when you're working with compilers not to have to worry about memory management in general. Writing compilers in C can be a little bit hairy in that regard. And the JVM is enterprise-friendly. It means you can compile to class files and deploy them to JVM-friendly enterprises, and they're not going to complain about it. Just recently we had to deploy some PHP stuff, and we had to jump through a few hoops just to make that happen. With the JVM, that sort of stuff doesn't become an issue too often, unless you're in a Microsoft shop.

Okay, so compiler construction 101. This is a generalized compiler architecture. Not every compiler you look at will follow this exact structure, but it'll be some slight variation of it, generally speaking. So you've got a scanner, which takes your source code, which might be Java or C++ or whatever, and converts it into tokens: it identifies words within the source code stream. That's an if keyword, that's a bracket, that's a string, that sort of thing. Then that stream of tokens is fed into a parser, and the parser takes that stream of tokens and effectively structures it. The way it structures it tends to be into a tree sort of form. In my particular case I'm going to use an abstract syntax tree, but back in the day they also used a thing called a parse tree, which was sort of more at the grammar level than the logic level, which is what the AST is more targeting. Hopefully that makes sense. Then the AST is passed into a code generator, which generates your target code.
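To make the scanner stage concrete, here's a minimal sketch in Scala of a tokenizer for the little puts language described later in the talk. This isn't the speaker's code; the token names and structure are my own illustration.

```scala
// A toy scanner for the puts language: turn source text into a flat token stream.
object Scanner {
  sealed trait Token
  case class Ident(name: String) extends Token   // identifiers and keywords, e.g. puts
  case class Str(value: String)  extends Token   // string literals, quotes stripped
  case class Punct(c: Char)      extends Token   // ( ) ;

  private val ident = "[A-Za-z_][A-Za-z0-9_]*".r
  private val str   = "\"[^\"]*\"".r             // no escaped quotes, as in the talk

  def tokenize(src: String): List[Token] = src.trim match {
    case "" => Nil
    case s if "();".contains(s.head) =>
      Punct(s.head) :: tokenize(s.tail)
    case s =>
      ident.findPrefixOf(s) match {
        case Some(id) => Ident(id) :: tokenize(s.drop(id.length))
        case None =>
          str.findPrefixOf(s) match {
            case Some(lit) => Str(lit.drop(1).dropRight(1)) :: tokenize(s.drop(lit.length))
            case None      => sys.error(s"unexpected input: $s")
          }
      }
  }
}
```

Feeding it the Hello World source yields exactly the token sequence described on the slide: an identifier, a bracket, a string, another bracket, a semicolon.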
The cool thing about the AST is that it's source language and target language neutral. So you can effectively take the AST and write a JVM code generator, and then write a .NET code generator, or whatever you can think of: JavaScript, whatever. That's sort of the power of the AST.

Okay, so Hello World. Can everybody understand this program? Yeah, so hopefully that's pretty straightforward. But what I want to do is break this down in terms of how a scanner and a parser would look at it. So effectively a scanner would go through and say: okay, we've got this identifier or keyword here, followed by a bracket, followed by a string, followed by another bracket, followed by a semicolon. That's essentially the stream of tokens that gets fed into the parser. If we take that up another level, we can express a puts statement as the puts keyword, I guess you could call it, followed by a bracket, followed by a string, followed by another bracket, followed by a semicolon. A string is defined by the regular expression that starts with a quote, has anything that isn't a quote in between, and ends with a quote. Obviously you can't escape quotes in strings in this particular compiler.

So that's all well and good, but if we wanted to make this parser a little bit more useful, a little bit more general, we could take the puts statement and turn it into a puts function call. So all of a sudden that puts(...) is just a function call to a function called puts. We say: okay, a call consists of a name, which is an ID matching that regular expression down there. It looks really scary, but it effectively starts with any alphabetical character or an underscore, followed by any number of alphabetical characters, underscores or digits. Hopefully that makes sense. Obviously not Unicode-friendly, but you know. So okay, our function can take a string.
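As a sanity check, those two terminal definitions can be written directly as Scala regular expressions. The names id and str here are mine, just for illustration:

```scala
// The two terminals of the toy grammar, as Scala regexes.
object Terminals {
  // ID: a letter or underscore, then any number of letters, digits or underscores.
  val id = "[A-Za-z_][A-Za-z0-9_]*".r
  // String: a quote, anything that isn't a quote, then a closing quote.
  // (No escaped quotes, exactly as noted in the talk.)
  val str = "\"[^\"]*\"".r

  // True when the whole input matches the regex (not just a prefix).
  def fullMatch(r: scala.util.matching.Regex, s: String): Boolean =
    r.pattern.matcher(s).matches()
}
```

So Terminals.fullMatch(Terminals.id, "puts") holds, but an identifier can't start with a digit, and a string containing an unescaped quote won't match.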
We could take that a step further and say: okay, the call takes any number of arguments, which can be any type of expression, and an expression can be a string or a call, or technically a name would probably work as well. And then up the top I've just defined a statement as an expression followed by a semicolon. Because we've defined an expression as potentially being a call, we don't want to have to put a semicolon at the end of every call that we make, only at the end of a statement. Hopefully that makes sense. So you start out with these really simple grammars, and you build on them and expand them and generalize them, and all of a sudden you've got a much more expressive language that you can easily go and implement.

The way that you can implement this stuff, at the scanner and parser level anyway, is using Scala's parser combinators, which are effectively a DSL for describing parsers in Scala. So you can say: I expect this token; I expect this keyword, followed by a bracket, followed by an expression. Then, based on those parse results, you can construct an AST, or you can execute an action: you can actually evaluate programs as you go along, if you're feeling masochistic. And yeah, AST: abstract syntax tree.

So Scala's parser combinators look like this, down the bottom. You've got a grammar up the top, an expression is a string or a function call, and down the bottom it's more or less identical to the grammar. Back up the top again, you've got: a call is a name, followed by a bracket, followed by args, followed by another bracket, followed by a semicolon, and again down the bottom, that's effectively combining the tokens as you go along. So you're saying: okay, this is a name followed by something else. Up there you'll see there's a tilde, and what look like little arrows pointing in towards the args (the ~> and <~ operators). That's effectively saying: ignore everything to the left, ignore everything to the right, we just want the stuff in the middle.
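If you want a feel for what's happening under the hood, the combinator idea itself is small enough to sketch by hand. This is my own toy version, not Scala's actual library (which spells sequencing ~, alternation |, and the discarding variants ~> and <~); it treats a parser as a function from input to an optional result plus the remaining input:

```scala
// A hand-rolled illustration of the parser-combinator idea.
object MiniCombinators {
  // A parser takes input and either fails (None) or returns a value and the rest.
  type Parser[A] = String => Option[(A, String)]

  // Match an exact literal, e.g. a bracket or semicolon.
  def lit(s: String): Parser[String] =
    in => if (in.startsWith(s)) Some((s, in.drop(s.length))) else None

  // Match a regular expression at the front of the input.
  def re(r: scala.util.matching.Regex): Parser[String] =
    in => r.findPrefixOf(in).map(m => (m, in.drop(m.length)))

  // Sequencing: p followed by q (the real library spells this ~).
  def seq[A, B](p: Parser[A], q: Parser[B]): Parser[(A, B)] =
    in => p(in).flatMap { case (a, rest) =>
      q(rest).map { case (b, rest2) => ((a, b), rest2) }
    }

  // Alternation: try p, fall back to q (the real library spells this |).
  def alt[A](p: Parser[A], q: Parser[A]): Parser[A] =
    in => p(in).orElse(q(in))

  // Keep only the middle result, like "(" ~> args <~ ")" in the real library.
  def between[A](open: String, p: Parser[A], close: String): Parser[A] =
    in => seq(seq(lit(open), p), lit(close))(in).map {
      case (((_, a), _), rest) => (a, rest)
    }

  val name   = re("[A-Za-z_][A-Za-z0-9_]*".r)
  val string = re("\"[^\"]*\"".r)
  // A simplified call: a name followed by a single bracketed string argument.
  val call   = seq(name, between("(", string, ")"))
}
```

Running MiniCombinators.call over puts("hi") succeeds and hands back just the name and the argument, with the brackets discarded, which is exactly the clean-up the ~> and <~ arrows give you in the real library.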
And that's grouped by the brackets around the quoted bracket literals, so that you don't discard the name that you originally parsed. So that's effectively saying: name followed by args. Hopefully that sort of makes sense.

Okay, so with the parser combinators briefly discussed, I'll run through a quick proof of concept at the end so that you can actually see what this stuff looks like all together. Hello World in AST form, for our simple little language, looks something like that: a call has a name, which in our case is puts, and a list of arguments, which is the string literal "Hello, world". You could also take that a step further and say that a program actually consists of a list of statements, and in this particular case our program consists of one statement. I'm not going to go into that sort of detail because I probably don't have time, and I certainly didn't have the time to write the code to go and do that. I wrote this last night, so apologies. So all our programs in this little demo will just be a single statement, a single expression.

So how do we represent ASTs in Scala? What I tend to use (and this is purely my choice; it might make sense to do something different for whatever language you decide to write for the JVM, if you feel so inclined) is a trait for all AST nodes, so every AST node you can represent, those little circles in the diagram; a trait for expressions and a trait for statements, just so you can differentiate between the two; and case classes for everything else. Case classes are really nice because once you've written your parser and constructed your AST, you can basically just call print on the resulting AST object and it'll actually print out the tree as it looks after parsing. It's really cool stuff, and just so easy. So that's what the code would actually look like for our language. I may have forgotten something.
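Here's roughly what that shape looks like in Scala. The node names are my guesses at the sort of thing the slide showed, not the exact code:

```scala
// Traits for the node families, case classes for the concrete nodes.
object Ast {
  sealed trait Node
  sealed trait Expr extends Node                // things that produce a value
  sealed trait Stmt extends Node                // things that stand alone

  case class Name(id: String)                   extends Expr
  case class StringLit(value: String)           extends Expr
  case class Call(name: Name, args: List[Expr]) extends Expr with Stmt
  case class Program(stmts: List[Stmt])         extends Node

  // Hello World as a tree: one statement, a call to puts with one string argument.
  val hello: Program =
    Program(List(Call(Name("puts"), List(StringLit("Hello, world!")))))
}

// Case classes give you a readable printout of the tree for free:
// println(Ast.hello) prints
//   Program(List(Call(Name(puts),List(StringLit(Hello, world!)))))
```

That free toString is the "just call print on the resulting AST object" convenience mentioned above; sealed traits also get you exhaustiveness checking when you pattern match over nodes in the code generator.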
So then, once you've got your AST constructed, you want to generate your JVM bytecode, and for this particular example I've used a library called BCEL (the Byte Code Engineering Library): you effectively just generate Java bytecode equivalent to what's in your AST. In addition to that, I've also gone ahead and implemented puts using BCEL, so that we actually have a puts function to call, because otherwise you'd call the function puts and it wouldn't exist. The other thing that's a little bit tricky with generating JVM bytecode is the verifier. It will hunt you down. It's kind of hard to give specific examples, but you'll get really weird things: your code will compile, you'll generate a class file, you'll go to run it using java, and you'll get an error like the stack size being too large. So there's a little bit of trickery that you need to do to make sure that BCEL generates the expected output.

Okay, so I'll give you just a brief look at something that I whacked together quickly last night. Oh sure. Can everybody read that?
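For flavour, here's roughly what generating a Hello World class with BCEL looks like. This is my own sketch, assuming Apache BCEL 6.x on the classpath; the object name GenHello and the structure are illustrative, not the speaker's actual code. Note the setMaxStack/setMaxLocals calls at the end: letting BCEL compute those is part of the "trickery" that keeps the verifier from complaining about stack sizes.

```scala
import org.apache.bcel.Const
import org.apache.bcel.generic._

// Emits a Program.class whose main method prints "Hello, world!".
object GenHello {
  def main(args: Array[String]): Unit = {
    val cg = new ClassGen("Program", "java.lang.Object", "<generated>",
                          Const.ACC_PUBLIC | Const.ACC_SUPER, null)
    val cp = cg.getConstantPool
    val il = new InstructionList()
    val f  = new InstructionFactory(cg, cp)

    // Equivalent of: System.out.println("Hello, world!");
    il.append(f.createGetStatic("java.lang.System", "out",
                                Type.getType("Ljava/io/PrintStream;")))
    il.append(new PUSH(cp, "Hello, world!"))
    il.append(f.createInvoke("java.io.PrintStream", "println",
                             Type.VOID, Array[Type](Type.STRING),
                             Const.INVOKEVIRTUAL))
    il.append(InstructionFactory.createReturn(Type.VOID))

    val mg = new MethodGen(Const.ACC_PUBLIC | Const.ACC_STATIC, Type.VOID,
                           Array[Type](new ArrayType(Type.STRING, 1)),
                           Array("args"), "main", "Program", il, cp)
    mg.setMaxStack()   // let BCEL compute stack depth: keeps the verifier happy
    mg.setMaxLocals()
    cg.addMethod(mg.getMethod)
    cg.getJavaClass.dump("Program.class")
  }
}
```

A real compiler would walk the AST and emit one such call sequence per node, but even this fixed example shows the moving parts: a constant pool, an instruction list, and a factory for the higher-level instruction patterns.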
Yep. So as I was talking about before, we've got the traits, we've got the case classes, and then back here I'll just go back to our grammar. Okay, so down the bottom here we've got these terminals, which are effectively just regular expressions. You're literally just reproducing those parts of the grammar in Scala code and then building on that. So we've got a statement... oops... a statement. So again, the wiggly arrow says discard everything on the right: even though we want the semicolon to be there, we don't really care about the fact that it exists. You'll notice that I've had to declare the return type of the expression method, and that's because it's a recursive call: because a call can take an expression in its list of arguments, it'll actually recurse into the expression function.

So okay, we've got our list of args here. repsep says zero or more of these, separated by that: so zero or more expressions separated by commas, where an expression is a call or a string. At the moment there's only one function, which is puts, and it doesn't have a return value, so it doesn't really mean much to pass a function call into another function call at this stage; it's purely to demonstrate the concept.

So what you actually see is, once you've got all this stuff parsed, you've got these little guys that look like a smiley face (the tilde), and you then effectively pattern match against the result. As I was saying before, you're discarding the bracket literals, so when you actually match against that, all you need is the name and the args, because everything else has been discarded. Otherwise you'd have to do name followed by nothing followed by all that sort of thing, so discarding just cleans the code up a little bit. And then obviously you structure that into a call. This all works because the name and args parsers return the name and the args for you: you can just pass them into the call, and so on and so forth. So name in this case will be an instance of Name up here,
because this name method takes the regular expression, matches against it and returns a new Name object. Is that sort of clear? It's kind of hard to explain, but hopefully the code demonstrates it anyway. And then we parse the input file. The entry point is the statement parser; if we get a success, we generate a class with the name of the source file and the parse result, which is our statement, which is our call, actually. Then generate class, that's all your BCEL stuff. I probably don't have time to go into that right now, but effectively you just generate Java bytecode using built-in classes in BCEL.

So I'll just quickly build that. I should have built this beforehand; I'm very prepared. Okay, so let's call it sample.lca. I've written a little bash script that just calls that compiler class I showed you before, sets the class path for you and all that sort of thing. We'll dump it into program.class, and we get hello world on the other end. And if you dump out the bytecode, you can actually see all the stuff that BCEL is generating for you. Using javap is a good way to figure out what bytecode you need to generate, if you're not quite sure how to implement what you need to do.

So that's pretty much it. Any questions?

Question: Fantastic. How widely usable is the parser combinator library? I know that Parsec has quite a few edge cases with performance issues. Is this something you could use for something really big?
You need to be careful about how you write the grammars. It's a backtracking parser and, as you say, there are some edge cases, like in Parsec, where it can get into a bit of a nasty state because it's gone down one massive tree and then gone "I can't go any further", so it goes back up the tree and then tries another massive tree. So it is possible to write inefficient parsers, but so long as you stick to LL(k) grammars or something like that, you shouldn't have those performance issues. So it can handle a whole bunch of different types of grammar, but if you stick to LL(k) it sort of ensures that performance, I think. Hopefully that answers your question. Any more questions? Okay, fantastic. Everybody please thank Tom Lee.