Luke Imhoff. I'm the maintainer of the IntelliJ Elixir plugin for JetBrains IDEs, which include RubyMine and IntelliJ IDEA itself, but also WebStorm and a whole plethora of IDEs right now. I've contributed to the Elixir standard library and found bugs in the native tokenizer and parser through my work on IntelliJ Elixir, and I help run the Austin Elixir meetup here in town. For viewers that want to follow along now, or once it's on YouTube, this presentation itself is on my GitHub Pages site. You can see the source for it and the source for the project itself. Does anyone need to copy that down before I move on? Okay. This is going to be the process of how I introduced features to IntelliJ Elixir. I'm not going to mention which version stuff was in exactly, mostly because I had to skip around a lot to make this fit in the time, but it's somewhat in timeline order. People may wonder why I took it upon myself to make an IDE plugin for Elixir when I could have just used a Vim or Emacs plugin. There's Alchemist, after all. It's great. I've used both Emacs and Vim. I started with Emacs when I worked at Cray, and only switched to Vim when I needed something that worked over a high-latency, low-bandwidth CDMA modem on the highway when I was working during road trips. I started using RubyMine when my boss at a previous job, Nicholas Canceleri, introduced me to it. I was shocked that an IDE for a dynamic language like Ruby could support find usages, go to definition, and refactoring. I had been used to using ctags with Vim, so the idea that a dynamic language could be understood by an IDE was mind-blowing. I haven't completely abandoned my usage of Vim. I still use Vim with the vim-elixir plugin from the command line. I also use all the Vim stuff from tpope to get syntax highlighting when I need to edit configuration files and that sort of thing. But without RubyMine, I don't think I would be as good of a Ruby developer as I am now.
I use the debugger in RubyMine to understand how the DSLs are implemented for ActiveRecord and for the router in Rails. I also wouldn't have been able to understand the Metasploit Ruby codebase, a legacy project with eight years of history and not a lot of software engineering practices. RubyMine still understands the code even if the code isn't written well. I wanted that power in Elixir. I didn't think we would have bad code, but I knew that the reason JetBrains can have IDEs for so many languages is because they figured out how to do that stuff — refactoring, find usages, go to declaration. It's just part of the API, and it's not language-specific. If I could teach the JetBrains API how to understand Elixir, I could get all those cool features for free. However, just getting the syntax — the lexing and parsing — right ended up taking a year. It was exactly one year between the initial commit of the project skeleton and the version 1.0 tag. Until I wrote this presentation, I didn't actually know that; it was a complete fluke that it was exactly one year. One of the standard ways of defining a syntax is in BNF, or Backus Normal Form — though people argue it's not a normal form, so don't call it that — which was first used for the ALGOL 60 standard which, by its name, was in 1960. Both yecc and Grammar Kit use a form of BNF, so I assumed it was just a matter of porting elixir.yrl to an elixir.bnf. yecc is a parser generator written in Erlang and part of the standard distribution. It's based on yacc — with an A instead of an E — which is a parser generator written in C that a lot of people are used to using when they're doing C-based implementations of new languages. The yecc syntax differs from BNF in that it uses a skinny arrow instead of colon-colon-equals, and instead of using pipe for or, lines with the same rule name are repeated with alternative definitions. So all those grammar rules — all those lines — are actually or'd together here.
Finally, yecc supports running Erlang code on the tokens, using dollar-number for positional references the same way we would with regular expressions. '$empty' is a special token that matches no input; in formal grammars this is usually referred to as lowercase epsilon, which is kind of a curly E if you look at it on Wikipedia. Grammar Kit is a parser generator written in Java and created by JetBrains. It's actually maintained by some of the people that maintain the IntelliJ Erlang plugin. Grammar Kit's BNF format does use colon-colon-equals like Backus–Naur Form, but it has more powerful constructs beyond just pipe for or. It contains the tools we'd expect from extended regular expressions, like question mark, parentheses, and the Kleene star, and you can make a group that matches nothing using question mark or star, or even by having nothing after the colon-colon-equals. Unlike yecc, there is no Java code embedded directly in the grammar. Java code can be attached by saying that a given node implements a given interface, or extends a given interface or class. But it does mean that since there isn't raw code processing the matched tokens, the AST is whatever the grammar says it is. There's no manipulating the AST like you can do in yecc. So it's good that the AST is just there, but it's bad that there isn't direct control over it. It's a balance. yecc and Grammar Kit both support a form of BNF, and Grammar Kit seemed like it would support even more compact grammars, because question mark, star, pipe, and parentheses could eliminate some of the redundancies needed in multi-clause rules in yecc. So after only six days, I had quote-unquote translated the BNF from yecc to Grammar Kit and ended up with a parser that froze the IDE — which Julius Beckmann, who maintains the awesome-elixir list, thankfully reported to me, because my test cases at the time didn't catch the freeze since I wasn't actually testing on realistic projects.
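To make the difference in the two notations concrete, here's a side-by-side sketch. These rules are illustrative only — they're not copied from elixir.yrl or from the IntelliJ Elixir grammar, and the rule and helper names are made up:

```
%% yecc style (illustrative): alternatives are separate clauses with the
%% same rule name, and Erlang code after ':' builds the AST from $N refs
expr -> expr add_op term : build_op('$2', '$1', '$3').
expr -> term : '$1'.

// Grammar Kit BNF style (illustrative): alternatives use '|', and
// ?, *, + work roughly like in extended regular expressions
expr ::= expr add_op term | term
```

The yecc version carries its own AST-building code inline; the Grammar Kit version has no embedded code, so the tree is exactly what the grammar says.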
So this left the slow process of translating the grammar correctly over the next 359 days. At the time of the freeze, I didn't really understand why directly transliterating from one BNF format to the other didn't work. I thought BNF is BNF — why didn't it work? So I went back and actually started to do the JetBrains tutorial step by step instead of jumping ahead, and I searched Wikipedia, trying to find comp-sci articles that explained the correct way to do this. In order to support syntax highlighting and to mark syntax errors with a nice red squiggly underline, IntelliJ Elixir needs to be able to analyze Elixir syntax. Syntactic analysis is usually broken down into two parts. First, the raw text is turned into tokens — so instead of having individual characters, you group words together. Second, those tokens have to be checked that they're arranged in the correct order, which is normally referred to as parsing. In most programming languages, both lexers and parsers are built using generators that have an external DSL. We always refer to DSLs in Ruby and Elixir, but technically those are internal DSLs because they're written in the programming language itself. There's also the concept of external DSLs, where you need something to parse the raw text and you might have embedded code in it. I suppose you could almost say EEx is an external DSL in that sense, but it's not usually referred to that way. In Elixir, the lexer is built using Erlang directly, because Erlang pattern matching is compact enough that a generator is not necessary. Additionally, the Elixir syntax contains some features that a normal lexer generator just wouldn't expect to handle. For IntelliJ Elixir, I used JFlex because it was the lexer generator recommended by the JetBrains plugin tutorial — which, I will say, is super helpful, but the example language in it is really, really simple, so it's not that helpful in the end.
And I had to look at a lot of open-source plugins to try to understand how to implement real language features. For parsing, Elixir uses yecc, as I mentioned, to generate Erlang code that then does the parsing. For IntelliJ Elixir, I use JetBrains' Grammar Kit, which generates Java — IntelliJ itself is written in Java. When I first started working on this, back around day 7, I didn't really understand the difference between the generators. I thought it was mostly just a choice the two communities had made, but there are very important differences. The first step of syntactic analysis is lexing, also referred to as tokenizing. Lexing breaks up the raw text into tokens such as keywords, literals, operators, or identifiers. Input is matched using some pattern. In native Elixir, this pattern matching uses Erlang string prefixes — in Elixir we'd call them charlists — but it's just the normal pattern matching we're used to looking at in function definitions. In IntelliJ Elixir's JFlex file, it's just regular expressions. Both can ignore characters. In Erlang, comments are ignored by simply not including them in the token output. In JFlex, though, I have to output a comment token, because it's an editor: you care where the cursor is, so every piece of text has to be represented in IntelliJ. I can't just say ignore that bit. For EOLs, we have to handle both the Windows line ending and the Linux and Mac line ending. In the native Elixir tokenizer, lines are only really kept track of so the line number can be incremented for the metadata, but I have to keep track of them as whitespace — and I still need to filter out that whitespace before doing the parsing step, for the same reason: it's an editor, you have to know where the cursor is. So the first hard feature I implemented was interpolation.
Before I did interpolation, I also did simple stuff like just lexing base integers — binary, hex, and octal numbers — but that's not terribly interesting, so we're going to cover interpolation first. Adding support for interpolation was tricky. At first glance, the hash curly brace and closing curly brace around interpolation should work just like curly braces in a language like C or Java. But braces in C or Java can be lexed, and then the parser can decide whether they are matched. In languages like Ruby or Elixir that support interpolation, whether you're lexing fragments of a string — where everything is just literal text, you take whatever you see — or normal code totally matters. Additionally, the interpolation can be nested to an infinite depth. There's no limitation in the native Elixir tokenizer like, oh, you're nested five levels deep in interpolation, you can't do that. I did notice that — I want to say it was one of the Emacs plugins — one plugin actually doesn't do infinite nesting, so mine's better. You can almost think of it like when you enter the hash curly brace, it's almost like entering a new file of Elixir — almost the way include works in C. This recursion means that the finite automata that JFlex generates by default can't actually lex Elixir. So what is that gibberish I just said? A finite automaton, also known as a finite state machine, can only lex regular languages, which match formal regular expressions. Formal regular expressions aren't actually the regular expressions we're used to using. Formal regular expressions can have or, they can have parentheses, they can have the asterisk — which is formally called a Kleene star — and they can have question mark, but they can't have back references and they can't have non-greedy matchers, because it actually blows up the computational complexity to support those features.
And unfortunately, in order to get recursive regular expressions, you need back references. The context-free languages, which sit slightly above what a finite state machine can handle, can be lexed by what's called a pushdown automaton, which is just a finite state machine that also has a stack. Every transition can be decided on: I'm in this state — what's at the top of the stack? And you can use that to match things that need to be recursive, like matching parentheses: you see an opening one, you push it on the stack; you see a closing one, you pop it off the stack. If your stack is ever unbalanced — you have too many opening parentheses on the stack, or you try to pop one when the stack is empty — you know the parentheses or braces are unbalanced. The reason you have to care about computational complexity with your parsers and lexers is that finite automata have a bounded runtime, usually some fixed constant times the size of the input. But if you get to Turing machines — most of us have heard of the halting problem — it is undecidable how bad it's going to be when there's bad input. And in an IDE, when you're typing, just by definition, as you type the code you're going to keep hitting bad input, and you don't want the IDE to freeze the first time you type en instead of end. That'd be horrible. In native Elixir, elixir_tokenizer:tokenize calls handle_strings, which calls elixir_interpolation:extract, which calls elixir_tokenizer:tokenize again, which means every interpolation is handled by normal Erlang recursion — the Erlang stack is what handles the nesting. And it's also in a case statement, so it's not tail recursive. So in theory you could blow it up, but I can't actually make it blow up. I'm not really sure why, but I tried 10 million levels and it didn't blow up. Yeah. JFlex's flex DSL only supports creating finite automata, like I said.
So I had to use the literal definition of a pushdown automaton and add a manually managed stack in Java, which I manipulate in the Java code that's executed for each regular expression, in order to emulate recursion — which I'm sure we've all done before in college: here, you can only do it iteratively, but you need a recursive solution, so fake a stack. So if the lexer hits a non-escaped double quote, it enters the double-quoted string state, which treats the hash curly brace as the start of interpolation, which kicks it into the interpolation state. But before entering interpolation, it records the state it came from by pushing that state onto the stack. Interpolation only differs from the normal top-level body of the code in that interpolation treats the closing curly brace specially, to pop the stack and go back to wherever you came from — which could be more interpolation, or the root body state. The first useful expressions I could do after strings were what the parser calls matched expressions. But first, we need to cover associativity. Associativity is how to group repeated operations of the same precedence. I'm using the or operators — the word or and the double pipe — for the left-associative operators, because they're easy to distinguish. Similarly, I'm using what are called two_op operators, which are just operators that are 2-ary — it's kind of a weird name — so double plus and the diamond, list concatenation and string concatenation, for the right-associative operators. These operators are all of the same precedence, and it's just the associativity that changes the nesting. For left-associative, the rightmost operator becomes the root of the tree, with the left operand executing first. So it looks like the operators are in the wrong order when you look at the nesting, but if you rearrange it into pipeline order, you see that it actually executes in the order that you typed it.
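The push-on-quote, pop-on-closing-brace idea can be sketched in plain Java without JFlex. This is a hypothetical checker, not the real lexer: the class, the state names, and the balanced method are all made up for illustration, but the stack discipline is the same one described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the pushdown idea (not the real JFlex lexer):
// track nested "..." strings and #{...} interpolations with an explicit stack.
class InterpolationChecker {
    private static final int BODY = 0;    // top-level Elixir code
    private static final int STRING = 1;  // inside a double-quoted string
    private static final int INTERP = 2;  // inside #{...}

    static boolean balanced(String input) {
        Deque<Integer> stack = new ArrayDeque<>(); // states to return to
        int state = BODY;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (state == STRING) {
                if (c == '"') {
                    state = stack.pop(); // string closed: back to BODY or INTERP
                } else if (c == '#' && i + 1 < input.length() && input.charAt(i + 1) == '{') {
                    stack.push(STRING);  // remember where we came from
                    state = INTERP;
                    i++;                 // consume the '{'
                }
            } else { // BODY and INTERP behave alike, except INTERP handles '}'
                if (c == '"') {
                    stack.push(state);
                    state = STRING;
                } else if (state == INTERP && c == '}') {
                    state = stack.pop(); // back into the enclosing string
                }
            }
        }
        return state == BODY && stack.isEmpty();
    }
}
```

Because the stack can grow without bound, this recognizes arbitrarily deep nesting — exactly what a plain finite automaton can't do.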
For right-associative, reading the tree looks in order, but when you rearrange it to pipeline order, you can see that the execution actually gets rearranged. Furthering our background knowledge, I'll cover the different grammar classes that I learned the two systems use. yecc is LALR — look-ahead, left-to-right, rightmost derivation. The look-ahead is the LA; the left-to-right is just the first L, not the LR part; and the R is rightmost derivation. So this is one of those acronyms where they leave off important parts, and then parts of the acronym look like they mean other parts of the actual expanded form. Grammar Kit generates a parsing expression grammar parser, which is from the family of left-to-right, leftmost derivation parsers — also recursive descent parsers. Rightmost derivation means that the parse tree is actually built up from the very rightmost end of the text, so it looks like it's building up the parse tree by reading backwards. Rightmost derivation is also usually done bottom-up: tokens are assembled into rules, and those rules into greater rules. Leftmost derivation starts from the very beginning of the text, the way the text is actually read, and you start with: I'm going to try to match the entire file at once, and matching the file means matching these pieces — and so it's broken up. That's top-down. You can technically do rightmost derivation top-down, but it makes the stack explode, so no one really does it that way. Importantly — the section where it says left recursion and right recursion — rightmost derivation favors left recursion, because it means the recursion happens at the end. Since the rules are actually read backwards, left recursion, by having the recursive rule as the first thing you'd read, means rightmost derivation hits it last. So it's good there, and it's actually the favored way to write rules, because it keeps the stack really small.
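The associativity discussion above can be shown with a tiny illustrative helper — hypothetical code, not from the plugin — that builds the fully parenthesized grouping for three operands joined by the same operator:

```java
// Illustrative only: how associativity alone changes the grouping of
// a || b || c (left-associative) versus a ++ b ++ c (right-associative).
class Assoc {
    // left-associative: fold from the left, so the LAST operator ends up
    // as the root of the tree and the deepest (left) operand runs first
    static String leftAssoc(String op, String... xs) {
        String tree = xs[0];
        for (int i = 1; i < xs.length; i++) {
            tree = "(" + tree + " " + op + " " + xs[i] + ")";
        }
        return tree;
    }

    // right-associative: recurse to the right, so the FIRST operator is
    // the root and reading the tree top-down matches the typed order
    static String rightAssoc(String op, String... xs) {
        return rightFrom(op, xs, 0);
    }

    private static String rightFrom(String op, String[] xs, int i) {
        if (i == xs.length - 1) return xs[i];
        return "(" + xs[i] + " " + op + " " + rightFrom(op, xs, i + 1) + ")";
    }
}
```

So leftAssoc("||", "a", "b", "c") groups as ((a || b) || c), while rightAssoc("++", "a", "b", "c") groups as (a ++ (b ++ c)).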
But in leftmost derivation, that rule that's on the left is hit first, and so nothing actually gets consumed — you just hit the recursive loop and never eat any input. So in a basic parsing expression grammar, all the rules have to be rewritten to eliminate that left recursion. And unfortunately, this is one of those things in computer science where there's a proof that there is no general way to do it by machine. There are heuristics for humans to apply, but you can't have a machine do it for you in general. So now that we know about associativity, let's look at how yecc does it. The yecc format has a section for declaring both the associativity and precedence of operators. The operator table doesn't completely reflect the precedence of all the operators, because some of the precedences get swapped in the Erlang helper code — for example, so that we can say not in, with the not applying to the in and not to the entire expression. The weird Nonassoc is for non-associative operators, which usually applies to prefix operators like the ampersand for capture, or unary operators like plus to make something positive and minus to make something negative, or the exclamation point and the word not for negation. Higher-precedence operators — with a greater number in the table — can act as arguments to lower-precedence operators, so that, for example, factors or multiplications are arguments to additions. There isn't a unified associativity and precedence table for Grammar Kit's BNF. Instead, the precedence is ordered choice: in a parsing expression grammar, the or isn't try all these options and pick the best one — it actually means try them in this order, even though it uses pipe the same way regular expressions do, which is slightly confusing. So the precedence here is just the order you list them in, which is the same order from the previous slide.
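For reference, yecc's declarations look roughly like this. This is an illustrative sketch — the precedence numbers and token names are made up, not copied from elixir.yrl — but the shape of the declarations is the real yecc syntax:

```
%% Illustrative yecc associativity/precedence declarations
%% (numbers are invented; a bigger number binds tighter)
Nonassoc 100 unary_op_eol.   %% prefix: +x, -x, !x, not x
Left     130 or_op_eol.      %% or, ||
Right    200 two_op_eol.     %% ++, <>
```

Each line names the associativity, the precedence level, and the terminal it applies to, all in one table.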
Instead of stating the precedence of the operators, it's the precedence of the operations. You can see here these are entire operations, with both operands — not the individual plus and minus signs. The pattern for binary operations is a matched expression on either side of the operator as operands. Left associativity is assumed by default, so right associativity is indicated in the rule options in curly braces, with rightAssociative = true. For this entire presentation I've cranked the font size up as much as it'll go inside of Reveal — I'm sorry if it's too small. You may have noticed the rules appear recursive, because there's matchedExpression as the leftmost operand, and I just said that's going to blow up. But all these operations work because of an extension that Grammar Kit gives me. The rules for parsing expression grammars say that left recursion is impossible, but if you declare the operation rules to extend the root expression rule, as I'm doing here with an extends directive, magic occurs and left recursion is okay. And I do say magic occurs, because when it works, it works, and when it doesn't, I just get warnings that complain about left recursion — and since it's recursion, it can't actually say this rule here is the one you forgot to put extends on, because at any point, since it's recursive, it could sample the stack and go, well, these are messed up. So if I mistype a rule name for extends, or I forget to say extends, all of a sudden the magical parsing that makes left recursion okay just disappears, and I get the IDE frozen again. Most of the time when I was talking to José at conferences about why I couldn't get something to work, it was this. Once I trigger Pratt parsing, I get really nice, understandable generated code — so nice it actually tells me how each operator is understood and the precedence order.
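The Grammar Kit side of the same idea looks roughly like this. Again, this is an illustrative sketch, not the actual intellij-elixir grammar — the rule names are invented — but it shows ordered choice as precedence, the per-rule rightAssociative attribute, and the extends declaration that makes the left recursion work:

```
// Illustrative Grammar Kit BNF (not the real intellij-elixir grammar).
// Precedence = order of the choices: lowest-precedence operations first.
matchedExpression ::= matchedOrOperation
                    | matchedTwoOperation
                    | accessExpression

matchedOrOperation ::= matchedExpression OR_OPERATOR matchedExpression

matchedTwoOperation ::= matchedExpression TWO_OPERATOR matchedExpression
  { rightAssociative = true }

// Every operation rule must extend the root rule, or Grammar Kit's
// left-recursion "magic" silently stops working:
{
  extends(".*Operation") = matchedExpression
}
```

If one rule name slips out of that extends pattern, there's no per-rule error pointing at it — the generated parser just loops, which is exactly the failure mode described above.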
So Pratt parsing is an extension of recursive descent parsing — which includes parsing expression grammars — that allows for optimization of operator parsing, by noticing that the pattern most humans would use to remove left recursion and write operator tables can actually be done by a machine. So this is one of those things where hard computer science says there's no general solution, but it turns out there's a non-general solution that works really well for the subset of the problem we actually care about. The optimization involves noticing that eventually all the operation rules get to rules that aren't left recursive, because they're either tokens or they're prefix operations, and so we can eliminate left recursion by looking for those rules. Here, we can first parse the prefix operations — unary operation and at operation — or the identifier expressions, or the access expression, because they're just tokens. Once some of the input is consumed, we then go into the matchedExpression_0 function that does the actual magic of Pratt parsing. The pattern used in matchedExpression and matchedExpression_0 is from Douglas Crockford's top-down operator precedence parsing implementation in JavaScript — a blog post from 2007. Grammar Kit was written, I think, about three years ago, and then that jumps all the way back to the Pratt parsing paper in the 70s. So there are these big gaps between people coming up with new parsers, and it's very hard to Google for these results. In the actual code, g is the right binding power of the currently matched operator. Only operators with a stronger binding power — because they're higher precedence — can match when recursing. But if a stronger rule is matched on the recursive call to matchedExpression, then the while loop allows for matching operators of equal right binding power.
The left-associative in-match operation at the beginning and the or expression at the end each follow a pattern: when you call recursively, you pass the same g that you're testing for, but since the test is g less than six and you pass six, the recursive call can't match that same operator. This ensures that adjacent left-associative operators at the same level are matched by the while loop, so they're actually parsed as siblings in the tree, and the left directive at the very top — where it says marker m equals — actually rearranges the tree, similar to how the Elixir pipelines are rearranged. For right-associative operators — like the when, type, and pipe operators — the if clauses recursively call matchedExpression with the same value of g, so that the recursive call can match the same operation. And since the recursion happens — let me do it right for the camera — on the right side, it turns out the grouping is just right-associative by default, from the recursion. The first calls I implemented were no-parentheses function calls, and that was just a matter of no_parens_expr being the first call I hit in the top-level expr rule in the yecc grammar. I knew from using Elixir that these rules had to match the no-parentheses list calls that I was used to writing, so even though at first I couldn't understand this very complicated grammar, I knew what the code had to look like eventually in proper Elixir. When converting the yecc BNF to Grammar Kit BNF, I needed to combine clauses using the extra features available in the parsing expression grammar.
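The binding-power mechanics described above can be condensed into a tiny standalone Pratt parser in the spirit of Crockford's post. This is a hypothetical sketch — the class, the operator table, and the precedence numbers are invented, not the Grammar Kit output or the real Elixir table — but it shows how recursing with the same precedence for right-associative operators, and a strictly-greater test plus a while loop for left-associative ones, produces the groupings discussed:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Minimal Pratt-parser sketch: atoms are single tokens, operators are
// space-separated, output is the fully parenthesized grouping.
class Pratt {
    // operator -> {precedence, rightAssociative ? 1 : 0}; values invented
    private static final Map<String, int[]> OPS = Map.of(
        "||", new int[]{1, 0},   // left-associative, weakest
        "++", new int[]{2, 1},   // right-associative
        "+",  new int[]{3, 0},
        "*",  new int[]{4, 0});  // strongest

    private final List<String> tokens;
    private int pos = 0;

    private Pratt(String source) {
        tokens = Arrays.asList(source.trim().split("\\s+"));
    }

    static String parse(String source) {
        return new Pratt(source).expr(0);
    }

    private String expr(int minPrec) {
        String left = tokens.get(pos++);              // consume an atom
        while (pos < tokens.size()) {
            String op = tokens.get(pos);
            int[] info = OPS.get(op);
            if (info == null || info[0] <= minPrec) break; // binds too weakly
            pos++;
            // right-associative ops recurse with prec - 1 so the SAME op can
            // match again on the right; left-associative ops use prec, so
            // equal operators fall back to this while loop instead
            String right = expr(info[1] == 1 ? info[0] - 1 : info[0]);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }
}
```

Running it on three operands shows both shapes: "a || b || c" groups as ((a || b) || c) from the while loop, while "a ++ b ++ c" groups as (a ++ (b ++ c)) from the recursion.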
This also helps me reason about the rules, since parsing expression grammars have more regular-expression-like syntax — and as weird as it now sounds to say I can think in regular expressions, it's just true. Mostly because it gets rid of the redundancy, it's much easier to see the branch point where a decision has to be made. So here I'm taking the duplicated rules and combining them together. But then I hit the problem at line three: the matched expression, in order to be optional, has to be in an anonymous group, and the way Grammar Kit makes Pratt parsers, you can't put matchedExpression in a subgroup — it won't work anymore; it has to be a direct child. This meant that for a long time I couldn't figure out how to make this work with Pratt, so I decided instead to break it up into two different rules. I'd make the unqualified calls, which only need an identifier, a direct child of the expressions, and that way I avoid the need for the Pratt parser to understand them. For qualified calls, I can just make that a call operation on a matched expression, so it can use the entire matched expression as a potential name for the function call. This could handle remote function calls with aliases, and local calls off of brackets or dotted off of a map, to actually do the arguments. Instead of having the four clauses — if you look at the actual yecc code, three of those clauses are just there to throw errors — I didn't want to have special error handling for all three of them; I wanted to combine them. So I made a no-parentheses-strict rule in the Java code so that I could search for it and actually use some of the JetBrains API power for the first time, which allowed me to add an inspection to mark the error, and a quick fix. So instead of immediately throwing an error the way Code.string_to_quoted does in Elixir, I can just mark it with the inspection as an error, and then if the user goes over it and
hits alt-enter, I can automatically fix it for them. Being able to correct errors is one of the reasons that an IDE's grammar can be more permissive than a compiler's grammar: permissiveness in the parser itself allows for more robust heuristics to mark the errors later. For valid arguments, call_args_no_parens_many — the three clauses are just a way of saying there need to be at least two arguments, and they can have keywords, but if one argument is positional, the other one must be keyword. Each argument is a call_args_no_parens_expr. There are a lot of comments here because I kept screwing it up. Most importantly, a bug was that I had to put the lookahead of not-keyword-pair-colon after matched expression, because matched expressions can also be keyword pair keys, which I'll explain here: matched expression contains access expression; access expression contains string line; but a string or a charlist is contained in quoted, which is contained in keyword key, which is contained in no-parentheses keyword pair, which is contained in no-parentheses keywords, which is the tail of a no-parentheses call. So I'm now eight steps removed, recursively — and remember, most studies say five to nine pieces of content is what you can keep in your head at one time. So this would keep happening: I'd add new rules, and I'd totally miss the recursive implications of the grammar, and it would all blow up, and I'd be like, great, why did this happen? The native parser doesn't have this problem, because it actually tokenizes quoted keyword-key atoms as keyword identifiers — unlike flex, the Erlang code can just hold on to text until it knows how to identify it correctly. With flex, if I have text, I have to name it as a token, or put it back for someone else to call it a token; I can't just build up text. The no-parentheses function calls on their own were simple enough; getting the no-parentheses calls to be the rightmost operand — and only the rightmost operand — in a matched expression took
up the majority of version 0.2.1. For a long time, I was stuck on how to ensure that a no_parens_expr would only appear on the right-hand side — because if you think about it, you can only have no parentheses at the very end when we write code, because otherwise you'd have two no-parentheses calls next to each other, which would be an error. The actual yecc grammar makes it very clear that you can't do that. And the issue kept being that the automatic error handling built into the Pratt parser would match too early and would not parse the comma, and it would blow up with, why is our comma just sitting there? What finally cracked the problem was when I realized that the no-parentheses calls, by their very nature, would consume function calls to the right of the first no-parentheses function call — either as a positional argument, or, if it was buried in a keyword key, it would just look like a keyword value. So it's not actually that difficult to parse. For the longest time I was hung up on how yecc did it, and so I didn't step back and think about how the code would actually work using the precedence rules for Pratt parsers. Any earlier choice in the ordered-choice ors can choose a later choice as a substitution for the base rule. So, starting with the subset of the rules including dot and alias: the alias is substituted into the lower-precedence dot operation, because lower-precedence operations wrap higher-precedence operations — think additions taking multiplications as arguments. On the right-hand operand, matched call operation is substituted, which can be expanded to another matched expression, which can be an identifier, which finally matches a remote call like Kernel.inspect with no parentheses, taking a range and keyword arguments — structs fall through the same way. As part of version 0.2.1, I had to manually compare the PSI format output by the test cases for IntelliJ Elixir and the output of Code.string_to_quoted in IEx.
It is not easy to compare the quoted form to the PSI tree as shown, so I normally had to indent the quoted form to match the PSI tree. I'd have to do this anytime I changed the nesting of the grammar, such as when introducing new error handling, or making some nodes private so they didn't appear in the final tree — because the JetBrains Grammar Kit API emphasizes not putting nodes in the tree that aren't necessary for understanding the code, since that just bloats the memory. So, right now, looking at it: can anyone tell me if those match? Yeah — the issue is that the actual match operator, the equals, is in the middle, where it's contained inline, and not at the top like it is in the final quoted form. So I'd have to make these translations in my head back and forth, and it's doable, but it's really, really frustrating when you get over a hundred tests and have to do it for every one to make sure there's not a slight error in the associativity rules and so on. The IntelliJ SDK is written in Java, and I knew I could use Java to talk to Erlang through JInterface, so I decided to just have JInterface tell me if the PSI was correct, by having each PSI element in the tree respond to a quote method, which would spit out the JInterface objects that match the output from Code.string_to_quoted. The first step in using JInterface is creating the Java node, using OtpNode: the first argument is the node name and the second argument is the cookie. The local and remote names are very close — the Java node is using a dash while the Elixir node is using an underscore. Once you have a node, you need a pid to send to the Elixir code so it can talk back to you, but Java doesn't have processes; instead it has a mailbox, and you can ask it for its pid. Java doesn't have pattern matching, so when you receive, you just get the first message that's there — but thankfully I didn't have to worry about messages I couldn't match, so I could just call receive. At the time, I hadn't seen Joe DeVivo's talk on
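The JInterface steps just described look roughly like the following sketch. This is not runnable on its own — it needs OTP's jinterface jar on the classpath and a running Elixir node — and the node names, cookie, registered process name, and message shape are all hypothetical, not taken from IntelliJ Elixir:

```java
import com.ericsson.otp.erlang.*;

// Sketch only. Assumes an Elixir node started with something like
//   iex --sname intellij_elixir --cookie intellij
// Names here are illustrative, not the real plugin's.
class QuoterClient {
    static OtpErlangObject quote(String code) throws Exception {
        OtpNode node = new OtpNode("intellij-elixir", "intellij"); // dash on the Java side
        OtpMbox mbox = node.createMbox();                          // Java's stand-in for a process

        // no gen_server call format: just a {pid, code} tuple that the
        // Elixir-side server knows how to reply to
        OtpErlangObject request = new OtpErlangTuple(new OtpErlangObject[]{
            mbox.self(),                                           // the "pid" to reply to
            new OtpErlangBinary(code.getBytes("UTF-8"))
        });
        mbox.send("quoter", "intellij_elixir@localhost", request); // underscore on the Elixir side

        return mbox.receive(); // no pattern matching: first message wins
    }
}
```

The mailbox's self() pid is what stands in for a process identity, matching the "ask the mailbox for its pid" step in the talk.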
The Elixir side of IntelliJ Elixir is very simple: it's just a supervised GenServer that takes a map with the code to quote, runs Code.string_to_quoted on the code, and sends back the status and the quoted form. Originally I thought the supervisor was overkill, but once I started to check that the IDE could parse around errors, like you expect an IDE to do (you don't expect one error to screw up the syntax highlighting of the rest of your file), it turned out bad input would actually kill off the quoting server, and the supervisor would just restart it, so it's great. The only thing I had to do was account for JUnit running tests so fast that I had to increase the allowed number of process crashes per second to about two thousand, instead of the default five.

In yecc, one has to repeat the pattern for quoting the tail of the infix operations for each tail, so the pattern is short, but there's no indication in the code that that tuple of two arguments is the same for a reason; it's almost like a magic number, where you wonder whether the programmer intended it to be this way for all of them on purpose. In Grammar Kit, I can have the nodes implement interfaces, as I said before, and then define methods for those interfaces, so I can say more explicitly that these quote the same way on purpose, by having them implement the same interface. Unfortunately, it's Java, so constructing anything with JInterface takes so many more lines of code than the equivalent Erlang would be. PsiElement has an acceptChildren method, so quoting the file just becomes a recursive action of quoting the things at the top-level scope and telling them to quote their children, and so on. Finally, the top level, any time you have multiple lines, is called a block. The rule in the native parser is very simple: if it's one thing, it just doesn't become a list; if it's multiple things, it becomes this hidden __block__ with the list. I can do the same thing in Java, but once again with a lot more lines of code.
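That block rule is easy to see from iex (a minimal sketch using only Code.string_to_quoted!/1):

```elixir
# One expression quotes to itself, with no wrapper:
1 = Code.string_to_quoted!("1")

# Multiple top-level expressions get wrapped in the hidden __block__:
{:__block__, _, [1, 2]} = Code.string_to_quoted!("1\n2")
```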
The actual comparison just comes down to calling equals on the JInterface objects, but there were some caveats because of how equals is implemented in JInterface. From the basic types section of the Getting Started guide for Elixir, or just by messing around in IEx, we know that sometimes lists of numbers will render as a single-quoted charlist. From this we form a mental model that the charlist is just a formatting decision: a whole list of numbers is rendered as a charlist if all the numbers are printable ASCII. There's even the trick of putting a zero on the end to make it unprintable, if you want it to still look like a list. But the name charlist covers much more than that: an OtpErlangString is returned whenever all the numbers fit into a C char type, zero to 255, and as far as I know José and Robert Virding didn't even know this, because I asked them and they said it was just printable characters. This isn't just an artifact of JInterface; the actual Erlang wire protocol uses a different tag, a different piece of binary data on the wire, to signal the difference between an Erlang string and a generic list. It even goes so far as to have a special tag for an empty list.

The first bug I discovered was that the native tokenizer didn't handle a comment after a dot but before the identifier correctly. For identifiers this doesn't matter, but if the function is an operator name scoped to a module, this becomes a problem. To handle this in my parser, I just made comments stay in the dot-operation state; in the native parser, José had to fix it by doing something similar, where if you see a comment it just gets stripped. The only thing that separates an ambiguous parenthesized expression from a parenthesized function call is the space between the identifier and the opening parenthesis. Worse, a space or no space between an identifier and an opening square bracket are both valid: with no space it's a bracket operation, which uses the Access protocol; with a space the list is an argument to a no-parentheses call. So one space can carry a completely different meaning.
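That whitespace sensitivity is easy to check (a small sketch with Code.string_to_quoted!/1; the name foo is just an arbitrary example):

```elixir
# No space: a bracket operation that goes through the Access protocol.
{{:., _, [_access, :get]}, _, [{:foo, _, nil}, 1]} =
  Code.string_to_quoted!("foo[1]")

# With a space: the list [1] is an argument to a no-parentheses call.
{:foo, _, [[1]]} = Code.string_to_quoted!("foo [1]")
```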
It's even more important for brackets, where one form is what used to be called the access operation (essentially a dict lookup), and the other is a function call taking a list. The Elixir tokenizer handles the no-space and space situations by using a different atom to signify the identifier. I didn't want to do that, so instead I used a little trick that I didn't think would work: I issued a token that represented no text. I called this token CALL, and with it I was able to match on the fact that there had been no space between the identifier and the opening bracket or opening parenthesis.

Stabs contain the stab operator, the arrow that I was used to from anonymous functions, which is used to map anonymous function arguments in function clauses; stabs are also used in pattern matching in do blocks. But from the parser's perspective, stabs don't actually need a stab operator; they can just contain an expression, such as the first part of a try block. That also means, since stabs can occur in parentheses, that any parenthesized group is a stab, so one plus two in parentheses is also a stab, which is weird because there's no stab operator in it. To get a plain group, you go through the access expression, you get the stab expression, you get the opening parenthesis, you get the expression itself, and then you get the closing parenthesis.

The other part of stabs that I couldn't figure out was how any of the anonymous function clauses could have more than one line in their body, because the grammar only has a rule for a single expression; there is no expression list like the one blocks use. But then I noticed that there is a pseudo expression list, because the second rule for stabs says a stab can be a stab followed by an EOL and a stab expression, so you can get a bunch of EOL-separated expressions on the end. That still leaves the problem that those expressions aren't associated with the part before the stab operator, which is the pattern match for the function clause. So this is
where the yecc grammar is a little sneaky: it doesn't actually build up the parse tree a function clause at a time. Instead, the stab operations and plain expressions are all siblings, and the Erlang build_stab function, called in the access expression, merges all the adjacent expressions into the children of the prior stab operation until the next stab operation is hit. Like I said, I thought this was a dirty trick, having the grammar not actually represent the AST, and I couldn't do it that way in Java, because the structure of the PSI tree is used for features like find usages, so I couldn't rearrange it. This is one of those places where yecc building the AST on its own is somewhat more helpful. The key to making it work in Grammar Kit, where the AST has to be built directly, is the negative lookahead in the stab body expression (the second line, which is private). All it does is check whether this is the start of a new stab expression, and if so, stop the current expression. So instead of grouping after the fact, the parser is constantly looking ahead, and when it hits another stab it says: okay, I'll stop this one, group it together, and then go on to the next one.

Unmatched expressions get their name from the fact that the two operands in a binary expression don't have to be the same type, unlike a match expression, where they do. The other important characteristic, and the one everyone wanted, is that they contain do blocks, because up to this point IntelliJ Elixir couldn't handle the most rudimentary Elixir module; it couldn't actually handle a defmodule with its do/end block. Block expressions are what would be expected from the previous parts of the grammar and the Elixir production code: identifiers or function calls followed by a do block. Do blocks contain stabs, which we've already covered, so the only thing new here is the block list. A block list is a list of block items, which are block EOLs, which are block identifiers that can be followed by an optional stab.
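In source form, the multi-expression clause bodies that rule allows look like this (a minimal sketch; the function and variable names are arbitrary):

```elixir
# Each clause body after -> is a sequence of EOL-separated expressions
# that the parser has to associate with the pattern before the arrow.
classify = fn
  x when x > 0 ->
    doubled = x * 2 # first expression of this clause's body
    doubled + 1     # second expression of the same clause
  _ ->
    0
end

3 = classify.(1)
0 = classify.(-5)
```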
So what's a block identifier? Block identifiers are the keywords after, else, rescue, and catch, which means those are real keywords in Elixir. Due to the power of macros, we don't have if and unless as keywords, they're just identifiers, but the else inside an if or unless is a keyword. So we can't invent new keywords for Elixir, we can only rearrange them; in theory you could make some new macro that has more than one else.

The dangling-else problem is a problem in languages like C or Java that have an optional else: when nesting, it can be resolved by saying either that the else binds to the outermost if or to the closest if. The same problem is possible with do blocks in Elixir when resolving which function call a do block binds to. In the actual Elixir grammar, the do block always binds to the outermost function call, which turns out to be an easy choice for rightmost-derivation parsers like yecc. But for leftmost-derivation parsers like Grammar Kit, all the textbooks told me "don't do this" and then moved on to the next section; it was just, yeah, don't do that, choose the other way, it's easier. It turned out this was only a problem with LL(1)-type parsers, and Grammar Kit isn't an LL(1) parser. LL(1) means left-to-right, leftmost derivation, with only one token of lookahead, but with parsing expression grammars, as I showed, I can do as much lookahead as I want.

When adding unmatched expressions to the Elixir BNF, I actually ended up taking match expressions out of the top-level expression rule, which, if you think about it, looks weird: how do I get match expressions anymore? Well, it turns out the important distinction is that unmatched expressions can either have blocks or not, but match expressions must never have blocks, because to get the outermost binding we want to say that the outermost function call takes a do block or not, but every one of the arguments to that function must not take a block, or it screws up the syntax. So while in the yecc grammar a block
expression is a match expression followed by a do block, in Grammar Kit's grammar a match expression is an unmatched expression that can't have a block.

Before releasing version 1.0.0, I wanted to make sure I hadn't missed some syntax that was valid but that I hadn't thought of when manually writing the Elixir test cases. I knew I couldn't write QuickCheck rules for this, because if I couldn't understand how to write the grammar once, writing it a second time wasn't going to make it any better, and doing generative testing with QuickCheck would require me to write the grammar rules all over again. So I decided to parse all of elixir-lang/elixir as a test case. I did find bugs in my grammar, but I also found bugs in the native tokenizer and parser again.

The first one was with function captures. Function captures have four variants: you can capture a remote function, where you have an alias and then a function and the arity; you can capture a local function with the arity; but you can also have weird ones like ||/2, or, with the actual word, or/2. It turns out the or form had an error before Elixir 1.1, while the captured name/arity pair does end up quoting just like division. I ended up needing special handling to be able to lex or as an operator, because parsing it as just the or operator made the parser super confusing; it was just too much to handle, so I actually say it is an operator token. Unlike the other rules, I can do this in one rule because I have regular expressions handy. To fix it in Elixir, John Isaac Stone defined a new Erlang macro to look for the operator keywords and then check those as keywords, but only if there's a capture operation. So in my version I looked ahead to see if there was a slash; in his version he checked behind to see if there was a capture. The third bug I discovered had to do with piping blocks.
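Both of those operator-keyword captures can be checked against today's parser, where the pre-1.1 bug is fixed (a small sketch using Code.string_to_quoted/1):

```elixir
# The captured name/arity pair quotes just like a division:
{:ok, {:&, _, [{:/, _, [_name, 2]}]}} = Code.string_to_quoted("&or/2")
{:ok, {:&, _, [{:/, _, [_name2, 2]}]}} = Code.string_to_quoted("&||/2")
```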
Surprisingly, there was only one case in the standard library that actually did this, so if people had stylistically decided "we will never pipe blocks", this would never have been caught. The problem is that if you have non-block match expressions and you pipe them, the associativity is maintained, but if you start with a block and then pipe, it makes it look as if the pipe operator is right-associative. We're using pipes everywhere here, so it's kind of hard to read, but look at the line numbers: the pipe on line three should happen last, with one and two piped together before three is piped onto the end. But when one takes a block, two and three get piped together instead, and then one is piped into that group. I wasn't actually sure about this; I just assumed that I had missed some Erlang post-processing, but José Valim actually confirmed that it was a bug in the native parser: because of the way the rules are written in yecc, it didn't look far enough to find the best match, and so it screwed up the associativity. Because I use a different class of parser, one that doesn't do shift and reduce operations, I was able to stumble upon this bug. Having more test cases, because I didn't trust my own understanding of the original grammar and how I translated it to Grammar Kit, also let me find the other bugs. An alternative implementation is a good way to ensure that the tokenizer and parser for a language have covered all the edge cases. One year, one thousand three hundred fifty-six commits, three bugs in native Elixir, and the parser was complete, and I'm only 12 minutes over.