Thank you. So I'm Naomi, and I'm going to talk about the Menhir parser generator for Rust: Rust parsing in general, and more specifically the Menhir LR parser generator. A short disclaimer first: I put a lot of things into the description of my talk in the CFP because I really, really wanted it to be accepted, but eventually I was assigned a half-hour slot, so I will definitely not have the time to cover everything that was in the abstract. Sorry about that.

So let's start at the beginning. What is Menhir? Menhir is an LR parser generator which was initially designed for OCaml. That's it: it is written in OCaml and produces OCaml parsers. It can also produce Coq parsers. It's developed at Inria, the French national research institute where I work, and it's very widely used in the OCaml world.

I would have liked to dig a bit more deeply into what LR parsing is, but I won't have the time for that, so let's just recap the basics. If you know yacc, LR parsing is basically the class of grammars that yacc handles. Well, in fact the old yacc handles only LALR, which is a more restricted class of grammars, but modern yacc-likes handle the full LR class. It's the class of context-free grammars which do not contain any ambiguity that could only be resolved with broader context, and you can parse them using a finite-state automaton augmented with a stack of states for the context, a pushdown automaton. It has the very useful property that its runtime complexity is linear in the size of the input.

LR parsing is better explained with an example, so here is an example of a Menhir parser file. If you know yacc, it should be very familiar; it has really the same shape as a yacc file. You have at the top Rust definitions, which will be included verbatim in the output, then definitions of the tokens and their types. Notice that we have to include the type for every non-terminal, like expr here, and not only for the tokens; this is because Rust's type inference is much less powerful than OCaml's. Then we include the double percent sign, just like yacc, for absolutely no reason, and then we have a list of syntax rules which look roughly like BNF grammar definitions. So this kind of parser is quite easy to write if you already have a BNF definition of your grammar: you just add semantic actions, which in our case contain Rust code. This is a valid Menhir-for-Rust parser which can be compiled down to Rust code.

My work has been to port Menhir to Rust, which has been relatively easy because Menhir was already designed to support several languages. It has in fact two backends for OCaml: one which produces tables encoding the transition tables of the automaton, and another which encodes the automaton as a nest of recursive functions. And a backend for Coq. Because yes, by the way, Menhir can produce proved parsers. Menhir is not proved itself, but there's a technique in which a certified tool, a tool which is itself verified, is used afterwards to check that the produced automaton indeed matches the same language as the input grammar. That's the technique used in the CompCert certified compiler for the C language: it uses Menhir in this way to verify its parser, and it works. So Menhir is a pretty solid piece of software.

So what I've done, essentially, is that I wrote a table-based backend for Rust, table-based because it's easier.
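To give a concrete idea of the parser file described a moment ago, a minimal grammar in this style might look something like this. The token names, the types, and the exact details of the Rust-backend syntax here are illustrative assumptions, not the file from the talk:

```
%{
// Rust code copied verbatim into the generated output
use crate::ast::Expr;
%}

%token <String> IDENT
%token LAMBDA LPAR RPAR DOT EOF
/* unlike with the OCaml backend, every non-terminal needs an
   explicit type annotation, not just the tokens */
%start <Expr> main
%type <Expr> expr

%%

main:
| e = expr EOF                   { e }

expr:
| LAMBDA x = IDENT DOT e = expr  { Expr::Lambda(x, Box::new(e)) }
| x = IDENT                      { Expr::Var(x) }
| LPAR e = expr RPAR             { e }
```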
Maybe in the future I will try to write a code-based backend, but for now it's a table-based backend, and it's fast enough. That's how it looks. It's ugly. I omitted most of it, but you can recognize the structure of a finite-state machine table: you have the action table, which, when indexed by a state number and a token number, gives the next action. And below that, a set of rules which contain the code of the semantic actions that will be executed when a reduction is performed. Notice the matches, which are boilerplate code generated by Menhir to extract the semantic data out of the stack and push the result back onto the stack afterwards. Here, underlined by the comments, is the verbatim user code that has been copied from the parser definition. What those matches actually do is unwrap the semantic values from the stack: the stack contains data of several types, so the values are tagged, and those type checks are actually completely useless, which is why the other case is unreachable. Most parser generators, not even just modern ones, use unsafe code to extract the data, but for now Menhir uses this type check, which was pretty useful during development.

Fun story: what triggered the original authors of Menhir's interest in writing another parser generator was that they discovered that those dynamic type checks, or that unsafe code, could be eliminated and replaced with GADTs; that's why they wrote Menhir in the first place. But at the time OCaml did not support GADTs. Now it does, but the backend of Menhir would be too complicated to rewrite, so the reason why Menhir was written in the first place has never actually been implemented.
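Since the slide is omitted, here is a minimal, self-contained sketch of the general shape being described: an action table plus a tagged stack that each reduction has to unwrap. All names here are my own illustration; the real generated code differs in detail:

```rust
// Shape of a table-based LR parser (illustrative sketch only).
#[derive(Clone, Copy)]
enum Action { Shift(usize), Reduce(usize), Accept, Error }

// Semantic values of all symbols share one stack, so they are tagged.
enum Value { Int(i64), Expr(f64) }

fn reduce_example(stack: &mut Vec<(usize, Value)>, goto_state: usize) {
    // Generated boilerplate: unwrap the tagged value. The other arm is
    // statically unreachable; the tag check exists only as a safety net.
    let (_, v) = stack.pop().expect("stack underflow");
    let i = match v {
        Value::Int(i) => i,
        _ => unreachable!(),
    };
    // --- verbatim user code from the semantic action would go here ---
    let result = i as f64;
    // ------------------------------------------------------------------
    stack.push((goto_state, Value::Expr(result)));
}
```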
Okay, so how do we use that code now? The packaging was really the biggest problem here, because we have to distribute something which is a mix of OCaml and Rust code. The generator is written in OCaml. The runtime library, I forgot to mention it, but in this example you can see that the code that actually interprets the table, that actually executes the automaton, is not in the generated output: it lives in a separate crate, referenced at the top, which is the Menhir runtime, and it has to be distributed too. So you have to distribute a generator written in OCaml and a runtime written in Rust.

The first option was to use opam, the OCaml package manager, which is obviously not a good option: when you're a programmer writing Rust code and you want to use Menhir, you shouldn't have to know anything about OCaml or how to use opam. And likewise, if you are just a Rust user and you use a library which itself uses Menhir, you shouldn't have to install the OCaml package manager just because some dependency of a dependency wants it.

The other option is cargo install, the cargo feature that lets you install packages containing a binary. But this is not sufficient, because we still need to install the runtime alongside it; plus a package cannot depend on a cargo-installed binary, so we would have exactly the same problem: if you use a crate which itself depends on a crate using Menhir, you would have to run cargo install yourself, because cargo packages cannot register that dependency.

Option three was just a regular cargo package. The main problem with that is the generator: it's a cargo package whose build script, build.rs, will invoke the OCaml makefile to build and install the generator, but cargo runs this makefile with a prefix which is a completely obscure location, so the generator will be installed in a completely obscure location, which makes invoking the generator binary very complicated. So the idea was to write a wrapper crate which knows statically, at compile time, the prefix in which the generator is installed, so that it can help you invoke it.

Let's see how that works in practice. When you want to use Menhir in your package, you start by stating that you're going to use a custom build script, and that this custom build script will use Menhir as a dependency. That's what the build-dependencies section of the Cargo.toml file is for: registering dependencies that are needed to run the build script, not to run your package itself. And in that build script you use the menhir crate, which is this wrapper library, and you ask it to process your parser file. That is, you do not explicitly invoke the generator binary on the command line, as you would with most parser generators; you go through a library which does that for you.

On the file system it looks like this: when cargo builds the dependencies of your package, the generator will be installed in that location, with a mangled version number, a version hash. But the wrapper library knows statically, at compile time, the path of its OUT_DIR, so it can reference the binary by name, and then it's all good. The generator will produce a parser output file that is put in your package's own OUT_DIR, which you can reference using the OUT_DIR environment variable, so that you can include the generated file in your code, as you do with most Rust crates that generate code. We'll see later how to actually invoke the parser from the rest of the Rust code.

For now there's a detail that still has to be covered regarding packaging, which is the runtime. There was a problem with the runtime, which is that, since the menhir crate now contains this wrapper library, it has to contain two Rust libraries: one for the wrapper crate and one for the runtime. And cargo was not designed to fit multiple libraries in a single package: in cargo, the notion of package is more or less the same as the notion of crate; a single package is a single crate. So we had to cheat, and by cheating I mean that it's the OCaml makefile which is responsible for compiling and installing the runtime library into the OUT_DIR location.

Okay, now we need cargo to be able to find that runtime library when compiling the generated parser, because this OUT_DIR is absolutely not in rustc's search path by default. But then again, the wrapper library knows the path of the runtime, so we added a second function, which is called cargo_rustc_flags. What this function does, if you were to execute the build script manually, is print something like that on the standard output: it prints the path to the runtime, but prefixed with this little text, cargo:rustc-link-search. Cargo, when running the build script, will interpret this output; it tells it something like "add this directory to the linking search path", and this way cargo is able to compile the generated parser and link it to the runtime.
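A sketch of what such a build script might look like, assuming the wrapper crate is named menhir and exposes the two functions just described; the function names, signatures, and grammar file name are assumptions based on the description above, not a verified API:

```rust
// build.rs (illustrative sketch, not the crate's actual API)
fn main() {
    // Run the Menhir generator, installed in the wrapper crate's own
    // OUT_DIR, on our grammar; the generated parser lands in this
    // package's OUT_DIR.
    menhir::process("src/parser.rsy");

    // Print the `cargo:rustc-link-search=...` line so that cargo can
    // find the runtime library when compiling the generated parser.
    menhir::cargo_rustc_flags();
}
```

The generated file is then pulled into the crate through the OUT_DIR environment variable, the usual pattern for code-generating crates, something like `include!(concat!(env!("OUT_DIR"), "/parser.rs"));` (the file name is assumed).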
So it's a pretty interesting use of cargo here; it's pretty hacky, actually, but it works. We have a project that contains code from different languages. It's quite complicated from the point of view of the maintainer of the package, that is, me. It's somewhat okay from the point of view of the person who directly uses Menhir in their Rust project: it's a bit complicated, but once you know how it works, it's quite easy to get along with. And, this was my biggest concern, it's completely transparent for the indirect user of Menhir, that is, someone who does not use Menhir directly but uses, in their Rust project, a crate which itself uses Menhir. For this programmer it's completely transparent: all they need is to have the OCaml binary somewhere in the PATH. You just install it with your distribution's package manager, and that's all; cargo handles all the rest.

Okay, so what now? Now we have our parser in a submodule and we need to invoke it. This is the lexer interface that Menhir expects, that is, the interface that Menhir expects from a lexer. It looks weird: it does not look like an iterator, it looks like an infinite iterator. You see that the input function can either return an error, or return a token accompanied by its location, but it can never say "stop". This is because Menhir handles what is known in LR parsing as default reductions, which means that sometimes, in a state of the automaton, there is a single action which is possible, which is to reduce; there is no other possible action, so you can perform this action without looking at the lookahead token. The advantage of this is that you can perform a lot of reductions without invoking the lexer, which is a performance benefit. But the positive side effect is that you can now detect the end of the stream: if the final state of the automaton is such a state, a state with a default reduction where only one action is possible, you can detect the end of the stream without requiring any extra lookahead token. Which is a good thing, because if your parser does not need to consume all the input to match the start production, it leaves the rest of the input untouched, for another parsing run, or to give to another parser, or to another part of your program, or whatever. For a command-line parser, for example, it's especially useful, because you can run a whole parsing phase and leave the rest untouched for the next command.

Still, most lexing tools for Rust out there, including my own, RustLex, use a simple iterator interface. So what if I just want to use a simple iterator here? For that, the Menhir runtime crate provides a simple structure called IteratorLexer: you give it an iterator as an argument, and it converts it to the right interface. And if the iterator were ever to return None to indicate the end of the stream, the IteratorLexer would return a lexing error, because since Menhir does not need an extra token to detect the end of the stream, if it reaches the end of the stream while it still needs input, it's an error: you actually needed that input. And then the end is pretty simple: there is a single function generated per parser; you give it a lexer, and you either get a successful parse, or a syntax error, or an error that was reported by the lexer and carried along by Menhir.
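That "infinite iterator" interface might look roughly like this; the trait and method names below are my own reconstruction of what the slide described, not the crate's verified API:

```rust
// Sketch of the lexer interface Menhir expects (names are assumptions).
trait Lexer {
    type Token;
    type Location;
    type Error;

    // Either a token with its location, or a lexing error. There is
    // deliberately no "end of stream" case: Menhir detects the end by
    // itself, thanks to default reductions, so being asked for a token
    // past the end of input is already an error.
    fn input(&mut self) -> Result<(Self::Token, Self::Location), Self::Error>;
}
```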
I also included an example using RustLex. This is a very simple lexer with only five token types: an identifier token for any identifier made of alphanumeric characters, the lambda keyword, the opening and closing parentheses, and the dot. And then there's a little bit of boilerplate code: you just have to chain an iter::once at the end of the tokens, because in our case, in the example I've taken, the grammar would still be ambiguous without an EOF token at the end, so we add it explicitly (a sketch of this wiring follows below). And the enumerate is there to use the indices: my lexer doesn't have proper handling of positions in the input stream, so we just use the index of the token in the stream as an indication of the location, so that if a syntax error is detected, it will be reported as, for example, "at the sixth token there is an error". Okay, so that's it, that's how a Menhir parser works.

I said in the presentation of the talk that I would try to do a comparison of LR parsing, and of Menhir, with the other parsing tools for Rust out there, so we'll try to do that without digging too deeply, while staying as honest as possible. Here is a short, non-exhaustive list of parsing tools for Rust that you may have heard of. There is, for example, nom; I think there was a talk about it here last year, or there was a talk by its author, but I don't remember what it was about. nom is a parser combinator library for Rust.

So let's try to make a comparative list of the differences between LR parsing and combinators. The biggest pro of LR parsing is that it's hard to make a mistake: LR parsing, by design, handles only grammars that are unambiguous, so it will check your grammar during the construction of the parser to verify that it does not contain any conflicts. Whereas with combinators, it's easy to accidentally write an ambiguous grammar, and you will only notice it when you debug some weird example that does not parse the way you think it should; that's how you will notice that your grammar is ambiguous, and sometimes you will have to rewrite the entire grammar in order to get rid of that ambiguity. So that's the big advantage of LR parsing over combinators. The other is that LR parsers are very fast: they are always linear in the size of the input, whereas parser combinators can, with certain pathological grammars and in the presence of certain pathological inputs, trigger exponential behavior at runtime; it can happen because of the top-down left recursion in parser combinators.

On the other hand, the biggest disadvantage of LR parsing is that combinators are much, much easier to write. LR parsers are pretty easy to write if you are used to writing BNF grammars and that kind of thing, but still. And what's more, parser combinators are quite easy to debug, whereas with LR parsers, when you have a conflict, it usually requires that you understand how LR parsing works under the hood in order to be able to debug the conflict. Menhir tries to make it easier by explaining the conflicts in terms of the grammar rather than in terms of the automaton, but it still requires a bit of thinking to correct them properly. And last and most important, it's pretty hard with LR parsing to properly handle runtime syntax errors. That is, the parser is correct, but you feed it an input, for example a compiler compiling a program which is syntactically wrong, you provide it an input which is malformed: the parser should report to the user an error which is as precise as possible about where the error occurred and why, and LR parsers make it very hard to do that properly.
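For reference, the token-stream wiring from that example might look roughly like this; the token names are illustrative, and the IteratorLexer constructor in the comment is an assumption based on the description above:

```rust
use std::iter;

// The example's five token types, plus the explicit EOF marker.
#[derive(Clone, Copy, Debug)]
enum Token { Ident, Lambda, LPar, RPar, Dot, Eof }

fn main() {
    let tokens = vec![Token::Lambda, Token::Ident, Token::Dot, Token::Ident];

    // Append the explicit EOF token the grammar needs, and use each
    // token's index in the stream as a crude location.
    let stream = tokens
        .into_iter()
        .chain(iter::once(Token::Eof))
        .enumerate()
        .map(|(pos, tok)| (tok, pos));

    // Hypothetical runtime calls, following the talk's description:
    // let mut lexer = menhir_runtime::IteratorLexer::new(stream);
    // let result = parser::main(&mut lexer);
    for (tok, pos) in stream {
        println!("{:?} at token #{}", tok, pos);
    }
}
```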
I will try to give a short introduction to the solution that Menhir proposes to this problem. Menhir provides a flag, --list-errors. By the way, another handy feature of the cargo package for Menhir is a link_binary function that you can use in your build.rs script, and which will create a symlink to the generator binary at the root of your source tree. This is convenient during the development of your parser, because you can iterate quickly, try new flags, and make changes to the grammar without having to edit the build script and run the whole cargo process every time.

So, if you use this --list-errors flag, Menhir will generate a file containing the description of all the possible states in which an error could ever happen. It looks like that; well, this is a single entry in that file, and I definitely don't have the time to explain in detail how these descriptions work, but in short, each entry contains a description of a kind of input that could lead to a certain class of errors, and the user, by which I mean the user of Menhir, the programmer writing a Menhir parser, can write detailed error messages in this file. Then you give that file back to Menhir with another flag, which is --compile-errors; you can also invoke it through the wrapper library, with the compile_errors function. And it will produce another Rust code file, which you include right next to your parser file, and it generates Rust code such that when you get a syntax error in the end, you get an error object, and this error object has an as_str method which returns the exact error message that was written in the error file. It overflows the slide, but here we see that the parser returns "expected an identifier", which was the error message that we wrote in the error file. So that's the mechanism that Menhir exposes to handle runtime errors. It takes a while to get used to, but it's very powerful.

Let's continue with our comparative list. Another tool that you have probably heard of is LALRPOP, which is written by Niko Matsakis. It's a full LR parser, yes, not LALR; you can ask him what the fuck is up with that name. So it's, just like Menhir, a full LR(1) parser generator, and the big advantage it has over Menhir is that it is way better integrated with the Rust environment. It has a nicer syntax, which resembles the Rust syntax much more closely than Menhir's syntax, which is basically the same as yacc's. It gives nicer error messages, which integrate nicely with cargo. It has nice reporting of conflicts, which I believe is the same as Menhir's, but then again with nicer, better-presented messages. It has macros, which allow you to abstract away repetitive fragments of a grammar, and which, as far as I understand, have more or less the same goal and expressivity as Menhir's parameterized non-terminals. Those mostly work in the current implementation of the Rust backend, that is, they work, but they require a lot and a lot of type annotations, which makes them a bit painful to use, whereas LALRPOP has type inference, which we lack in Menhir. In short, LALRPOP is designed to be a parser generator for Rust, and because of that it's more comfortable to use in a Rust setting than Menhir.
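To make the shape of this concrete, consuming the parse result with compiled error messages might look roughly like this; the enum, variant, and function names are illustrative assumptions, and only the as_str method and the three possible outcomes come from the talk:

```rust
// Illustrative sketch; the real generated API may differ.
enum ParseResult<T, LexError> {
    Success(T),
    SyntaxError(SyntaxError),
    LexerError(LexError),
}

struct SyntaxError { state: usize }

impl SyntaxError {
    // With --compile-errors, the generated code maps the automaton state
    // in which the parser failed to the hand-written message from the
    // error file.
    fn as_str(&self) -> &'static str {
        match self.state {
            3 => "expected an identifier",
            _ => "syntax error",
        }
    }
}

fn report<T>(result: ParseResult<T, String>) {
    match result {
        ParseResult::Success(_) => println!("parse succeeded"),
        ParseResult::SyntaxError(e) => eprintln!("syntax error: {}", e.as_str()),
        ParseResult::LexerError(e) => eprintln!("lexing error: {}", e),
    }
}
```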
Okay, so there are a few other features that I didn't have the time to dig into, but that I will mention anyway. They mostly don't work at the moment, but that's what we're looking at right now.

Menhir, as in the OCaml version of Menhir, has an incremental API. That is, instead of exposing just a single function that you call, which either gives you a successful result or gives you an error, it exposes all the steps of the parsing process, and at each step you can inspect, and even modify, the internal state of the parser. This is useful because it allows you to implement specific error-recovery strategies that are specialized to the language you're trying to parse, whereas the default error-recovery mechanism of LR parsers is not very flexible and not well adapted to the variety of languages that you might want to parse with an LR parser. This incremental API is implemented in the Rust backend right now, but it's not exposed, because it exposes unsafe internals of the backend, and if you were to use those functions incorrectly, you could trigger undefined behavior or make the parser crash. This should work eventually; in OCaml, again, they use GADTs to make sure this interface is used correctly, but in Rust we still have to figure out a way to make it work safely.

And libraries: Menhir allows you to write a portion of grammar definitions once and for all, in a library that you can reuse in all of your parsers. It should work, but it needs porting, because the standard library of Menhir contains semantic actions that are written in OCaml and will not work with a Rust parser.

And there are probably a whole lot of other issues. Notably, lexical conventions: if you know the yacc syntax, you know that the type annotations for terminals and non-terminals are put inside angle brackets, and the fun thing is that the lexer of Menhir is completely stupid: it relies on the assumption that a valid OCaml type cannot contain any angle brackets, so in order to parse the type annotation, it just takes everything up to the first closing bracket. But Rust types can contain angle brackets, because of generic types, so if you have a type like Box<Expr>, Menhir will interpret "Box<Expr", without the closing bracket, as being the type, and it will fail. You can work around it by using a type alias, but it's cumbersome. You also get case warnings, because Menhir expects non-terminals not to be capitalized, but each of those ends up as a name in the produced Rust code that Rust expects to be capitalized; so you get either a Menhir warning or a Rust warning, depending on how you choose your names. There is the useless tagging of stack cells, the thing I was talking about earlier, the useless dynamic type checks when retrieving and pushing values back and forth on the LR stack; this should be done with unsafe code, or with some kind of typing. I don't know exactly how we could use the Rust type system to do that, but maybe it's possible. There are features like splitting a grammar into multiple files which are not supported, and probably a lot of other bugs that I'm not even aware of. So use it, try it, and find and report those bugs. Thank you for listening.