Thank you so much, we're going to continue again, round of applause for you. Thank you. So I'm here today to testify that writing a shell parser is nothing other than a trip to hell, okay? And I will also try to show you a strategy to survive this. But before that, I just want to give you some context.

Ralf Treinen, who is at the back of the room, recently launched a research project. His purpose is to verify the package scripts of Debian. And that's very important because, as you know, these scripts are critical pieces of software, okay? They build our software, so they must be verified. We really need strong guarantees that they will not break our systems, okay? So we want to implement a static analyzer for these scripts. The problem is that they are written in POSIX shell. So the very first thing you have to do is to write a parser for POSIX shell. And this is when the nightmare begins, okay? Actually, we are interested not only in writing a POSIX shell parser, but in writing a parser that you can trust, because we want to use it in a verification tool, okay?

So if you want to do things right, you open a book and listen to the wisdom of your ancestors. And they said to us that parsing is a difficult programming task, okay? So you really have to decompose it into simpler programming tasks. And they gave us this beautiful architecture. They said: hey, first implement a lexical analyzer that will turn your characters into a stream of tokens. And then implement a parser that will recognize this stream of tokens if it complies with the grammar of your language, and it will give you a beautiful parse tree, okay? And they worked a lot on that, so much so that they designed very nice declarative languages to build these boxes. Lex and Yacc specifications are very declarative. They are very close to the specification. You can trust them.
And moreover, you can use a code generator to get code that will be correct with respect to this specification. So that's a very beautiful architecture that we will never forget.

So you open the specification of the shell language. And at first sight, you are happy, because you find in there a Yacc grammar. Hooray, I will be able to implement a parser easily! But actually, that's not really a Yacc grammar. It is annotated with very weird side conditions that are out of reach of the expressiveness of LR parsers. And you go further into the specification and you understand that it is low level, unconventional, informal. But you can't blame the people of this group, okay? Because the real problem, the actual truth, is that the shell language is a horror. It's something terrible. Okay, if you start to understand this language, you really understand that lexical analysis is parsing-dependent and nesting-dependent, that actually the whole syntactic analysis problem is undecidable and ambiguous. And in addition to that, you have a lot of irregularities in the language. Okay, I will try to convince you that what I've just said is true, using some examples.

First, in usual lexical conventions, tokens are specified positively using regular expressions. You have a bunch of regular expressions and you use the usual longest-match strategy to recognize tokens in your input. In this specification, what is specified is not the tokens, but what is between the tokens: how they are delimited. Okay, fine, we'll see if that's a problem. And then it's actually not really token recognition, it's more pre-token recognition: some form of token that is not the token that the grammar is using. So you will have to turn them into real tokens at some point, which is an additional source of complexity.
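To make the delimiting rule concrete, here is a small illustrative fragment (my own example, not the one from the talk's slides): spaces inside double quotes or inside a command substitution do not delimit, so a fragment containing several spaces can still be lexed as a single WORD token.

```shell
# The spaces on this line do not delimit anything: one is protected by
# double quotes and the other sits inside a command substitution, so the
# shell lexer produces a single WORD token for the whole fragment.
word=a"b c"$(printf '%s' d)
printf '%s\n' "$word"   # prints: ab cd
```

A lexer based purely on positive regular expressions and longest match would struggle here; the delimiting rules force the lexer to track quoting and nesting state.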
And if you look at the usual, easy part, newline characters, they are actually very difficult to handle, because their interpretation depends on the parsing context too. And maybe the most difficult monsters are the escape sequences, because they really depend on very complex nestings of subshells and double quotes.

So let me show you an example, first about token recognition. In any sane language, the first line would be recognized as, I don't know, maybe five different tokens, right? In shell, that's just one token. Okay, fine. On the third line, there are actually four words and one operator, because you have spaces here and there. Oh no, not that one, because it is between quotes. Okay, okay. Oh no, not that one, because it's between parentheses. Okay, okay. And two other spaces here, so they delimit the input. Okay, I'm not afraid of that, because the good news is that this stuff can be expressed using a Lex specification, not in the usual style, but still. It's feasible. So okay, I'm not afraid for now.

Let's talk about newlines now. Actually, newlines have four different interpretations in shell. Okay, four different interpretations. On line one, the newline is actually a token: it's the equivalent of a semicolon. It's the one that terminates the sequence over which you are iterating. On the second line, the newline is part of the command, so it is ignored. Okay. On the third line, it is preceded by a backslash, so it's a line continuation. Okay. And finally, on line five, it's the end-of-phrase marker. So you have five, no, four, sorry, different meanings for newline. And when you are designing your lexical analysis, you have to have a very smart logic to understand which newline characters must be ignored and which ones should be transmitted to the parser. Okay, okay. It's a bit complicated. Maybe you want to escape. But before that, I have a quiz for you. Who knows which of these commands will output two backslashes? Okay.
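Since the slide itself is not reproduced in this transcript, here is a small reconstruction (my own) of a script exhibiting the different newline interpretations:

```shell
for i in 0 1            # newline here is a token: it terminates the word
                        # list, playing the same role as ';'
do printf '%s\n' "$i"   # newline here merely ends the simple command
done
printf '%s %s\n' one \
two                     # backslash-newline is a line continuation: the two
                        # physical lines form one logical line

# And a newline after a complete command, like the one above, is simply
# the end-of-phrase marker.
```

The lexer cannot decide between these cases locally: it needs to know what the parser is currently expecting.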
Raise your hand if you think it's the first one. Okay. The second one. Okay. The third one. Okay. You are an expert. It's the third one. What? Six backslashes? I don't understand why. I can explain you why this is the answer, but I would need one hour to do that, so I won't. But now take this escape sequence and put it between backquotes. That shouldn't change the semantics of this escape sequence, of course. Oh no, it does. So that's what I meant by the fact that escaping depends on the nesting of subshells and double quotes. Okay.

Now let's talk about this promotion from pre-tokens to tokens. At some point you have a word and you want to turn it into a reserved word. Okay. And actually this promotion depends on the parsing context. What? Oh yes, you're right. When you write the first line here, "for" is at a position where "for" is expected to start a loop. Okay. But of course, when "for" is just an argument of a command, it stays an ordinary word: you can have a file named "for". So it means that a word will be promoted to a reserved word only if the parser expects this reserved word at that position. Okay. Fine. But actually it's a bit more complicated than that, because sometimes, if you have a reserved word where it is not expected, you don't get a word back: it is forbidden. So you have a lot of irregularities like that, which constrain the parser with ad hoc side conditions. Okay. Fine.

But I have something that will just kill you now. Syntactic analysis isn't decidable, because of aliases. An alias is a command that works like a preprocessor, if you want. It's a macro, okay, that is expanded just before syntactic analysis. So it means that if you write a test like this, then depending on the result of the execution, "mystery" will be defined or not, and the program will be syntactically correct or not. That's not your choice. So yes, I've hidden that from you until now, but actually parsing depends on evaluation. Okay.
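For the curious, here is one concrete instance (my own example, not the exact quiz from the slides) of how backquotes change the meaning of backslashes: backquote command substitution consumes one extra level of backslash processing, so the same output suddenly needs twice as many backslashes.

```shell
# Unquoted, each pair '\\' collapses to a single backslash, so four
# backslashes print as two:
printf '%s\n' \\\\
# Inside backquotes, the substitution strips one extra level of
# backslashes before the inner shell even parses the command, so eight
# backslashes are needed to print the same two:
printf '%s\n' `printf '%s' \\\\\\\\`
```

Both commands print exactly two backslashes; wrapping the same sequence in further layers of backquotes or double quotes changes the count again, which is why escaping depends on the whole nesting context.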
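The promotion of a word to a reserved word can be seen in a short example like this one (again my own reconstruction of the slide's idea):

```shell
# Here 'for' appears at a command position, where the parser expects a
# reserved word, so it is promoted and starts a loop:
for f in one two; do printf '%s\n' "$f"; done
# Here 'for' follows a redirection operator, where no reserved word is
# expected, so it remains an ordinary word: a perfectly legal file name.
printf 'hello\n' > for
cat for
rm for
```

The same character sequence is a keyword in one position and a plain word in another, so the lexer must ask the parser what it expects before deciding.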
Because this language is designed to be used in a loop: you parse, you evaluate, you parse again, and so on. But when you have to parse a script file statically, you would have to evaluate the script. Okay.

So at this point, you may wonder if it is even possible to write a shell parser. And of course it is. There exist in this world programmers with very large skulls, okay, and these guys are able to deal with that kind of complexity. For instance, if you open the source code of dash, you will find 1,600 lines of handcrafted C, okay, that do the parsing with their bare hands. They don't use code generators or anything like the architecture you find in the books. And if you look at bash, it's based on a Yacc grammar, which is totally different from the one you can find in the standard, but it is also extended with 5,000 lines of C. Okay. And you get something like that. I don't want you to read that, but that's just one special case handled in the code of dash. Okay. You have this very complex control flow, global variables, bit masks, and no backtracking, but sometimes something that looks like it. I don't have a skull large enough to be able to maintain that kind of code and to trust it. Okay.

So that's not the way we followed. We tried to keep the traditional architecture and to just make a variation on top of it, because we really want our parser to be modular and to use code generators as much as possible. Okay. So we have something a bit different, in which you have a prelexer that is implemented using something like Lex, and a parser which is generated by another code generator, called Menhir. And we implemented the parsing engine as an interaction between the lexer and the parser: the lexer transmits tokens to the parser, but the parser is also able to transmit its state to the lexer. Okay. What does that mean? Well, first we are very proud of something.
We were able to take the Yacc grammar of the standard, cut and paste it into our development, and build our code around it. So we are very confident that we are capturing what the people who wrote the specification were trying to convey. As I've said, our prelexer is generated from a standard OCamllex specification, but the key ingredient that makes our parser modular is in fact a feature of Menhir, which provides us with purely functional and incremental parsers. And thanks to that feature of Menhir, we are able to implement several key parsing techniques in our parser. The first two we get thanks to the fact that the parsers are purely functional. Speculative parsing is the idea that you create a parser that will go into the future, into the remainder of your input, and maybe crash; but you don't care: it will die, but just before it dies, it will give you some information about the future, so that your real parser will be able to exploit this information. And longest-prefix parsing is the other way around: when at some point you hit an error, you are able to roll back to a parser of the past, to backtrack on your choice. And all this is feasible because the parsers are purely functional. Also, our parsers are incremental, which means that we can call them and they will just do one step of the parsing analysis and then hand control back to our code, so that we can, for instance, implement lexers which are functions parameterized by the state of the parser; and also, at every state of the parsing analysis, we can insert some introspection of the parser state in order to deal with the side conditions I was talking about in a modular way, without polluting the rest of the specification. Okay? So I don't have enough time to show you the code for these parsing techniques, but what I can show you now is the shape of the interfaces of the parsers that are generated by Menhir.
So usually, if you use any parser generator, Bison for instance, you will get a parsing function like this. It takes a lexer, or something like a lexer, and when you invoke this function, it takes control and goes through the parsing analysis in one shot, until it is able to reach the end of the input and produce an abstract syntax tree. So you can't interrupt it. In Menhir, what you get when you use these interfaces is instead a checkpoint, which encapsulates the state of the parser. Okay? And this state is actually an algebraic data type. So you have several cases that classify the shapes of the state, and you also have an environment that contains everything that is needed for the parser to execute. Okay? Thanks to that, you can interact with the parser using only two functions. The first one is the offer function, which takes a parser and a token and produces a new parser that takes this token into account, without destroying the previous one. That's the whole point of purely functional programming. And you also have this other function, resume, which just says to the parser: okay, continue one step further. Okay? And those are the only two functions that you have to learn to use this system, which is quite simple. Okay? So it's easy to implement backtracking with this system, because you can keep as many copies of your parser states as you want. Okay?

So, to conclude: we implemented this standalone program called Morbig, which turns your shell scripts into syntax trees, represented in JSON. We are able to parse all the scripts of our corpus, which is nice, and it is pretty efficient too. So you may ask: do we trust Morbig yet? Of course not. I'm pretty sure that there are a lot of bugs inside. Okay?
And what we want is to be able to refactor the code and to write it in such a way that the mapping between the specification and the code is as clear as possible, so that experts can read the code and assess that we are using the right interpretation of this specification. So thank you for your attention. I hope you will not have too many nightmares tonight, and we really need bug reports. So please use this repository and give us some feedback. Thank you.

One question. Yes? "Why do you use a scannerful parsing algorithm like LR, and not a scannerless parsing algorithm?" Yeah, I understand your question. They ask me to repeat the question, that's my obligation, okay? So the question is: why do we use a separate lexer and parser, and not a scannerless approach, in which you have the lexical analysis inside your grammar specification? This is the case for ASF+SDF, for instance; I know this framework, in which you use that kind of specification. Actually, we really want to use a Yacc specification, because that's the format in which the specification is written, I mean the official specification. That's one point. And also, I really like my tools. Okay. Thank you. So...