Okay, yes. Hello. So together with Yann, we are going to talk about the problem of parsing shell scripts. In my part I would like to explain to you why we are interested in this problem, and then Yann will explain what we have experienced on our journey so far.

So what we are doing is a research project over five years in which we are trying to analyze Debian maintainer scripts; Debian is part of the research project. It's important to recall what maintainer scripts are, though most of you know: these are the preinst, postinst, prerm and postrm scripts which may exist in binary packages. A package may have none of them, all four of them, or any combination.

And we think it is very important to analyze the scripts we have in these packages. Why? Well, first of all, they are executed as root every time you install a package, remove a package, or upgrade one. Alex already talked this morning about what can happen when you run arbitrary shell code; well, this stuff is executed as root on every machine where Debian is installed. So obviously we should be quite sure that these scripts are not doing anything stupid.

Part of the problem is that these scripts may be executed in different contexts, where a context means the collection of packages currently installed on your machine. What these scripts do may depend on what is currently installed. That is part of the reason why these scripts exist at all: they do the things you cannot put into the package itself because they are not static; they depend on the particular situation of your machine, and in all situations the scripts have to do the right thing, and nothing wrong.

Another source of problems is the fact that packages do not live in isolated silos. We have packages which share infrastructure with other packages. Think, for instance, of TeX, or of Emacs. Once you install a package which contains, say, a TeX add-on or Emacs macros, these are compiled or installed into directories which were installed by other packages. So packages do not live in isolation. And we think we need automated tools to analyze these scripts; this obviously cannot be done by hand alone.

Just to give you an idea of the kind of problem we are hoping to be able to find automatically at the end of the project (I cannot promise that this will be the case): if you are on the QA mailing list, maybe you remember this bug report from two months ago. Someone complained that he could not remove the sendmail-base package because there was a bug in the postrm script. The thing I would like you to notice here is that this particular version of the package had been in the archive for more than two months, and the bug is at least as old as that, probably older. And it is not a rare package: it has a popcon of almost 3,000. So one might wonder why this bug had not been found and reported before, because it seems to be quite an obvious problem. The reason is that the bug occurred only in a very particular situation. This was the offending line in the postrm script: a find invocation with the flag -size 0.
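The line itself is not quoted in the transcript; a hedged reconstruction of the pattern being described, with an invented path, would look like this:

    # hypothetical path; the real postrm searched some sendmail directory,
    # presumably to clean up empty files left behind by the package
    find /var/lib/sendmail -size 0 | xargs -r rm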
And all the files found by this find invocation would then be removed by the xargs -r rm; the -r here is an option to xargs, not to rm. Obviously the maintainer was assuming that a directory can never be of size 0, because in that case the rm would fail: it will not remove a directory. And indeed, when you try it on your machine, most of you have an ext file system, and you will find that a directory always has a size of at least one block. So one can understand why one would make this assumption. However, it is not always true. The unfortunate user who experienced this bug had his /etc on a Btrfs file system, which is a copy-on-write file system, and it turns out that on this particular file system a newly created directory can have size 0. This explains why, two months after the package entered the archive, he stumbled on this particular bug, which, once someone found out what had happened, was of course quite easy to find. So the point I would like to make here is that testing is not enough. This is something you would probably not find by testing: I think piuparts had tried to install and remove the package, and obviously the bug was not found, because it occurs only in a very, very particular situation.

Okay, so we are trying to analyze the maintainer scripts. This is a somewhat old analysis I did at the end of 2016. At that time we had almost 32,000 maintainer scripts in the archive, and already more than 50,000 binary packages. The vast, vast majority of these scripts are POSIX shell scripts. There are about 300 bash scripts, which really have #!/bin/bash written in them (they probably should be POSIX shell anyway), a few scripts in other languages, and even one ELF executable, but almost all of them are POSIX shell scripts. So at the beginning of our project, the first building block of our tool chain was to be a front end which reads POSIX shell scripts and translates them into a syntax tree; on top of this we would then build all the formal analysis machinery which would be able to find the bugs. Our first step was therefore to build this parser, and at the beginning of the project, honestly, we thought this would not be too difficult. We expected to have the front end after maybe one year of the project, together with a nice specification of UNIX commands and everything. It turned out that it took us much longer than that, and Yann will now explain to you why this was so difficult.

Okay, so let's pass this on. We don't have the same head, I think. Okay, does it work? Okay, cool. So thank you, Ralf, for letting them believe that they would be attending a regular technical talk, so that they are confident and calm for the moment. As Ralf said, the first goal of our project is to write a POSIX shell parser. But as this parser will be integrated into a static analyzer, we really have to be confident in its implementation. We have to trust it, because otherwise we can't trust the final results of the entire tool chain. So the question I will address in this talk is: how do you write a POSIX shell parser that you can trust? But actually, the real message of this talk is that if you start answering this question, you start a journey into a world which is not that different from hell. I will try to explain why during this talk.
But let's start at the beginning. When you know that you are in for a difficult journey, you first try to be prepared. So you open a book and you listen to the wisdom of your ancestors, because, you know, for 50 years a lot of computer scientists have studied this problem of parsing, and they left us a very important message: parsing is sometimes a very difficult programming task, and because it is a difficult task, it has to be decomposed into simpler tasks. And they gave us a very beautiful architecture.

It says: if you want to build a syntactic analyzer, you first start with a lexical analyzer, a lexer, which turns your stream of characters into a stream of tokens, a stream of what is really relevant in your input. Then you write a parser that takes this sequence of tokens, recognizes some structure inside it using a grammar, and builds for you, if the input is syntactically correct, an abstract syntax tree. That is the beautiful architecture that has been around for 50 years now. And it has been studied so well that people came up with very nice declarative languages, you know, lex specifications and BNF specifications, that let you define lexical analyzers and parsers in a very declarative way. Furthermore, they gave us a way to turn these very high-level specifications into code. Just as you use a compiler and trust it to turn your high-level language into assembly code, you only have to reason about these specifications and let the code generators do their work to give you code that you can trust. So it is a very beautiful framework, and when you have to write a compiler, you really want to use it.

So of course you start with the classical architecture, and you open another book, the POSIX shell specification, which is published by the Open Group. And just looking at it in a very high-level way, you first discover a yacc grammar inside. It's a real moment of happiness. You say: oh, the job is almost done, there is a yacc grammar inside, I will just cut and paste it into my code generator and then I will go on holiday. But actually it is not exactly a yacc specification. It is a yacc specification annotated with side conditions which are totally out of reach, in terms of expressiveness, of what you can recognize with LR parsers. So you can't really use this yacc specification, at least not directly.

So you have to get into the details, to actually read the text, and then you understand more and more that the specification is really low-level, sometimes contradictory, unconventional (you've never seen anything like it before) and also informal. Actually you can't really blame the people who wrote this specification, because the truth is that the language itself is an absolute horror. It's a monster. It's a world of suffering, if you want to understand precisely what it is. Why? I will explain during this talk, but basically: lexical analysis is parsing-dependent, so you cannot see the process as a simple composition of lexical analysis followed by parsing; interactions are needed between the two. The grammar is actually ambiguous, and in the general case it is even undecidable to write a static parser for shell scripts. In addition to that, the specification is full of irregularities.
You have a lot of special cases everywhere. So I will try to give you some examples showing that what I've just said is true, and to start with, let's consider token recognition. If you are a compiler writer, you are used to a definition of lexical analysis in which tokens are defined positively: you are given a regular expression for identifiers, a list of the keywords, regular expressions for the literals, and so on and so forth. That is very nice, because you can just cut and paste the regular expressions into your lex generator and use the usual longest-match strategy of the lex tool to define your lexical analysis. In shell it is totally different; it is the other way around. Tokens are not defined positively but negatively, in the sense that what you are told is how tokens are delimited, what lies between the tokens. That is not really difficult, it is just unconventional. Fortunately you can still use a lex specification to define that, and it kind of works.

But there is another difficulty. Normally, lexical analysis is defined, as I said, as a function that takes your characters and produces the tokens that will be consumed by the grammar; the grammar is expressed in terms of these tokens. In the case of the shell specification, that is not true. What you have after token recognition is not really tokens; it is what I would call pre-tokens, a classification of text chunks into two categories, words and operators. So you have to do some post-processing to get actual tokens, but I will talk about that later.

Also, there are normally some easy parts when you write a lexical analyzer. Typically, newlines are just ignored, comments are easy, and escapes are just, you know, a backslash and something else. In the case of shell scripts, everything is complicated, even the semantics of newlines; I will come back to this with an example in a few minutes. But first let's consider this example about token recognition. In any sane language, on the first line you would detect, I would say, five tokens. In the case of shell, it is just one token. But okay, fine, that is how it is specified; let's continue. On line two, well, you have to separate tokens by looking at spaces. So here you have a space, okay. Here you have another one. Here you have... oh, wait, no, this one is between double quotes, so it is not a space that delimits tokens. Fine. And you have another space. Oh, here, no, this one is not fine either, because this space is inside a subshell, so it does not delimit pre-tokens either. Here you have another space which does delimit tokens, and here a final one. So here you actually have five tokens.

What I've just said means that you have to write your lexical analysis with some notion of context, to decide whether a given space delimits tokens or not. That's fine, you can do it; for the moment you just end up with complex lex specifications. But if it is complex, it is not simple, and I really love to trust simple things, not complex ones. Still, I could trust that part if I stared at it for a very long time. But now let's consider newlines. Actually, in shell scripts, you have four different interpretations of newlines. Four different interpretations of newlines. Let me repeat it: four different interpretations of newlines. It's totally crazy, okay? So let's consider this example.
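The script on the slide is not reproduced in the transcript; a rough sketch in the same spirit, with the four roles of the newline marked in comments (the file names are invented), would be:

    for f in a b c        # this newline is a token of the grammar: it ends the word list
    do
        # the newline ending this comment-only line means nothing to the grammar
        rm -f "$f" \
              "$f.bak"    # backslash-newline is a pure line continuation, handled lexically
    done                  # and this final newline is again a token, ending the complete command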
On the first line, for instance, you have a newline at the end of the line, and it is meaningful in terms of the grammar. It is actually a token that appears in the grammar, so you have to convey it to the parser, and it is important: it delimits the list of words you are iterating over. Then, of course, you have newlines in comments, which must be ignored. You also have this newline preceded by a backslash: that is a line continuation, just a way to continue on the next line, and it is purely lexical, the grammar does not ask anything about it. And finally, the newline at the very end is an end-of-phrase marker, so it is another token again. This means you have to define some very smart logic in your lexical analysis to understand which newlines are to be ignored, which newlines are tokens, and which token is the right one for a given newline character.

At this point you may be a bit frightened, so you may want to escape. I don't know, maybe you want to escape. Before that, I have a quiz for you. You know dash; dash is the Debian shell, so you are all experts on dash, right? So you can tell me, with no computer, which one of these lines outputs two backslashes to the standard output. Please raise your hand if you think the first one outputs two backslashes. Don't be shy. Thank you. The second one? I'm not counting. The third one? Yes, you are the expert in the room. I won't explain why today, because I don't have an hour; it doesn't actually take an hour, but it is a long explanation after all. But yes, you need six backslashes to output two backslashes. Okay? In dash. Yes, it is in POSIX shell, actually; it is compliant with POSIX. In dash, yes; not in bash. Bash, and /bin/echo, also do not have the right behavior with respect to POSIX. Or rather, without respect. Okay. But now imagine that you put backquotes around it. That shouldn't change anything, right? It's just running a subshell. So what happens here? A syntax error. And this one really does take an hour, and we are not entirely sure how to explain it. Exactly. That is what I meant when I said earlier that escaping depends on the nesting of subshells and double quotes. Nobody wants to leave the room? I mean, okay. So I can continue.

I said earlier that after the lexical analysis what you get are pre-tokens, and what the grammar needs are tokens. So at some point you have to take words and promote them to keywords or to assignment words. I won't talk about assignment words, because I don't have time for that. What I want to say is that this promotion depends on the parsing context. It will be clear with this example. I'm not the author of this example; Ralf Treinen, with his twisted mind, devised it. It may look a bit weird, but it illustrates nicely that something which looks like a keyword can be interpreted as something that is not a keyword, depending on its place in the input. For instance, this first 'for' here is, of course, a reserved word: you are starting a loop. But this one is just a word that is part of the list you are iterating over. And the same goes for 'in' and 'do' and so on. That is a syntactically correct program, by the way. Of course.
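Again, the slide is not reproduced in the transcript; a hedged reconstruction of a program in the same spirit, which dash and bash both appear to accept, is:

    # only the first 'for' and the first 'in' are reserved words here; the later
    # 'for', 'in' and 'do' are ordinary words in the list being iterated over
    for for in for in do
    do
        echo "$for"
    done
    # expected output: the three words "for", "in" and "do", one per line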
So when you are in your lexical analyzer, you have to consult the parsing context, to introspect it, in order to decide whether you may promote a word into a reserved word. But not all the time: some words that could be interpreted as reserved words, like 'else', cannot appear at certain very specific positions. So you have a lot of irregularities in the specification, a lot of special corner cases everywhere in this very dark world of the POSIX shell language. By the way, never name one of your tools 'else': a user will never be able to call it, at least not without the full path. So it means that in your parser you will have ad hoc side conditions everywhere to cope with all the irregularities of the language.

At this point I see some eyes that are a bit frightened, but not that much; you've already seen a lot in your programmer life. But I have something that is, I don't know, like the final boss, the one that will kill you if you don't protect yourself. So I want to warn you before I show it. Are you prepared for this last example? The icing on the cake: static parsing is undecidable. You can't write a static parser for shell scripts. Why? Because of the presence of aliases. An alias is like a macro, if you want, some form of macro definition: what you define with an alias is a word that is substituted by some string just before syntactic analysis is applied. In this example it means that, depending on the exit status of foo, this script is syntactically correct or not. That's not your choice; it's foo's choice. So you can't decide. You may say this is normal, because usually shell scripts are syntactically analyzed and evaluated phrase by phrase, so the shell is perfectly able to evaluate this program, using bash or something like that. But in our case we really have to be static; we cannot evaluate the scripts. So we cannot even say whether the script is syntactically correct or not. Okay, you are pretty calm; maybe it's because we are just after lunch. So yes, parsing depends on evaluation, and at this point you may wonder whether it is even possible to write a shell parser.

But of course it is possible, because there are very smart people in the world who were able to tackle this problem. If you look at the parser of dash, it is 1,600 lines of hand-crafted, character-level C. These guys took the textbook, not to open it and use it, but as a shield, as a weapon. They don't use the textbook as a textbook; they just implemented a parser with their bare hands. They are heroes. The same is true for bash, which uses its own grammar, quite different from the one in the specification, plus an extra 5,000 lines of C; but that is for bash, which is a larger language than POSIX shell. So what you get is code like this; this is a glimpse of the dash parser. I don't expect you to read it and understand it, but the point is that what you get is very low-level parsing code using bitmasks, global variables, character-level parsing, some form of backtracking at some points, and so on. It is really impressive to be able to write this code down and get it correct. I have the deepest respect for these programmers, and I don't have a brain as large as theirs. So personally I can't maintain that kind of code. I can't trust it, because I can't maintain it; I can't understand it.
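Coming back to the alias example from a moment ago, which is also not reproduced in the transcript: a hedged sketch of the same phenomenon, assuming a strictly POSIX shell such as dash (bash disables alias expansion in non-interactive shells by default) and a hypothetical command some-test, could be:

    # aliases are substituted textually just before parsing, but only once
    # the 'alias' command has actually been executed at run time
    if some-test                 # hypothetical command; its exit status is unknown statically
    then
        alias foo='for'
    else
        alias foo='echo'
    fi
    foo x in a b; do echo "$x"; done
    # if foo was aliased to 'for', the last line is a perfectly good for loop;
    # if it was aliased to 'echo', the very same line is a syntax error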
My brain is simply too small for that kind of hand-crafted parser, okay? So I have to find a way to write a parser that is POSIX-compliant and also simple in some sense, so that with my little brain I can handle its complexity locally, piece by piece. That is why we went back to the books and to research articles and tried to do some more advanced magic in terms of a modular architecture for parsing. What we do is a variant of the standard parser architecture that I presented earlier, and the idea is to use code generators as much as possible, because we want code that is as high-level as possible, and to orchestrate the interaction between the generated pieces in such a way that this orchestration can easily be mapped back to the specification. So we have an architecture with a prelexer, written as a lex specification as usual, which produces pre-tokens, and then two modules which interact: the lexer provides tokens to the parser, and the parser provides its state to the lexer for introspection.

What makes this possible? The fact that we are using some special technology here. First of all, we use a yacc-style parser generator called Menhir, which makes it possible to simply take the grammar of the standard, cut and paste it into our code, and then build all the complex interactions needed to deal with the complexity of parsing around this specification, outside of it. We are very proud of this: we stay very close to the standard, because we start from the specification in the standard. And the key ingredient that makes it work is that we use purely functional and incremental parsers, which let us apply some fairly advanced parsing techniques: speculative parsing, longest-prefix parsing, parameterized lexers and parser state introspection. I can't go into all the details here, but the idea of speculative parsing is that you don't have a single parser; you actually have many soldiers, and sometimes you take one parser and ask it to go into the future, to read a little more of the input. Eventually this parser may die, and just before it dies it sends you a message describing what it has seen. Using this information, you can decide what to do in your actual parser. We can do that because we are purely functional: we are stateless, and we share a lot between all the different checkpoints of the parsing process. I think I will skip the other ones.

I can't show you the code today, but what I can show you is the difference between what you get with our generator, Menhir, and the standard output you get with Bison, for instance. If you give Bison a yacc specification, what you get is basically a function that takes a lexer and, when you give it control, consumes the entire input to produce a complete abstract syntax tree, if the input is syntactically correct. You can't interrupt it. With Menhir we have an alternative signature that gives us an interruptible parsing process. When you use a parser generated by Menhir, executing the parsing function gives you a checkpoint, which corresponds to a single step of the parsing process. Then you can take the state that you have at this point and, thanks to a very beautiful sum type,
you will know which case of the parsing analysis you are in, so that you can implement the interaction I showed in the earlier diagram between the parser and the lexer: the lexer can react to each step of the parsing process, for instance to promote words into reserved words when the parser state is compatible with that. Okay, those are the details; I will skip this.

What we have for the moment is a standalone program called Morbig, which is able to turn a shell script into a syntax tree represented as JSON. It is actually pretty efficient: we were able to parse the corpus that Ralf described in nine seconds on my laptop. So you could say: oh, so you're done. No, we are not done at all, because of course there are still bugs in that parser. I'm pretty sure there are some incorrect interpretations of the specification, some incorrect tree constructions in the process, and so on and so forth. So we are still deep inside this journey through hell. What we are trying to reach is a state in which we have the specification and the code side by side, and the code is high-level enough that the mapping between the specification and the code can be explained, shown and documented, so that an expert on the specification could say: oh yes, you did exactly what I had in mind when I wrote this piece of informal, natural-language text. Or, since that is more likely, they will say: oh no, you interpreted it wrong. So our goal is to be able to extract knowledge from the experts by being able to explain our code to them, and to do that we have to stay high-level.

So that's the end. I hope you will not have too many nightmares tonight. I thank you for your attention, and at some point in June we will have the first release of our tool. So if you are brave enough, try it and give us some feedback. Thank you. Bye-bye. If you have any questions...

Thank you for the talk. Are there any questions? So, thank you for an interesting introduction to this. How is this actually going to determine or detect this sendmail post-removal or pre-removal issue? Oh, you want to... Okay, that is of course a very good question. We seem to be, and in fact we are, still quite far from this final goal of being able to find such a bug. The parser is of course only the first element of our tool chain. What has to be done in the rest of the project is, building on the abstract or concrete syntax tree we construct for a script, to implement a tool which will do a symbolic execution of the script and determine precisely what the script is doing. So all the analysis which is going to happen still has to be done; the parser does nothing of this kind, it just does the syntactic analysis. What we have found so far with the parser is, in fact, a few bugs in shell scripts, but these would be considered quite trivial: trivial syntactic bugs or wrong invocations of commands, and we file bugs for them, but these are not the bugs we are aiming at. Of course, in the end we would like to find interesting stuff like the bug I showed before, interesting semantic bugs in the scripts, but we are still far away from that. Far away, but we have started; I mean, we already have some ideas about how to proceed. But no tool yet. Any other questions? I don't actually have a question, I just have a compliment: this was the sickest shell code I have seen in years.
You are a sick man, sir. Sick? Thank you, I guess. Given how difficult it is to parse shell, would you suggest using a different language for our postinst and postrm scripts? Yes, I think so. In fact the question comes up from time to time on the Debian mailing lists. I think yes, we need something better, but the question is of course: what? And obviously the question is how you balance simplicity of the language with the need for expressivity in some corner cases. Sometimes you really have to do complicated stuff, and then you need quite a powerful language. And the question is how you reconcile this with the need for something which is really simple, declarative and easy to understand. So it's not obvious. The short answer is yes, we need a better way to write maintainer scripts, but it's far from obvious what such a language should look like.

Could you define a subset of the POSIX specification that would allow you to check the scripts? A subset of the specification that is compatible with all the already-written maintainer scripts? So, what we did is some statistical analysis using this parser, and, well, there are some patterns that come back all the time, and there are surely some corner cases of shell that are simply not used by programmers, because they are not all crazy. So yes, it is likely that there exists a subset of shell that captures almost all the scripts, and the remaining scripts should probably be rewritten anyway, because they are too complex to be maintainable. And the pathological case that Yann showed you, the alias defined conditionally in two branches, does not occur in the corpus of Debian maintainer scripts, luckily. We were very glad about that; and in fact, as Yann explained, this case cannot be treated statically, so our Morbig parser would refuse such a shell script. And we used the parser to do some statistical analysis. We haven't prepared anything to show you the results today, but we hope to be able to show them to you at DebConf this year in August, where we will come again, I hope, and present the results of the statistical (not static, statistical) analysis of the corpus of shell scripts. Next.