Welcome back. I hope you had an excellent break. Our next talk is a special topic: imagine code creating code. It's always a really fun topic. So welcome, Kirill Borisov. Hi, Kirill.

Thank you. Hi. So, I am a little bit nervous: it's my first talk at a major conference, and entirely in English. So fingers crossed it won't go as awfully as it could.

Don't worry, don't worry. So where are you streaming from?

I'm streaming straight from the Netherlands. It's quite rainy here, and the weather could be nicer, but I still like it here.

Okay, great. So you're really talking about automatically generating code?

Yes. It's my passion, because I like everything code related, and in my spare time I try to create things that read and modify code. Once I started doing that, I inevitably crashed right into code generation.

So let's get started.

Thank you. Okay. So, greetings. I'm Kirill Borisov, and I have more than 15 years of programming experience under my belt, 10 of them tightly mixed with Python. I'm the creator of the pybetter code refactoring tool and the BlackConnect PyCharm plugin. And as I've said, I'm in love with everything code: I like writing code, reading code, modifying code, fixing code, whatever.

So, about this talk: what will you learn from it? We will talk about how code is written. We will cover a little bit of parsing, that is, how code is translated into some machine-level representation. We will introduce Hypothesmith, an excellent tool for random Python code generation, and we'll dive a little deeper into how it actually works. How is it even possible?

So let's start from the basics: code. What is it, really? Code is, as you know, our bread and butter. We get paid to write code, to fix other people's code, to make it execute, to make it execute faster, et cetera. And code is usually written by hand by regular people: we have an idea, we put it down on paper, we translate it into a file on the computer, and then it magically gets executed.

But as we people are fallible, we need not only to write code but also to check it. And we can't really trust other people to check it because, as I've said, we are fallible: we make mistakes, we have emotions, we have fatigue, et cetera. So a whole cottage industry of so-called linters appeared: a set of tools, many of which you know, that check your code for complexity and other things.

Before we get to them, let's remember how code is actually translated into a machine-level representation. Here it is, maybe not the simplest possible program, but the simplest to understand and dear to everyone's heart: print "Hello world". It's Python code, but how is it actually processed?

First, it's broken down into so-called tokens. As you can see, in this line "print" is a token of type NAME. The left parenthesis and the right parenthesis, and also the semicolon, are so-called OPs. And the "Hello world" itself is a STRING. So we already have some understanding of what goes into this line and how the code is structured. But it's too low-level to do anything useful with on its own.
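A minimal sketch of this tokenization step, using the standard library's tokenize module:

```python
import io
import tokenize

# Break the example down into the token stream described above.
source = 'print("Hello world")'
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'print', OP '(', STRING '"Hello world"', OP ')', NEWLINE, ENDMARKER
```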
So then we apply the grammar. A grammar describes how the source code of a language is translated into high-level structures, which can then be processed into something the machine will understand.

Here is an excerpt from the Python 3 grammar that describes the top-level structures: file_input, statement, simple statement, and small statement. In this context-free grammar we go top-down. We have a statement, which can be either a simple statement or a compound statement. A simple statement consists of one or more small statements, separated by semicolons, and usually ends with a newline. And a small statement, in its turn, can be one of a whole set of other statement types: an expression, print, del, and things like that.

If you combine the stream of tokens with the grammar, you get something like this: a low-level structure that determines how the code is grouped. As you can see, we have a file_input that consists of an end marker and a simple statement. The simple statement, in turn, consists of a power, a trailer, and a newline, and all of them can be mapped directly onto the token stream.

After that, the machine does some transformations and creates the so-called abstract syntax tree. That is the high-level structure the interpreter uses to translate your code into the bytecode that actually gets executed. And this is the level at which most code checking, formatting, and modifying tools work.

Let's get back to the linters and auto-formatters. As has been mentioned, they read your code: they check your code style, they check your code for security issues, for complexity issues, et cetera. Some can actually modify it and introduce new constructions into it. Many of you know pep8, PyFlakes, Black; those are just the most well-known examples of this kind of tool. There are many, many more, and maybe some of you have written one yourselves.

But there is a question that comes to everyone who has written at least one of those tools: how do I actually check them? How to check the checkers? The usual way is to use handcrafted examples. You try to imagine which kinds of code will trigger the checks you are interested in and which kinds will not, you write down as many permutations of these code examples as you can, and you say: hey, these should trigger the checks, those should not. And everything is OK, yes? For the most part, it really is OK. But here you are limited by your imagination, and code can be an expressive tool like no other.

Imagine that someone writes code in a totally different way than you are used to. Someone writes all their code on one line, or someone uses eight spaces instead of tabs, although that does seem a little crazy. And maybe, just maybe, your tool will break if it encounters code like that. The real world can really surprise you, and I know that from experience, sadly.

So what can we do about that? Well, we can try to step outside our own heads and ask: how can I get the most random code possible? One idea is to take a random set of characters as input and filter them by the criterion that they actually compile. It will work, sure, but as anyone who has read about the infinite monkey theorem knows, it can take a really long time, like the lifetime of our universe, even if our computer were the size of the universe. I don't say it can't produce something compilable in your lifetime, or maybe even in a year, but it's still much too much for your regular CI build.
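A sketch of this brute-force idea, just to show how hopeless it is (the counts here are illustrative):

```python
import random
import string

# "Infinite monkey" code generation: sample random characters and keep
# only the snippets that the compiler accepts.  Virtually none survive.
def random_snippet(length=10):
    return "".join(random.choices(string.printable, k=length))

attempts, valid = 100_000, []
for _ in range(attempts):
    snippet = random_snippet()
    try:
        compile(snippet, "<random>", "exec")
        valid.append(snippet)
    except (SyntaxError, ValueError):
        pass

print(f"{len(valid)} of {attempts} snippets compiled")
```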
So it is highly impractical in terms of time, and sadly it cannot really be used for our purposes. What should we do instead? Here comes a technique called procedural generation. As Wikipedia tells us, it's a method of creating data algorithmically. We have some kind of algorithm which we apply to the process of generating random data, so the output is not completely random: it has some structure inside it. And structure is king.

What we need are rules for how to arrange things inside the random code we generate: patterns for generating things that are correct and valid for this specific programming language. For example, an identifier should always start with a letter. It doesn't matter which letter, but a letter still, et cetera. And the rules need to cover the whole of the language; it wouldn't do to cover only a specific subset. If we are going for broke and want to test something completely random, something we would never think of on our own, we'll need to try harder.

And that sounds like a grammar, doesn't it? Yes, a grammar is exactly what will help us. A grammar is a structural representation of the things that can exist in a language: how they can be arranged and how they should look if they are to be considered valid by the language itself. So we should really just go and use it. The grammar, as you have seen, can be represented as a tree. The rules, so-called non-terminals, are the nodes of the tree, and the actual text that goes into the source code, so-called terminals, are the leaves. What we need to do is a random walk through the tree: go from the top down to the terminals and concatenate everything we get from those leaves, in order. With high probability, you will get code that compiles.

To help us with that, we have a tool called Hypothesis. Many of you have probably used it already to generate tests or data sets for your tests. Using so-called property-based testing, it generates a wide range of input data. For example, if you have a function that accepts two integer arguments A and B, it will try to call it with 0, 100 million, minus 1, et cetera. It will not test the whole possible range of values, and that's very important: if we tried to test the whole possible range of values for a given data type, we might never finish. The integers at least have a finite number of values, in computer representation anyway, but imagine if we needed to test all the possible strings of characters we can generate. Quite a long time, isn't it?

Hypothesis is based on the QuickCheck paper. What it does is generate examples, and then try to generate interesting examples. It moves through the whole space of possible values, tries the ones our tests approve or reject, and then it does so-called shrinking: if it finds a string of a thousand million characters that breaks your test, it will try to make it a million characters, then 10,000, then 1,000, et cetera, just because that is much easier to represent and reproduce. To do that it uses so-called hill-climbing search, which is outside the scope of this talk; basically, it's a method for navigating a many-dimensional space of possible values.

As an example, let's use something from the Hypothesis documentation.
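Reconstructed roughly from the documentation example he describes; the Repo class below is a hypothetical stand-in for a real repository object:

```python
import string
from hypothesis import given, strategies as st

class Repo:
    # Hypothetical stand-in for a real repository fixture.
    def __init__(self):
        self.branches = {"master"}
        self.active = "master"

    def checkout(self, name, create=False):
        if create:
            self.branches.add(name)
        assert name in self.branches
        self.active = name

def valid_branch_names():
    # Lowercase letters only, between 1 and 112 characters,
    # plus the special case of the "master" branch itself.
    return st.text(alphabet=string.ascii_lowercase,
                   min_size=1, max_size=112) | st.just("master")

@given(name=valid_branch_names())
def test_checkout_new_branch(name):
    repo = Repo()
    repo.checkout(name, create=True)
    assert repo.active == name
```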
It's a test that checks that we can actually create a branch with any possible branch name in some kind of repository. We have a valid_branch_names function that uses a thing called a strategy, provided by Hypothesis, to generate random text. We specify restrictions on that random set of characters: it should consist only of alphabetical letters, it should have minimum size 1 and maximum size 112, and they should all be lowercase; or we can also test the specific case of the master branch. Hypothesis will then take our test function, test_checkout_new_branch, generate branch names, and feed them to it one by one. Possibly some of them will pass; maybe some of them will break the invariant and break the test as well. And that is exactly what we want.

But on its own, Hypothesis cannot really generate random source code using simple strategies like the text one mentioned here. To generate source code for a specific language, we need a more advanced thing in Hypothesis: the Lark strategy. The Lark strategy builds on the Lark parser, a parsing toolkit for Python: you feed it a context-free grammar, and it generates the code to parse source described by that grammar into some kind of representation.

So the Lark strategy takes a grammar, in our case the Python 3 grammar, and turns it into a machine-readable representation, basically a tree. Then it walks that tree: on each level it selects a subset of the nodes, walks into them, and does that recursively until it reaches the terminals. The terminals, the actual text that will end up in our generated source code, are represented as regexes. The strategy takes each regex, generates a string that is matched by it, and that string is then accepted as a terminal.
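A toy version of the same idea, assuming Hypothesis is installed with its lark extra; Hypothesmith applies this to the full Python grammar:

```python
from lark import Lark
from hypothesis.extra.lark import from_lark

# A tiny arithmetic grammar.  Non-terminals (start, expr) are the nodes
# of the grammar tree; terminals (NUMBER, "+") are the leaves, defined
# as regexes and literal strings.
GRAMMAR = r"""
    start: expr
    expr: NUMBER | expr "+" expr
    NUMBER: /[0-9]+/
"""

numbers = from_lark(Lark(GRAMMAR, start="start"))
print(numbers.example())  # e.g. "3+14+159"
```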
But not every language can be generated like that on a whim, because Python is very quirky. As you know, unlike many other languages, it uses indentation to mark blocks of code instead of pairs of braces. Identifiers must be UTF-8 encodable, otherwise the interpreter will not recognize them as valid. And there is a lot of AST post-processing: the code described by the Python 3 grammar is not necessarily code that the Python interpreter itself will recognize as valid, because the interpreter does a lot of post-processing on what the grammar accepts. And, for example, we have the new PEG parser in Python 3.10 that does interesting things, which is again outside the scope of this talk, though we could talk about it for hours. If you're interested, you can read a series of posts by Guido van Rossum in which he walks you through writing, as simply as possible, a PEG parser for Python itself.

To help us with all this, there is a tool called Hypothesmith. It is inspired by Csmith. Csmith is a tool that has been used by researchers to test C compilers for bugs: it generates valid, random C programs. So why not do the same for Python, thought the creator of this tool? What Hypothesmith does is expose a strategy that can be used with Hypothesis. It builds on the Lark strategy, and it adds a number of post-processing mechanisms that smooth over all the quirks of the Python code we generate.

For example, it post-processes the generated code in terms of indentation, producing correct indents. It also tries to compile each example with the Python compiler and throws away the ones that do not compile, even though they are valid from the grammar standpoint. And it has experimental support for per-node generation: instead of generating any possible program, drawing from every production in the Python grammar, you can ask it to generate, say, if-expressions in all their possible forms, or imports, et cetera. It's a little experimental at this point, in my opinion, and it uses LibCST, but it has big potential.

So how does it look in practice? Let's look at this source code, which uses Hypothesis to generate our source code. First, we set a special set of options for the generation of random examples, because generating source code examples can be very slow. Not slow like the lifetime of the universe, but slow as in 5-10 minutes, and by default Hypothesis will say: no, this test takes too long, I will just fail it. So we tell it there is no deadline, take as long as you need, and do not worry about the high number of discarded examples, because, as we've said, quite a lot of the generated examples will not actually be compilable. The actual meat of the matter you can see in the test itself: the generated_source parameter passed to the test is produced by hypothesmith.from_grammar. We also tell it to generate a maximum of 1,000 examples, just for the sake of a demonstration. And instead of checking this code for anything, we'll just print it out to see what is actually being generated.
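A sketch of the test he shows, with the settings reconstructed from his description:

```python
import hypothesmith
from hypothesis import HealthCheck, given, settings

# No deadline (generation is slow), and don't fail on the many
# discarded examples that turn out not to compile.
@settings(deadline=None,
          suppress_health_check=[HealthCheck.too_slow,
                                 HealthCheck.filter_too_much],
          max_examples=1000)
@given(source=hypothesmith.from_grammar())
def test_print_generated_source(source):
    # No assertions yet: just show what gets generated.
    print(source)
```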
In the beginning, it will be something quite simple. Voilà: perfectly valid Python source code, two newlines. Okay, I can write that: newline and newline. Then: "if a..." Oh my God. But where are the spaces here? There are, like, no spaces here at all. Is this really valid Python code? It turns out: yes, actually. I myself would not in any way consider someone who writes without any whitespace a typical Python programmer, because it's actually hostile to other people. But for purposes of testing, this is exactly what we need, because, as I said before, we need something that we people would not consider possible or even necessary.

If you give it some time, it will graduate to more complex examples like this one. Yes, again, perfectly valid Python source code. It does look a little strange, but it compiles, and it could actually trigger bugs in your code: did you ever imagine that there could be no whitespace in your code at all, or that nonlocal variable names could be a set of Unicode characters, not only ASCII ones, et cetera? And given even more time, it will generate something completely unreadable. But still, it's valid Python source code, and it's still valid for purposes of testing. Maybe exactly this will uncover some bug in your parser code, or maybe your tool will choke on it and you will know that, hey, something went wrong; I need at least to check whether it also affects regular code written by humans.

To help with better example generation, Hypothesmith also uses the target search feature of Hypothesis: we use metrics to guide the search for random data through the space of all possible examples. The targets used by Hypothesmith are the number of bytecode instructions, the total number of AST nodes in the resulting source tree, and the number of unique AST node types. For example, if we generate a program that consists of, like, 1,000 print instructions, then possibly we need to do something to make the generated code more varied in the constructions it contains. By using these targets, Hypothesmith moves through the space and generates longer and more complicated code, which is a nice feature to have.
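A sketch of how such targets can be wired up with hypothesis.target(); this is a reconstruction of the idea, not Hypothesmith's actual internals:

```python
import ast
import hypothesmith
from hypothesis import HealthCheck, given, settings, target

@settings(deadline=None,
          suppress_health_check=[HealthCheck.too_slow,
                                 HealthCheck.filter_too_much])
@given(source=hypothesmith.from_grammar())
def test_guided_generation(source):
    # Reward bigger, more varied programs: Hypothesis steers the search
    # toward examples that score higher on these metrics.
    nodes = list(ast.walk(ast.parse(source)))
    target(float(len(nodes)), label="total AST nodes")
    target(float(len({type(n) for n in nodes})), label="unique node types")
```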
So, you will ask: are there any actual bugs found by Hypothesmith? Yes. As you can see from this list, and it is not a complete list, bugs it found can be traced to the Python interpreter itself. For example, there was a Python segfault that actually stopped the release of Python 3.9 alpha 1, and a segfault, as you may well know, is a very, very serious matter, because basically it breaks everything. There were also tokenize/untokenize round-trip bugs: for some reason, given a specific example of code, if you pass it to tokenize and give the resulting stream of tokens to untokenize, it will not reproduce the same text that was passed in, which again is not what is expected. There are also issues in well-known packages like lib2to3, Black, and LibCST, and I do think that if you try it on your own tool, you may find a bug as well. For example, I came to know about Hypothesmith when its creator came to one of my projects and said: hey, your tool does not process this specific piece of code correctly. And I'm like: where did you get this? Nobody writes code like that. Then I discovered Hypothesmith, and yes, it had found a serious bug in my tool, and I had to fix it, because while you may say the generated example was far-fetched and looks mostly like gibberish, the same bug could have been triggered by unusually written code from an actual human being.

As I said, there are some caveats to this approach. Most generated code is gibberish, as you've seen for yourself, and it can only serve as a smoke test. Smoke test is a term that comes from electrical engineering: when you've assembled some kind of device and plug it into a power source, if everything is assembled correctly, nothing dramatic happens, except that maybe it doesn't work as you expected; but if you made some errors and there's a short circuit, you will see smoke coming out of parts of your circuit, and that tells you something went wrong and you should look closely at what you have done. Here it serves the same purpose. It will crash your code, possibly in a place you never expected, and that serves as an impetus to look into what is going on: does it correspond to a real-world bug, and do you need to fix it or not? If you don't, a bug could still be lurking under the hood, ready to trigger something bad in the real world.

There's also no support for AST post-processing. As I mentioned, many of the code examples that would otherwise be considered valid from the grammar standpoint are thrown away, because we do not have a good way to generate code that satisfies those post-processing rules. But it is an area for improvement, and possibly one day this tool, or some other, will tackle that problem and improve the consistency of the generating process considerably.

Also, it can be quite slow, mostly because it spends time searching for examples that are actually compilable. Due to the previous point, many of them are thrown away, and it can take quite a lot of time: for every candidate piece of code, resources are needed to parse it, process it, transform it, and in the end determine whether it actually compiles.

If you are really interested in this, and I can understand if you're not, but for those who are, you can read a little more to understand how it actually works and how it can be applied further. First, I would highly recommend the paper "Finding and Understanding Bugs in C Compilers", which describes Csmith and how it works. Next is QuickCheck, the property-based testing tool described in its paper for Haskell; yes, I know, Haskell, but it still gives all the necessary background for understanding how this guided random example generation works and how it can be applied to your programs. There is also the very famous book "Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman, which is considered to be the textbook on parsing techniques and on how to write your own parser for any language that comes to mind. It's a little hefty, a thousand-something pages, but it's a gripping read and I cannot recommend it enough. And if you're just interested in how to apply random testing to your code, you can look into the articles that describe Hypothesis; a good starting point is "How Hypothesis Works" on its website. They will help you understand what you can do to improve the efficiency of code generation and how it can be applied to your specific testing situation.

So, any questions?

Thank you very much, Kirill, for your first talk at a big conference; it was very good. There is one question. We don't really have too much time, but I'll still ask it: will Hypothesmith and the other tools mentioned here work with Python 3.10 and the new PEG parser?

That's a very good question. Technically, they should, because we're using the Python grammar, which is not attached to the specific parsing method used by the Python interpreter. As I've said, it has already triggered some bugs in the PEG parser, so we can say that it works. And I think more may come later, if we dive into creating code examples that trigger the code transformations and the specific PEG parser features. But at this moment it relies only on the grammar to generate code samples, so you are good.

Okay, so unfortunately we are out of time, but if there are more questions, I'm sure you can check the chat. Thanks again.

Thank you for having me, and maybe see you at the physical conference next year. Fingers crossed.

Okay, so our next talk has a very