Hi, and welcome back to analyzing software using deep learning. This is part three of the module on recurrent neural networks and how to use them for program analysis tasks. In this third part, we want to look into another application of recurrent neural networks: repairing a specific kind of error in programs, namely syntax errors. Again, this is based on recent research published in a 2016 paper, which of course gives many more details than I can give you in this course. Let's start by looking at the motivation for this work. What we are given is a program that has a syntax error, which means that the source code does not fully comply with the grammar of the underlying programming language. What we would like to have is a fix that removes the syntax error by modifying the program in such a way that afterwards it does comply with the grammar of the language, while still looking more or less like the original program. There are many scenarios where you might want an approach that automatically fixes such syntax errors. The specific context that motivated the work we are talking about here has been MOOCs, massive open online courses, where you have submissions by maybe hundreds or thousands of different students who all submit different pieces of code, and you would like to give automated feedback on how to remove some of the errors that may exist in the student code, and specifically how to remove syntax errors. To fully understand this task, let's look at one or two examples. What you see here is a piece of Python code that looks almost correct but has some small mistake. If you stare at this for a few seconds, you will probably realize that something is wrong around here, because there is something missing.
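The kind of check involved here can be sketched with a small hypothetical snippet in the same spirit as this first example. The exact code from the slide is not reproduced here, so both the broken and the fixed version below are illustrative:

```python
# Hypothetical snippet in the spirit of the first example: a closing
# parenthesis is missing, so Python rejects the code at parse time.
broken = "print(len('hello')"       # missing ')'
fixed = "print(len('hello'))"       # adding ')' makes it parse

def parses(src):
    """Return True if `src` is syntactically valid Python."""
    try:
        compile(src, "<student>", "exec")
        return True
    except SyntaxError:
        return False

print(parses(broken))  # False
print(parses(fixed))   # True
```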
And indeed, if we modify the program by adding this closing parenthesis, then it is syntactically correct, whereas before it was not. Now, finding out that something is syntactically incorrect is relatively easy for well-defined programming languages: essentially, you just need to parse the code, and if this does not work, you know that there is a syntax error. But how to fix the code is not always that easy. Let's look at another example, which again is a piece of Python code with a syntax error. This one may take a little longer, but if you stare at it for a few seconds, you will probably see that something is wrong around here: there is a return statement, but at the same time there is an equals sign, which makes it look like an assignment, and this is not really how return statements work in Python. One way to fix this could, for example, be to replace the expression we had before with just some variable that also happens to be used in the same function. By changing the code in this way, we get a syntactically correct piece of Python code. What this example also illustrates is that not every fix of a syntax error is a semantically correct fix. In this particular example, this is probably not the way you want to fix the code; at least based on the function name, it should do something else. But it is a fix that removes the syntax error, so from the point of view of the approach we are looking at here, it is a sufficiently good way of fixing the code. How to find a semantically correct fix is of course also an interesting question, but it is out of scope for the work we are talking about here. Let's now look at a learning-based approach that automatically finds fixes for such syntax errors. The approach we want to talk about is called SynFix, and I'll first give you an overview of how it works.
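To make this second example concrete, here is a hypothetical function in the same spirit; the function name and body are made up for illustration. The broken version puts an assignment inside a return statement; the fix returns just a variable, which parses fine but, as discussed, is probably not what the author intended semantically:

```python
# Hypothetical versions of the second example (names are illustrative).
broken = "def average(a, b):\n    return total = (a + b) / 2\n"
fixed = "def average(a, b):\n    return total\n"  # parses, likely wrong semantically

def parses(src):
    """Return True if `src` is syntactically valid Python."""
    try:
        compile(src, "<student>", "exec")
        return True
    except SyntaxError:
        return False

print(parses(broken))  # False
print(parses(fixed))   # True
```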
The SynFix approach takes two inputs. One is a piece of syntactically incorrect code; because this work came up in the context of massive open online courses, this piece of code is actually a student submission that has a syntax error. And because this is a learning-based approach, we also need some training data. In this case, the training data is a corpus of syntactically correct code, more specifically syntactically correct student submissions. Given such a corpus of correct code, there are different ways in which we could formulate a learning problem that helps us fix syntax errors. The approach used here is based on a recurrent neural network. Specifically, there is a learned RNN-based model that encodes information about how to complete or fix code in order to obtain syntactically correct code. What exactly this model looks like is what we'll see in a minute; for now, let's just treat it as a black box. So there is some kind of RNN-based model that says something about correct code. This model, along with the incorrect code, is given to the SynFix algorithm, which then tries to find feedback to give to the student. Specifically, this feedback comes in the form of a suggested fix that makes the code syntactically correct. What's important in this setup is that it is not a complete end-to-end learning-based approach: it is not just a model that directly suggests the feedback. Instead, the learned model is one component within a more complex program analysis. Let's now have a more detailed look at the RNN-based model that is one of the core components of SynFix. Since this is an RNN-based model, it reasons about sequences of things, and in this case the sequences are sequences of tokens. So the program is represented simply as a sequence of individual tokens.
Intuitively, the model looks at a sequence of tokens and predicts the most likely next token in this sequence. The way this works is that we have an input layer, a hidden layer, and an output layer. I'm going to show you the unrolled version of this RNN, where you see the input, hidden, and output layers for different steps in time. Let me put some more labels here so that it becomes easier to understand: down here is the input layer, in the middle is the hidden layer, and on top is the output layer. What is given as input to this model is a sequence of tokens. Let's use a specific example based on one of the examples we've seen before: the sequence of tokens consisting of if, base, and equals-equals. The expected output of the model is always the next token in the program, or, if you don't know it, the token that is most likely to come next. Given if, we want the model to say: the next token is base. Given if and base, we want the model to say: the next token is equals-equals. And given if, base, equals-equals, we hope that the model predicts one, because that is, in this case, the most likely way of completing the code. Now the question is how to make the model produce such predictions. As usual in these learning-based approaches, there is a training phase, and what we do in the training phase is provide a lot of input sequences and output sequences that make sense, because they come from the correct student submissions. Specifically, we provide the expected output sequence for a given input, and this output sequence is simply the input sequence shifted by one. So we train the model to always predict the next token as we've seen it in the training corpus of code.
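As a rough sketch of this training setup, assuming programs are flattened into token sequences, one could build input/target pairs like this, where the target is just the input shifted by one token. The helper names below are illustrative, not from the paper:

```python
import io
import tokenize

def to_tokens(src):
    """Tokenize Python source into a flat list of token strings,
    dropping purely structural tokens for this illustration."""
    toks = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        toks.append(tok.string)
    return toks

def training_pairs(src):
    """Input sequence and the same sequence shifted by one token:
    at each position, the target is the next token."""
    toks = to_tokens(src)
    return toks[:-1], toks[1:]

xs, ys = training_pairs("if base == 1:\n    pass\n")
print(xs)  # ['if', 'base', '==', '1', ':']
print(ys)  # ['base', '==', '1', ':', 'pass']
```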
Once the model is trained, we can move on to the prediction phase. Here we provide a partial program, and specifically this partial program ends at the location where we know the syntax error to be. Finding this location is usually relatively easy: if you give a syntactically incorrect program to an interpreter or compiler, it will give you an error message telling you that there is a syntax error at a specific location. Given this partial program that ends right at the error location, we ask the RNN-based model to generate the next token, and we then use this predicted token to suggest a fix for the code. Of course, this need not be just one token: if we take the first token the model suggests, feed it into the model again, and ask what the next token after that is, we can get a list of generated tokens that we can then use to find a fix. Now that you know how the RNN-based model works, let's look at how the SynFix algorithm uses this model to find a fix for a syntactically incorrect piece of code. What is given to the algorithm is the program with the syntax error and the error location, so the specific line and also the character position where the syntax error is. At a very high level, the algorithm queries the model with parts of the given code to find out how the model would complete the code, and then tries to insert this completion into the existing code in order to obtain a piece of code that is syntactically correct. More specifically, we start by tokenizing the program, so that in the end we have a sequence of tokens. The model is then queried using a prefix of these tokens, and at first it is queried with the prefix that ends at the error location.
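The two steps described here, locating the error and then repeatedly asking the model for the next token, can be sketched as follows. The RNN itself is stubbed out by a trivial `predict_next` callable, and all names are illustrative rather than taken from the paper:

```python
def error_location(src):
    """Return (line, column) of the first syntax error, or None if
    the source is syntactically valid."""
    try:
        compile(src, "<student>", "exec")
        return None
    except SyntaxError as e:
        return (e.lineno, e.offset)

def generate(prefix, predict_next, n=3):
    """Extend a token prefix with up to n model-predicted tokens,
    feeding each prediction back in as new input."""
    out = []
    for _ in range(n):
        tok = predict_next(prefix + out)
        if tok is None:
            break
        out.append(tok)
    return out

broken = "if base == 1\n    return base\n"   # the ':' is missing
print(error_location(broken))                # error reported on line 1

# Stub model: always completes an `if` header with ':'.
suggested = generate(["if", "base", "==", "1"],
                     lambda toks: ":" if toks[-1] != ":" else None)
print(suggested)  # [':']
```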
The model will predict a token, and if we query it again with this token, it will predict more tokens. The algorithm then tries whether it can either insert this predicted list of tokens into the code, or replace one or more tokens in the existing code around the error location, in order to fix the error. If either of these works, so that in the end we get a piece of code that no longer has the syntax error, then we are done, and this suggested fix is given back to the user. What may also happen is that none of the suggested tokens work, neither for inserting nor for replacing existing tokens. In that case, the algorithm deletes the entire line that has the syntax error and queries the network again, but now using a prefix that ends just before the error line, because we have removed this line. The model will then again predict some tokens, which could replace the entire error line, and the algorithm checks whether replacing the error line with the predicted sequence of tokens fixes the error. If it does, this fix is given back as a suggestion to the user. All right, this is all I want to say about this specific application of RNNs for fixing syntax errors. There are many more details about this approach, and also a nice description of an evaluation, so if you're interested in how exactly it works or how well it works, I invite you to have a look at the corresponding paper. For the purpose of this course, this shall be enough, because it hopefully gave you an idea of how recurrent neural networks can be used for fixing syntax errors.
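The control flow of this repair loop can be sketched in a strongly simplified form. The real algorithm works at the token level and uses the trained RNN; here the prediction is a hard-coded stub, the edits are line-based, and all names are illustrative, just to show the try-insert, then try-replace pattern:

```python
def parses(src):
    """Return True if `src` is syntactically valid Python."""
    try:
        compile(src, "<fix>", "exec")
        return True
    except SyntaxError:
        return False

def try_fix(lines, err_line, predicted):
    """Try inserting `predicted` at the end of the error line, then
    replacing the whole error line with it; return the fixed source
    if one of the two attempts parses, else None."""
    # Attempt 1: insert the predicted text at the error location.
    candidate = lines[:]
    candidate[err_line] = lines[err_line] + predicted
    if parses("\n".join(candidate) + "\n"):
        return "\n".join(candidate) + "\n"
    # Attempt 2 (fallback): replace the entire error line.
    candidate = lines[:]
    candidate[err_line] = predicted
    if parses("\n".join(candidate) + "\n"):
        return "\n".join(candidate) + "\n"
    return None

broken = ["if base == 1", "    pass"]
# Stub prediction: the model suggests ':' after the error prefix.
fix = try_fix(broken, 0, ":")
print(fix)  # if base == 1:\n    pass
```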
What you should take away from this module on RNNs and their applications in analyzing programs is that RNNs are a very powerful kind of model, and that they are particularly well suited if you have a problem where your input, or both your input and your output, consists of something that you can represent as a sequence; for example, a program can be represented as a sequence of tokens. We've looked at two such applications: one that completes code by adding method calls, and one that fixes syntax errors by replacing or adding a few individual tokens. I hope you enjoyed it and that you've learned something. Thank you very much for listening, and see you next time.