Hi, welcome back to Analyzing Software Using Deep Learning. We are in the second module, which is about recurrent neural networks and how to use them for code completion and program repair, and this is the second part, where we'll now look into the first application of recurrent neural networks. What we did in the first part was to look at RNNs and how these models actually work in general. Now we look at one concrete application of them, namely code completion with statistical language models. As with most of the things we talk about in this course, this idea is based on a paper that appeared relatively recently; actually, this is one of the earlier papers in this whole space, because it appeared at PLDI 2014. The concrete application that RNNs are used for here is code completion. What is code completion? Well, it's something that you've probably already used when you have programmed in an IDE. Given a partial program, a program in which some pieces are missing, the goal is to find some code to fill in these missing pieces. So your program has some holes, and you want, in this case, a model that predicts what to fill into these holes. Most IDEs have at least a basic version of code completion: if you press the right shortcut in the IDE, it will, for example, propose the next method to call or a variable to use. The code completion we talk about here is slightly more advanced than what you get in most of today's IDEs, because it is not just about filling in individual tokens; it wants to fill holes with sequences of method calls. Specifically, the code completion system wants to tell you what methods to call next and what arguments to pass to these methods. So let's have a look at a concrete example, which in this case is a piece of Java code that contains a couple of statements but is still missing some code.
If you look at this code and know Java a little bit, you may figure out that this is code for the Android platform, because it uses some APIs that are commonly used in Android. In this if statement here, there are actually two things missing, namely what to do if the if condition is true and what to do if it is false. Filling in these holes H1 and H2 is exactly what the code completion system is supposed to do, and specifically, it wants to fill in these holes with method calls, together with the arguments that are passed to each method. Now, to fill in these holes, the approach that we are talking about here uses a so-called statistical language model. So let's have a look at what such a statistical language model actually is. It typically consists of a couple of things. One of them is a dictionary of words, which basically tells us all the words that we have in our language. Then there's a notion of sentences, which, unsurprisingly, are sequences of words. And then there's a model that helps us predict which sentences are normal in this language by basically defining a probability distribution over all possible sentences. So for every sentence, this model tells us how likely the sentence is. For example, let's say our language is English. If we ask this model about the probability of a sentence consisting of "hello" and "world", then the sentence "hello world" probably has a higher probability than a sentence with the same two words in the other order, "world hello". By knowing how likely these two sentences are, you can basically tell which of them is probably the correct one. There are many ways such models could be defined. We here focus on the most basic kind of model, which essentially predicts the next word based on all the previous words, or, equivalently, can tell you the probability of the next word given all the previous words that we already have in the sentence.
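To make the idea of scoring sentences concrete, here is a tiny sketch of such a model; the probability table below is invented purely for illustration and is not taken from the paper:

```python
# A toy statistical language model: P(next_word | previous_word).
# All probability values here are made up for illustration only.
cond_prob = {
    ("<s>", "hello"): 0.10,   # "hello" often starts a sentence
    ("hello", "world"): 0.40, # "world" commonly follows "hello"
    ("<s>", "world"): 0.01,   # "world" rarely starts a sentence
    ("world", "hello"): 0.01, # "hello" rarely follows "world"
}

def sentence_probability(words):
    """Multiply the conditional probability of each word given its predecessor."""
    prob = 1.0
    prev = "<s>"  # special start-of-sentence marker
    for w in words:
        prob *= cond_prob.get((prev, w), 1e-6)  # tiny fallback for unseen pairs
        prev = w
    return prob

p1 = sentence_probability(["hello", "world"])  # 0.1 * 0.4 = 0.04
p2 = sentence_probability(["world", "hello"])  # 0.01 * 0.01 = 0.0001
assert p1 > p2  # the model prefers the natural word order
```

This is exactly the ranking argument from above: the model assigns "hello world" a higher probability than "world hello", which tells us which sentence is probably the correct one.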
So mathematically, this means the probability of a sentence is given by multiplying, for all words i ranging from 1 to m, the probability that we would pick word w_i given some information about the past, which, similar to the RNNs, I'm here calling h_(i-1): P(s) is the product over i = 1 to m of P(w_i | h_(i-1)). Specifically, s is the sequence of words w_1 to w_m, and h_i stands for all the words up to the point in time i. Now, this idea of a statistical language model is not specific to programs or to software, and the question is how we can use it for code completion. The basic idea is to use a model that can predict the likelihood of sentences by formulating programs as sentences. So what we do here is look at program code as sentences in a language. And once you have this notion, code completion essentially becomes the problem of finding the most likely completion of the current sentence: the existing code is the current sentence, and by completing this current sentence, we are essentially completing the code. Now, this is an interesting and very general idea, but there are of course a couple of challenges if you want to make it a reality. One of them is how to actually abstract source code into sentences. There are many ways one could do this, and we will look at one specific way here in this lecture. Another interesting challenge is what kind of language model to use; we will see two answers to this question and then actually see that the combination of these two answers works best in practice. And finally, there's the question of how to efficiently predict a completion, because you typically want to use a code completion tool in an IDE, where a programmer needs a tool that is quicker in suggesting how to complete the code than the programmer would be in actually writing it herself or himself. So let's now have a look at how the approach we are talking about here, which is called SLANG, addresses these challenges.
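As a small preview of what abstracting code into sentences could look like, here is a sketch of the general idea; the tuple-based encoding and the helper and field names (`calls_to_sentence`, `base`, `args`, `ret`) are my own illustration, not SLANG's exact representation:

```python
# Sketch: abstracting one object's method calls into a "sentence".
# Each word is a (method_name, position) pair, where the position records
# where the object of interest appears in the call (here: 0 = base/receiver,
# 1 = first argument, "ret" = return value). Illustrative encoding only.

def calls_to_sentence(calls, obj):
    """Keep only the calls involving `obj`, recording the object's role in each."""
    sentence = []
    for call in calls:
        if call["ret"] == obj:
            sentence.append((call["name"], "ret"))
        elif call["base"] == obj:
            sentence.append((call["name"], 0))
        elif obj in call["args"]:
            sentence.append((call["name"], 1 + call["args"].index(obj)))
    return sentence

# Two calls, three objects: smsManager, message, messageList.
calls = [
    {"name": "getDefault", "base": None, "args": [], "ret": "smsManager"},
    {"name": "divideMessage", "base": "smsManager", "args": ["message"], "ret": "messageList"},
]
assert calls_to_sentence(calls, "smsManager") == [("getDefault", "ret"), ("divideMessage", 0)]
assert calls_to_sentence(calls, "message") == [("divideMessage", 1)]
```

Each object gets its own sentence, and the same call can appear in several sentences, once per object involved in it.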
So here's an overview of the approach, which I've shamelessly stolen from the original paper at PLDI 2014. At the core of all of this is a statistical language model, which we have just briefly talked about. The approach has two main phases. One is a training phase, during which this model is trained to basically learn the language that it's modeling, and then, once we have a trained model, there's a query phase, where we use the trained model to provide code completions. For the training phase, we need some training data set, which basically consists of code examples that are abstracted into sentences, and we'll see how this works. In the query phase, the approach is given a partial program that has some holes, which is abstracted in the same way as during the training phase, giving us a sentence with holes. Then we can look up in the language model which candidate sentences it suggests to fill into these holes, combine this with some additional constraints that we get from the original program, and at the end provide some completions to the developer. So let's now look deeper into the individual components of the approach, and let's start with the language model, because this is really the core of the approach. I already mentioned that two different language models are used in SLANG, and one of them is actually not a deep-learning-based language model but an n-gram language model. So what is an n-gram language model? I've previously talked briefly about this idea of modeling a language by looking at all the previous words in the sentence in order to predict what the next word will be, but there's a problem with these general history models. The problem is that your training data may not contain anything that started exactly like the sentence that you're being provided.
So let's say you have some previous or partial sentence h_i; then maybe your training data does not really contain any information about what comes next, or has just a small number of examples, which are not enough to make a good prediction. The problem is that the training data may not contain anything about a given h_i, so about a given sequence of previous words. Now, the idea of an n-gram model is to not look at the entire history but to say that the next word depends only on the n-1 previous words. In a sense, this is basically looking at a limited history in order to make a better prediction, because by looking at only the n-1 previous words, there's a much higher chance that you have seen these n-1 previous words somewhere in your training data. With this formulation of the problem, the probability of a given sentence now looks as follows: it's again the product over words i ranging from 1 to m, where for each word we look at the probability of this word w_i given the previous words, but only the n-1 previous ones, so P(s) is the product over i = 1 to m of P(w_i | w_(i-n+1), ..., w_(i-1)). So let's have a look at a concrete example, and before we look into examples from programs, let's use an example in English, simply because it's easier to understand. Let's say we want to predict the probability of the sentence "to be or not to be", and let's say that our n for the n-gram is 3, so the model predicts every word looking at the two previous words before that word. In this case, the probability of this entire sentence will be the product of the probabilities of all the words in the sentence, each time looking up to two words back, if there are any words, of course.
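Such trigram probabilities are estimated from counts in a training corpus. Here is a minimal sketch of that, with a tiny made-up toy corpus; a real model would also need smoothing for unseen n-grams, which I omit here:

```python
from collections import defaultdict

# Sketch of a trigram (n = 3) model: estimate P(w | two previous words)
# from counts in a tiny toy corpus. "<s>" pads the start of each sentence.
corpus = [
    "to be or not to be",
    "to be is to do",
]

trigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for line in corpus:
    words = ["<s>", "<s>"] + line.split()
    for i in range(2, len(words)):
        context = (words[i - 2], words[i - 1])
        trigram_counts[(context, words[i])] += 1
        context_counts[context] += 1

def p(word, w1, w2):
    """Maximum-likelihood estimate of P(word | w1 w2)."""
    context = (w1, w2)
    if context_counts[context] == 0:
        return 0.0
    return trigram_counts[(context, word)] / context_counts[context]

# Score "to be or not to be" as a product of trigram probabilities.
sentence = "to be or not to be".split()
words = ["<s>", "<s>"] + sentence
prob = 1.0
for i in range(2, len(words)):
    prob *= p(words[i], words[i - 2], words[i - 1])
print(prob)  # 0.5, because only P("or" | "to be") = 1/2 is below 1 in this corpus
```

Every factor here conditions on exactly two previous words, so the model only ever needs counts of three-word windows, which it is far more likely to have seen in the training data than a whole sentence prefix.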
So we will have the probability of "to" given the empty history, because there's nothing before the "to", so there's nothing to look back at; times the probability of "be" given the beginning of the sentence, which is "to"; times the probability of "or" given the two previous words, "to" and "be"; and so on, until we reach the end of the sentence, where we look at the probability of "be" given "not" and "to". So this product includes all the words in the sentence, but for every individual word looks back only n-1 words, in this case two. Now, how do we know the probability of these so-called n-grams, so basically of the previous words followed by the next word? This probability is what we estimate from a set of training examples, which basically tells us for every n-gram how likely it is. For this example of English, the training corpus could, for example, just be a set of English texts, or maybe Shakespeare's texts, and for programs, of course, we want a training corpus that consists of programs, abstracted in the way that gives us sentences. By looking back only at the n-1 previous words, an n-gram model of course has a very limited view of the past that it can use to predict the future, and this limitation is addressed by the second kind of language model that we want to discuss here, which is an RNN-based model. This model is a recurrent neural network, like the RNNs we have already seen earlier in this course, just that now the input is the previous word in the sentence, and what the RNN predicts is the next word in the sentence. Because this is an RNN with this recurrent connection up here, it is able to store information about all the previous words. In practice, of course, the capacity of the hidden layer h that is used to store information about the past is limited, so the RNN needs to make some decisions about what information to keep and what information to throw away, but essentially it can store all information
about the past, bounded by the capacity of the hidden layer, of course. Now, how do we encode these words, so the previous word given as input, and how is the next word predicted? Here we need some encoding into a vector, and what is used in the SLANG approach, and also in many other approaches, is the simplest encoding that you can basically come up with, called a one-hot encoding. Later in this course, we will see some other ways of encoding parts of programs, or also natural language words, into compact vectors, but for now let's stick to this one-hot encoding. So what does one-hot encoding really mean? It means that the length of the vector is equal to the size of your vocabulary, and all values in the vector are 0 except at the position of the word that we want to represent. So every word in our dictionary has one position in this vector; everything is 0, but one specific element is set to 1 in order to say that this is the word we want to represent. This looks something like this: we have this usually very long vector of 0s, and somewhere in there we have one specific position set to 1, because this is basically the word we would like to represent. This kind of vector is given to our RNN, and the RNN predicts the next word; the next time step then gets this next word as its input and predicts the next next word, based on the previous words that it stores in this hidden layer h. Now, to use any of these language models, the n-gram-based one or the RNN-based one, we need to represent source code as sentences, so we need to define what the words are and how to compose sentences from them. Remember that we want to predict sequences of method calls for these missing pieces of code. So what the approach does is abstract code into sentences by basically saying that every method call is a word, and a sentence is a sequence of method calls for one object. To be a bit more specific, the approach uses separate sentences for every object. So basically, for
every object that occurs somewhere in the code, a separate sentence is constructed. Objects can occur in different kinds of roles within a method call: an object can occur as the base object of a call, it may occur as an argument to a call, or it may occur as the return value of a call. No matter where a specific object occurs, everything belonging to this one object is put into a sentence, and these sentences are then fed into our language models. Now, the question is how we can actually get these sentences, so these sequences of method calls, from a program. As in practically all program analysis tasks, there are basically two options: one is dynamic analysis, and the other is static analysis. Let's look at the first option, dynamic analysis. Dynamic analysis means that we execute the program and then, during the execution, observe something; in this case, we would observe all the method calls that happen. The big advantage of dynamic analysis is that we get precise results, because we see exactly what is happening during the execution and do not have to approximate anything. The big disadvantage is that we can only analyze what actually gets executed, and if there is some part of our code corpus that never gets executed, we will basically not see anything about this part of the corpus. Let me illustrate these advantages and disadvantages using this little example down here. We have an if statement that depends on some value we don't know, because it is read from some input, and depending on this value, we execute this branch. Now, if this branch is always taken, so basically if we always go here, then what we will see is only this one call, and our analysis will not extract anything about object.bar. So it's missing some information that may be potentially interesting but, in our executions, just never gets executed. At the same time, the information we get is very precise, because we get it from a concrete execution and know that whatever we see has indeed happened. The
other option for extracting information from a program is static analysis, which does not execute the code but reasons about possible executions without actually running it. The big advantage of this kind of approach is that we can consider all execution paths, which means we can basically consider everything that might happen in any possible execution. The big disadvantage is that there are some pieces of information we just don't know without executing the code, and as a result, we need to abstract and approximate the actual executions in some way. Again, let me illustrate this with an example. Let's say we again have an if that depends on the input, so we do not really know whether this branch gets executed, and therefore whether or not to include this call here; if we don't really know, then probably we will include it. Another uncertainty in this example is that we do not know whether a and b refer to the same object. They are different variables, that's clear, but they may be aliases and point to the same object. As I said, we want to extract all calls related to a specific object into the same sentence, so if a and b are the same object, then their calls should actually end up in the same sentence, but if they are not, they should not. Without really executing this code, we may not be able to tell whether a and b point to the same object or not. This challenge we are facing here, that we are not really able to precisely analyze all possible executions of the program, is a general dilemma for program analysis that occurs basically everywhere a program is automatically analyzed. The result of this dilemma is that you basically always have to choose between over- and under-approximation, and we'll see in a second what exactly this means. To explain what I mean by these terms, let's suppose we have some program P and some input i, and as a result of running the program with this input, we'll get some behavior
which I'll denote with P(i). Now, if you think about the space of all possible behaviors of a given program, then there may be some behavior P(i1), there may be another possible behavior P(i2), and yet another possible behavior P(i3). What we would ideally like to see in a program analysis is this set of all possible behaviors; this is what we would like to analyze. But as it turns out, this is not really possible, as I've already illustrated with the simple examples on the previous slide. What we instead have to do is decide whether we want to over- or under-approximate this set of possible behaviors. One way to approach this problem is to just look at a subset of these possible behaviors, which is what is called under-approximation, and this is essentially what most dynamic analyses do: these analyses execute the program with some finite set of inputs and, as a result, see some finite set of behaviors, which may not cover all possible behaviors, so they under-approximate the set of all possible behaviors. The other option is to go in the other direction and also consider some behavior that is actually not possible, so there's some behavior that is not really possible but is nevertheless included in this larger set, and this is what is called an over-approximation; this is what most static analyses do. Now, given these two options, under-approximating with a dynamic analysis and over-approximating with a static analysis, SLANG takes the second option: it runs a static analysis to extract call sequences that are then used as sentences for our language model. Specifically, the static analysis works as follows. Whenever it sees a loop, it bounds the number of analyzed loop iterations: even though the loop may run more often in reality, the static analysis assumes that it runs at most some defined number of times. Whenever the control flow joins, so basically at places in the
code where different flows of control come in, and the control may come from here or from there, what the static analysis does is take the union of all the possible call sequences that it has seen on both of these incoming flows of control. And then, I briefly mentioned this problem of determining whether two variables alias each other, so whether they may point to the same object at run time. What the static analysis used in SLANG does here is to use an existing points-to analysis, which reasons about references to objects and basically tells us whether two variables may point to the same object or not. Using all of these approximations, we eventually get a set of sentences, which correspond to a set of sequences of method calls extracted from our code corpus. So let's have a look at the information that is extracted by the static analysis, and for that purpose, let's look at the example that we see here. In this code there is an if, and in a static analysis we do not really know whether this condition will evaluate to true or false, so what the analysis does is consider both cases: the case where we execute this statement here, and the case where we directly jump past that statement, because there is nothing in the else branch. Now, given this example, what the analysis will extract are five sequences of method calls, which correspond to different paths through this code and to the different objects that are at the core of each sentence. So let's look at them one by one. The first two are centered around this SmsManager object here, and the reason why we have two sequences is that we do not really know whether the second statement in which SmsManager is involved actually executes. So there is one sentence for the case where the if evaluates to false, and in this case we have exactly one call SmsManager is involved in, which is a call of the getDefault method, where the SmsManager object is the return object. The second
case here, and then there is another call, of divideMessage, where SmsManager occurs as the base object, which is represented by the zero at this place in the parameter list here. Then there are two sentences for message, which are basically similar to what we have just seen; it is about this usage of message and that usage of message. The first one is again for the case where the if is not executed, so we just see this call of length, where message again occurs at position zero, so as the base object. For the second sequence of message, we also have this call of divideMessage, where message occurs as one of the arguments, and that's why the position here is one, which corresponds to the first argument. And then, finally, we have another object here, namely this messageList, which occurs only once, as the return value of a call to divideMessage, so we have another sentence like this. So basically, each of these sentences consists of a sequence of calls, and each of these calls consists of a name and the position at which the core object of the sentence occurs. Now, once the analysis has extracted these sequences of method calls, they can be used to train our statistical language model. In the evaluation of the paper that this part of the course is based on, they use 3 million methods from various different Android projects, so it's a pretty large corpus, and then extract sentences, so sequences of method calls, from each of these methods using the static analysis. These sentences are then used to train the statistical language models that we talked about, so both the n-gram model and the RNN-based model are trained using that data. Once the models are trained, we can use them for querying, to find out how to complete a partial piece of code. What we are given is a method that contains some holes, and into these holes we want to fill in some method calls. So what the approach
does is go through each of these holes, and for every hole, consider all possible completions of the partial call sequence that is already in the given method. In reality, it's not really going through all possible completions, as I'll tell you in a minute, but for now you can basically assume that it considers all of them. For each of these possible completions, so basically each of the ways the existing calls can be complemented by additional calls, the approach queries the language model and gets a probability that this is the right completion. The approach uses two language models, the n-gram-based model and the RNN-based model, and what the people who did this work found out in the evaluation is that the average of the predictions of the n-gram model and the RNN model is actually the most effective in finding the right completion. So what the approach eventually returns is a completed piece of code, which adds those method calls such that the overall probability, as predicted by the language models, is maximized. Let's have a look at how this works by going back to the example that we have already seen earlier. Here we again have this piece of Android code with these two holes that we would like to fill in, and what the approach does is take the existing calls in this code, like this call to getDefault or to divideMessage, and then ask the model to complete this started sentence with the most suitable words, or, in this case, the most suitable method calls. What it would predict is that if the if branch is taken, so basically if this divideMessage call is also executed, then what we should fill into the first hole is this call to sendMultipartTextMessage, which essentially splits a larger message into multiple text messages, and it also tells us at which position in this newly added call to use the variable messageList, namely here as the second argument. Likewise, for the other hole that we have, where
the if branch is not taken, it also predicts a method call, but because in this case the call to divideMessage does not happen before, there is a different prefix of words, or method calls that have already happened, and therefore the language models also make different predictions: here they predict that we should call sendTextMessage, and again tell us where to use one of the variables that we already have, namely this variable message. So as you can see, this approach is able to predict the right method calls to be called here. Of course, it does not always work like this, but at least in many cases it does. I've already hinted at the fact earlier that querying the language models for all possible extensions of a given sequence of method calls may not be feasible, simply because there are too many of these possible extensions. So in order to keep the time that the code completion approach takes to suggest a completion to a developer within reasonable bounds, the approach comes with a couple of refinements of what we have seen so far. One of these refinements is that a user may provide hints about what kind of method calls to fill into the holes: one of these hints is about how many different calls to insert, and another one is about what objects to possibly use in these method calls, and this reduces the search space, so that the language models have to be queried less often. Another refinement is to replace some infrequent words, so some infrequent method calls, with a special word, unknown, which basically means the approach does not reason at all about these infrequent method calls; it is also not able to predict anything about them, but at the same time, this significantly reduces the search space. And then the final refinement is about using a simpler language model as a pre-filter to decide which queries to ask the other language models, and this simpler language model is a bigram model, so basically an n-gram model with n equal to 2, which means we are taking the
previous call and then just ask what the likely next calls are. Out of these likely next calls, the full queries for the language models are constructed, so that only a subset of all candidates that would otherwise be considered is given to the language models. As a result of all these refinements, it is possible to explore this space of possible completions in reasonable time and to provide code completions faster than a developer can actually type the code herself or himself. Alright, this is already it for the second part of this module on RNN-based approaches for analyzing software. In this second part, you have learned how to use RNNs for code completion, and what we will do in the third and last part of this module is look at another application, where we will see how to use RNNs for repairing syntax errors. Thank you very much for listening, and see you next time.