Hi, welcome back to Analyzing Software Using Deep Learning. This is the third part of this module on using hierarchical neural networks for analyzing software. What we want to do in this third part is to look at the second application of hierarchical neural models, which reasons about code changes and specifically tries to find embeddings, or vector representations, of code changes so that you can then make predictions about these code changes. Again, this is based on a recent paper on a tool called CC2Vec, and if you're interested in more details, you should of course look them up in the paper.

Let's start with a bit of motivation. Why do we actually want to reason about code changes, and why do we want to represent code changes using a neural model? The reason code changes are important is that the source code of successful projects is evolving all the time. It's not that you write code and then it stays as it is; things are changing all the time. You're adding more features, fixing bugs, supporting new environments, and so on. Now, if you can reason about these code changes, and if you have a way to represent code changes using a neural model, you can use this to make a number of useful predictions. Some of these predictions are also what the approach we are talking about here is actually used for, and we will come back to them toward the end of this part of the course. Specifically, you can try to predict what the commit message for a code change should be, so that a developer doesn't have to type the commit message each time, but has it predicted automatically by a model. Or you can predict whether a code change is actually fixing a bug, which may be interesting if you want to identify specific code changes to apply somewhere else. Or you may want to predict whether a code change is introducing a bug. Of course, no code change introduces bugs on purpose, but developers sometimes do so by accident, and if you had a way to predict whether a code change introduces a bug, then you could allocate more testing or more quality assurance resources to that specific code change. All of this relies on some way to reason about code changes, and how to do this using a hierarchical neural network is what we want to look at in this part of the course.

As a concrete example, let's have a look at a code change in the Linux kernel, which was committed by someone a couple of days ago. What you see here is basically a representation of this code change as you would see it on GitHub. The code change essentially consists of two things. One is the commit message, which is a description of the code change given by the developer. In the approach we are talking about here, what is actually used is just the very first line of this commit message. The other is the actual code change, that is, the pieces of code that have changed in this commit. As you see here, this may span multiple files; in this case, it's just one file. In each of these files, you then have a set of changes, each of which consists of some lines that are removed and some lines that are added. Sometimes there are only removed lines, sometimes only added lines, but in general it's a mix of added and removed lines. One more piece of terminology: each consecutive block of added and removed lines, like this one here and the two further down, is called a hunk. So in this example, we have three hunks, one for each of these three changes.
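To make this terminology concrete, here is a minimal sketch of how one might split the diff of a single file into hunks of added and removed lines. The unified diff conventions (the @@ hunk headers, the + and - prefixes) are standard, but the function and its data layout are just an illustration, not part of CC2Vec itself.

```python
# Minimal sketch: split a unified diff for one file into hunks.
# Each hunk is represented here as a pair (added_lines, removed_lines).

def parse_hunks(diff_text):
    hunks = []
    added, removed = [], []
    for line in diff_text.splitlines():
        if line.startswith("@@"):  # a hunk header starts a new hunk
            if added or removed:
                hunks.append((added, removed))
            added, removed = [], []
        elif line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    if added or removed:
        hunks.append((added, removed))
    return hunks
```

Run on the kernel diff from the slide, this would return three pairs, one per hunk, each holding that hunk's added and removed lines.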
Let's now have a look at how a neural network can make sense of these code changes and reason about all these different pieces of information. The approach we are talking about here is called CC2Vec, which stands for "code change to vector", and I'll start by giving an overview of what this approach looks like.

The input, or the entity we are reasoning about here, is a commit, and as we've seen, a commit essentially consists of two parts. One is what we'll use as the input to the model, and that's the actual code change. The other is what we'll use as the output of the model, and that's the words in the commit message. Using part of the data we have anyway, in this case the words in the commit message, as the output of a model is a trick that is played quite often in these applications of neural networks to code: something we get for free, here the commit message associated with a code change, serves as training data, which helps us find a representation for what we use as the input, in this case the code change.

Now, as we've seen, a code change may affect multiple files. So there may be one file here and then another file there, and so on, for some set of k files, and for each of these files we will have a hierarchical attention network, which is one kind of hierarchical neural network, as discussed in this module of the course. How exactly this hierarchical attention network works, we'll see in a second. What it gives us is a vector representation of the specific file; so this is basically a vector representation of file one. We then do the same for the other files, which gives us a vector representation for each file, and all of these vector representations are concatenated into one big vector representation of the entire code change.

Given this vector representation of the entire code change, what the model does is predict which words occur in the commit message corresponding to this code change. This is done using a feed-forward, fully connected neural network. So again we have some layers which at the end predict a vector, and what we have in this vector is basically all the words that may occur in a commit message. What we get out of this feed-forward network, let me just write this down, is this word vector; the feed-forward network itself is just like the ones we've already seen a couple of times in this course. The word vector can be interpreted as a probability vector that tells us, for every word that might occur in a commit message, how likely it is that this word actually occurs in the commit message of the given code change. So essentially, these are the probabilities of words occurring in the commit message.
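To make this output side concrete, here is a minimal PyTorch sketch of such a prediction head. The layer sizes and the class name are illustrative assumptions; only the overall shape, a fully connected network that ends in one probability per vocabulary word, follows the description above.

```python
import torch
import torch.nn as nn

# Sketch of the prediction head: it maps the concatenated code-change
# vector to one probability per word in the commit-message vocabulary.
# All sizes are illustrative guesses, not taken from the paper.
class WordPredictionHead(nn.Module):
    def __init__(self, change_dim=512, hidden_dim=256, vocab_size=10000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(change_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, change_vector):
        # Sigmoid rather than softmax: each word can independently
        # occur (or not occur) in the message, a multi-label setting.
        return torch.sigmoid(self.net(change_vector))
```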
Before looking further into the details of this neural network, let's have a look at the data that is actually extracted, and how. As we've seen, every code change consists of a set of affected files. It may be just a single file, but in general there may be more. For each of these files, the approach extracts the different hunks affected by the code change, where, just as a reminder, a hunk is a consecutive sequence of lines of modified code. You may have multiple of these hunks in a file, meaning that different code locations in the file are changed. Each of these hunks can be seen as a set of added lines and a set of removed lines, and this is also the way the CC2Vec approach looks at hunks. Finally, each of these lines can be seen as a sequence of code tokens, by simply taking all the tokens in the line that gets removed or added.

By breaking the data down in this way, we get a hierarchy of different things we want to reason about: we want to reason about individual code tokens and summarize them into a line; then reason about the different hunks, which consist of lines, and summarize all the information of a hunk into one vector; do the same for all the hunks in an affected file; and finally summarize everything once more into a single representation of the code change. As you can see, this really is a hierarchy, and that's why a hierarchical neural network is a good choice for this kind of problem.

So now we know what data is extracted from these code changes. Let's look at how we can make use of this data with a hierarchical model. As we've seen, a file that gets changed has multiple hunks, each of these hunks has multiple changed lines, and each line consists of multiple tokens. So let's look at the input with this structure in mind. We see one token here, and then some more tokens, say up to token k, and together these form one line of our code change. If you have multiple of these lines, then this set of lines is one hunk, and if you have multiple of these hunks, then these hunks together correspond to all the changes of one file.

This hierarchical structure of the input data is mirrored in the hierarchical model that reasons about these different parts of the input. Specifically, we'll have different sub-models that summarize the different layers of this hierarchy. To summarize the tokens that occur in a line, there's an RNN that takes the sequence of tokens and summarizes it into a vector, called the line vector, because it contains all the important information in this specific line of code. Given multiple of these line vectors (from this line here we get another line vector from the same RNN), the different lines that occur in a hunk are passed to another RNN. This happens to be the same kind of network, but it is a different model, with different weights, trained to do a different job. This second RNN again produces a single vector, which summarizes everything that happens in the hunk, so we call it the hunk vector. And because there may be multiple hunks in a given file, we get multiple of these hunk vectors; there will be a different hunk vector for the hunk further down.
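As a sketch of these first two levels, the following PyTorch module encodes one hunk: a token-level GRU produces one vector per line, and a line-level GRU summarizes those line vectors into a hunk vector. Each level uses a simple attention pooling of the kind explained in a moment. The choice of GRU, the dimensions, the class names, and the exact attention formulation are my simplifying assumptions, not the paper's precise architecture.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learn a weight per sequence element and return the weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, seq):                   # seq: (batch, steps, dim)
        weights = torch.softmax(self.score(seq), dim=1)
        return (weights * seq).sum(dim=1)     # (batch, dim)

class HunkEncoder(nn.Module):
    """Tokens -> line vectors -> one hunk vector (illustrative sizes)."""
    def __init__(self, vocab_size=10000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.token_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.token_pool = AttentionPool(hid_dim)
        self.line_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.line_pool = AttentionPool(hid_dim)

    def forward(self, hunk):                  # hunk: (lines, tokens) token ids
        states, _ = self.token_rnn(self.embed(hunk))
        line_vecs = self.token_pool(states)   # one vector per line
        states, _ = self.line_rnn(line_vecs.unsqueeze(0))
        return self.line_pool(states).squeeze(0)  # the hunk vector
```

The same pattern repeats one level up: feed the hunk vectors of a file into yet another recurrent network to obtain the file vector described next.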
These different hunk vectors are now again summarized, by feeding all of them into yet another recurrent neural network, which outputs a single vector that summarizes all the changes in this file. This is called the file vector. In the model, all of this happens twice for every changed file: once for all the lines that are added by the change, and once for all the lines that are removed by the change. So basically, this whole computation is done twice, and at the end we get one file vector for the added code and one file vector for the removed code.

One thing that is special about these RNNs, and different from the basic RNN we've seen in an earlier module of this course, is that they also make use of a so-called attention layer. They essentially learn to pay more attention to particular parts of the input. For example, the token-level RNN that takes these different tokens may have to pay more attention to, say, the first token than to the second, because that token is more relevant for understanding what the code change is about. Which elements of the input sequence to pay attention to is something these attention-based RNNs learn automatically. There's basically another weight matrix that computes a weight for every element of the input sequence, and the vector produced as a result, in this case the line vector, is a weighted sum reflecting the influence that the different tokens in the input actually have. This way, more attention is paid to the more important tokens, and the line vector summarizes the relevant information while ignoring the rest. This is called an attention-based model; how exactly it works is beyond the scope of this part of the course, but if you're interested, there's a detailed description in the paper we're talking about here.

What this hierarchical model gives us, then, is one vector representation for the code that is added in a file change and one vector representation for the code that is removed. On this slide, I call these two vectors EA, for all the added code, and ER, for all the removed code. What we want to do next is focus the model's attention really on the changes in this file, and the way this is done in CC2Vec is with a set of comparison functions that take the two vectors EA and ER and compare them in different ways. We won't go into all the details of these comparison functions; there are five in total, described in the paper. Just to give you an intuition, one of them is an element-wise subtraction: we compute EA minus ER and then use whatever comes out of the subtraction as a representation of the change. At the end, we get one vector that summarizes all the changes in this file, by capturing what is added, what is removed, and how the two relate to each other.
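Here is a small sketch of what such comparison functions might look like. Only the element-wise subtraction is spelled out above; the other two operations shown are plausible stand-ins of my own choosing, not necessarily among the five functions the paper actually uses.

```python
import torch

def compare_changes(e_a, e_r):
    """Combine the added-code vector e_a and the removed-code vector e_r
    into one representation of the change in a file."""
    features = [
        e_a - e_r,             # element-wise subtraction (confirmed above)
        e_a * e_r,             # element-wise product (assumed example)
        torch.abs(e_a - e_r),  # absolute difference (assumed example)
    ]
    return torch.cat(features, dim=-1)  # one vector per changed file
```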
What remains to be done is to combine all this information and use it for the final prediction, and this is what we look at now. What we are given here is one vector for every file that is changed as part of the code change. So this would be the vector for all the changes in file one, and there are more of these for the other files that were also changed. What the model does is simply concatenate these vectors. This is possible because usually not that many files get changed at once. If you had hundreds of files here, you would of course need to think about something else, but in practice most changes affect just a few files, very often just a single file, in which case the concatenation doesn't really do anything. What we get out of this is one vector representation that summarizes everything we know about this specific code change, which is then given to a feed-forward network that may have a couple of layers. This is very similar to what we've seen in the TypeWriter model. This feed-forward network eventually predicts the output we've already seen a little earlier: the vector that contains one element for every word that might possibly occur in a commit message. This is the word vector, and essentially it has one element for every word in our vocabulary. Of course, we only consider the most common words, because otherwise this vector would be very, very long. So this is one element per possible word in a commit message.

What the model finally predicts, then, is which words might be in the commit message, and to do this, it needs to learn how to summarize everything it sees in the code change so that it can make a good prediction. For example, if the commit message contains a word like "fix", then the model needs some understanding of the fact that the change actually is a bug fix, because otherwise this word would probably not show up in the commit message.

Okay, so now you know what the architecture looks like. Let's have a look at how this model actually gets trained. As usual, for training we need some data, and in this case the data is gathered from the version control system of some project, or in reality from the version control systems of many projects, where we get the history of all the commits that have happened in each project. Each of these commits is represented as a pair of a code change, basically the actual changed code and tokens, and a commit message, which are used to generate the input and the expected output for the model. The paper describes a detailed evaluation that looks at tens of thousands of such pairs, and these pairs are used to train the entire model. As I explained in part one of this module, the entire model, so all the sub-models in it, is trained jointly on these end-to-end input-output pairs, such that all the weights and biases in the individual neural networks are optimized for the ultimate task of predicting which words appear in a commit message.

Now, the main purpose of this approach is not just to learn to predict the words in a commit message. By itself, that is not super useful, because it only tells you which words exist in the commit message, but not, for example, in what order they should occur. The actual purpose of the model is to use the vector representation that the model is forced to pass all this information through as an embedding of code changes. So basically, this one vector representation that we have at the very end, just before the feed-forward neural network, is what all of this is about, because it represents, in just one relatively short vector, the content of the code change, and this vector representation can then be used for other downstream applications.
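Before moving on to those downstream applications, here is a minimal sketch of the training loop just described. The target for each commit is a multi-hot vector marking which vocabulary words appear in its commit message, and binary cross-entropy is a natural loss for this multi-label setting; the model interface, the optimizer settings, and the data handling are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the training loop. `model` stands for the full CC2Vec
# network (hierarchical encoders plus prediction head) and `pairs` for
# the (code_change, commit_words) examples mined from version control;
# both are assumed to exist and are not spelled out here.
def train(model, pairs, vocab, epochs=10, lr=1e-3):
    loss_fn = nn.BCELoss()  # the model outputs per-word probabilities
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for code_change, commit_words in pairs:
            target = torch.zeros(len(vocab))
            for word in commit_words:  # build the multi-hot target
                if word in vocab:
                    target[vocab[word]] = 1.0
            probs = model(code_change)  # probability per vocabulary word
            loss = loss_fn(probs, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```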
Just to make this very clear, let me go back to the previous slide one more time. This is the vector I'm talking about, because it summarizes basically everything we know about the code change, and it is what is then used as the embedding of the entire code change.

Now that we have these embeddings, so basically a vector representation of every code change, the question is: what can we do with them? In the paper, three applications are explored, and for each of them there's an evaluation showing that it actually works. We won't go into the details of all these results, so let me just summarize the applications quickly.

One of them is to actually predict the commit message of a code change. This does not work by just using the model directly, because, as I said, the model only predicts whether a specific word does or does not occur in the commit message, but it doesn't give you the words in the right order. So instead, what is done here is to take the vector representation you get for a code change and then search among existing code changes and their commit messages for the nearest neighbor of this code change. The underlying assumption is that among all these existing code changes, there is one with a very similar, maybe even the same, commit message as the one we want here. So we try to find this code change and then use its commit message as the commit message for the new code change.

The second application is about predicting whether a code change is actually fixing a bug. Why is this relevant? There are many possible reasons. One, and that's the motivation discussed in the paper, is that sometimes you want to backport bug fixes to other parts of your code base. For example, for the Linux kernel there are always older versions that are no longer extended with new features, but to which you still want to apply bug fixes, say when some security vulnerability got fixed. In order to decide which code changes to backport to these branches of old kernel versions, you first need to find out which code changes actually are bug fixes. In the Linux kernel, there are thousands of code changes, so doing this manually is bound to lead to mistakes, and developers are likely to miss some of these bug fixes. This is where the model comes into play, because it can automatically predict which code changes are bug fixes. The way this works is that you take the vector representation you get from CC2Vec and train another model that, based on this vector representation, tells you whether a change is a bug fix or not, using some already annotated code changes as ground truth for learning this second model.

Finally, the third application is what is called just-in-time defect prediction. Just-in-time means that the prediction happens right after someone commits a code change to a repository, and defect prediction means that the model predicts whether this code change might introduce a bug. Knowing when a bug is introduced into the code is of course super useful, because then you can allocate more quality assurance resources to this code change. For example, if you're in a larger organization where code reviews are done for every code change, you may want to add another code review, or maybe even two, or run additional tests to make sure that any bug that is indeed in this code change is caught early on and does not end up in your code base.
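As a sketch of how the last two applications reuse the embedding: freeze the trained CC2Vec model, embed each change, and train a small classifier on labeled examples. Everything here, including the embed_change function and the simple logistic model, is illustrative; the paper's actual downstream models may well differ.

```python
import torch
import torch.nn as nn

# Sketch: bug-fix (or just-in-time defect) prediction on top of frozen
# CC2Vec embeddings. `embed_change` stands for the trained model's
# embedding function and `labeled` for (code_change, label) pairs with
# a 0/1 annotation; both are assumed, not defined here.
def train_classifier(embed_change, labeled, change_dim=512, epochs=10):
    clf = nn.Linear(change_dim, 1)  # a simple logistic-regression model
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(clf.parameters())
    for _ in range(epochs):
        for code_change, label in labeled:
            with torch.no_grad():  # the embedding stays frozen
                e = embed_change(code_change)
            loss = loss_fn(clf(e), torch.tensor([float(label)]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return clf
```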
All right, and this is already it for this third and last part of this module on hierarchical neural networks. We've now seen what hierarchical neural networks are: they are used for inputs that can be decomposed into different parts, where you may want to use a different kind of neural network for each of these parts of the input. And we've seen two applications that use hierarchical neural networks for analyzing software, one on type prediction and now this one on reasoning about code changes. That's all for today. Thank you very much for listening, and see you next time.