Hi, welcome back to Analyzing Software Using Deep Learning. This is a course at University of Stuttgart in summer 2020. What we'll do in this module is to look at one specific way of building neural networks, namely hierarchical neural networks, and how to use the models that you can build this way for analyzing software. In particular, we'll look into two applications. One of them is predicting types in programs written in a language where you may not have type annotations for all program elements. And the other one is reasoning about code changes, so that you can make useful predictions about code changes in a version history. So here's an overview of this module. We'll start by looking into hierarchical neural networks in general. So what are these networks, and why and when do you want to use them? Then we'll look at the first application, which is about type prediction. And as usual, this and also the second application are based on recent research. So both of them are based on papers that were written just this year or that will appear this year. And if you are interested in more details, of course, you're welcome to have a look at these papers. All right, so let's start with a motivation. Why do you want to use these hierarchical neural networks? They are basically useful if you have an input on which you want to make some predictions, and this input is structured into multiple parts. There may be many reasons why this is the case. We look into some examples in a second. But these hierarchical neural networks are particularly useful if these multiple parts are basically too many to simply concatenate. So if you would just feed everything into one big feed forward neural network, for example, this neural network would be very slow and not very useful.
Also, these multiple parts of your input are not just a sequence, because if they were just a sequence, you could use the recurrent neural networks that we have looked at in the previous module. And also, they may have different structures. So each of these parts may have a different structure and may require a different kind of neural network to actually reason about this part of the input. So just having one big model that reasons about all of these parts in the same way is probably not the best option. Let's have a look at some examples where this situation occurs and where you may want to use a hierarchical neural network. So one kind of example is if you want to reason about any kind of document. So let's say a document could be a research paper or maybe a term paper that you're submitting for some course. And each of these documents consists of multiple parts or multiple kinds of inputs that are embedded in this document. For example, you have lines that consist of multiple words. You may have some images in there or you may have some plots that show some kind of graph or some scientific result. And for each of these parts of this input, a different kind of neural network may be most suitable to reason about them. And one way of combining these different neural networks is using a hierarchical neural network. And then once you have this combination, you can use this entire network to reason about these documents. For example, if these documents are term papers submitted to a course, you could use the representation that you get from the hierarchical neural network to maybe predict the grade that someone would get or predict whether someone should pass or fail based on the document that was submitted. Of course, that's not the way we actually do grading here, but this is just an example of what one could do. Another example that is more related to software, because that's the focus of this course, is that you may have some evidence of program crashes.
So say you have a lot of program crashes because you have lots of people using your software. Then for each of these program crashes, you may have some piece of evidence. And actually, you have not just one piece of evidence, but multiple pieces of evidence that come in different formats. For example, you may have a stack trace, which is basically the functions that have been on the stack when the program crashed. That's one piece of evidence. You may also have some kind of error message. So some text that tells you what went wrong, and this text may be in natural language or a mix of natural language and something that programs typically emit and only programs can understand. And you may also have some information about the user on whose device this program has crashed. And this may, for example, come in the form of key-value pairs that tell you things like what the operating system was, when the last update of the app that has crashed happened, and so on. What's important here is that each of these parts of the input, so each of these pieces of evidence about the program crash, may again require a different kind of neural network to reason about. But at the end, you may want to make one prediction for the entire program crash. So for all of this, you basically want to have one combined representation, and then your neural network should be able to make one prediction. For example, you may want to predict what parts of the code base a developer should fix in order to fix the program crash, or which team of developers should actually take care of this crash. Let's look at two more examples. And actually, these two examples that we look at now are also what we look at later in this module. So one more example is if you want to reason about parts of your program, so program elements that may have a type. For example, a function typically has parameter types and return types.
Now, if you want to predict these types because your language does not require developers to write these types down when they write a program, you may have different pieces of information. For example, there may be some code tokens that are related to this type. There may be some identifier names, like the name of a function, that tell you something about the type that this function may have. And you may also have some other information, like natural language information that comes in the form of comments associated with a function or some other program element. All of this, again, is useful information, and all of this should be combined in some way so that at the end you can make one prediction about the type of a program element. And again, for each of these parts, you may want to use a different neural network or a slightly different way of looking at the input in order to make best use of this data. And again, hierarchical neural networks can be very helpful for this. Another example, and that's the last one in this list, and again an example that we look at in more detail later in this module, is if you want to reason about commits to a code repository. So for each of these commits, you have at least two pieces of information. You have the actual code change. So basically what lines of code were changed and in what files these changes were made. And you typically also have a commit message. So a natural language sequence of words that a developer has written to describe what this code change is actually doing. And again, these are different kinds of information. So you may want to reason about them using different kinds of neural networks, but at the end you want to combine everything. And this is something you can do very well using a hierarchical neural network. Now, before looking in more detail into hierarchical neural networks, let's consider a more naive approach that basically serves as a baseline and motivates the use of hierarchical neural networks.
So for this naive approach, let's get back to the first example that I've mentioned, where we wanted to reason about documents and we said these documents are composed of some text, some images and maybe some plots. And now let's suppose that at the end we want to make a prediction that tells us whether the document, which may be a term paper, for example, should lead to the student failing or passing the course. Now in this naive approach, let's assume we have some way of representing each of the words in this text as a vector. So we have all these little vectors here that each represent one word. And in a similar way, we can represent our images as let's say a sequence of pixels. So we basically get one long vector that looks like this and this represents all the pixels that we have in an image. And in a similar way for the plots, let's assume they also come just in the form of an image. So again, we just have one long vector that contains all these pixels of the plot. And now what we would do in this naive approach is to basically concatenate all of this. So first we would have a concatenation of the word vectors which would give us one long vector that represents the entire text. And then we would concatenate all of this with the pixels that come from the image and the pixels that come from the plot which at the end gives us one really huge vector. And now this really huge vector we could feed into, for example, a feed forward neural network. Let me just add some annotations here. So this would be a huge vector that summarizes all the information. And then we have a feed forward neural network for simplicity, let's just say we have just two layers here. And then at the end, this outputs a vector of size one which is the probability that this document corresponds to a student who passes or fails the course. So this is just a single number which we here interpret as a probability, for example. Now what is the problem with this naive approach? 
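To make the scale of this concrete, here is a minimal numpy sketch of the naive pipeline. All sizes here, 200 words with 64-dimensional word vectors and 100 by 100 images, are made-up illustration values, not numbers from the lecture; the point is only how large the concatenated input vector becomes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the parts of one document (sizes are made up for illustration).
word_vectors = rng.normal(size=(200, 64))    # 200 words, one 64-dim vector per word
image_pixels = rng.normal(size=100 * 100)    # one 100x100 image, flattened to pixels
plot_pixels  = rng.normal(size=100 * 100)    # one plot, also treated as raw pixels

# Naive approach: concatenate everything into one really huge input vector.
huge_input = np.concatenate([word_vectors.ravel(), image_pixels, plot_pixels])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A two-layer feed forward network on top of the huge vector.
W1 = rng.normal(size=(32, huge_input.size)) * 0.01   # first layer: 32 * 32800 weights
W2 = rng.normal(size=(1, 32)) * 0.01                 # second layer: output of size one
hidden = np.tanh(W1 @ huge_input)
p_pass = sigmoid(W2 @ hidden)[0]   # single number, interpreted as P(student passes)

print(huge_input.size)   # 200*64 + 10000 + 10000 = 32800 input dimensions
```

Even with these modest toy sizes, the first layer alone already needs over a million weights, which is exactly the slowness problem discussed next.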
Well, the main problem is basically the size of this huge input vector. So because this input vector is so huge, our feed forward neural network will be very slow. And another disadvantage is that we do not really make use of the structural information that we have about the input. We know that the text consists of words and that these words are a sequence, and that we maybe should use an RNN, but we don't; we just concatenate all these words. And similarly, we may know that a plot is not just pixels, but actually has different parts: there may be a curve, a legend, labels on the axes of the plot. And all of these different parts may actually be used in a clever way to make better sense of this image that represents the plot, but we don't do any of this here because we just concatenate all the pixels and use them in the most naive way possible. So what we have is not just a very slow network, but also a network that will not give very accurate predictions, simply because we do not use the structure that the input has. So now instead of using this naive approach, the idea of hierarchical neural networks is to build a neural model that is composed of different sub-models. So essentially we are decomposing the problem of reasoning about all of this input into different sub-problems where we say, okay, there's one model that reasons about this part of the input, another model that reasons about that part of the input, and then all of this should be combined at the end into one prediction. The way this decomposition typically works, or at least works in a hierarchical neural network, is that we arrange all these sub-models in a hierarchy. So for example, you can think of this hierarchy as a tree where all the inputs are at the leaves, and then this tree or these inputs are combined piece by piece until at the end, we reach the root of this tree where we have basically everything combined into just one vector.
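This tree-of-sub-models idea can be sketched as follows, again in numpy with made-up sizes. The two summarizer functions are hypothetical placeholders of my own: a real system would use a learned RNN for the text and a CNN for the image, but the shape of the computation, where each leaf produces a short summary vector and the summaries are combined at the root, is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of toy inputs as in the naive sketch.
word_vectors = rng.normal(size=(200, 64))   # 200 words, 64-dim vector per word
image_pixels = rng.normal(size=100 * 100)   # one flattened 100x100 image

def summarize_text(words):
    # Placeholder for an RNN sub-model: a simple running update over the
    # word sequence. A real model would use a learned recurrent cell here.
    W = rng.normal(size=(16, words.shape[1])) * 0.1
    h = np.zeros(16)
    for w in words:
        h = np.tanh(W @ w + h)
    return h                                # one short vector for the whole text

def summarize_image(pixels):
    # Placeholder for a CNN sub-model: a learned projection down to a short vector.
    W = rng.normal(size=(16, pixels.size)) * 0.01
    return np.tanh(W @ pixels)

# Leaves of the hierarchy: one specialized sub-model per input part.
text_summary  = summarize_text(word_vectors)
image_summary = summarize_image(image_pixels)

# Root of the hierarchy: combine the short summaries, e.g. by concatenation,
# and feed the result to whatever model makes the final prediction.
root = np.concatenate([text_summary, image_summary])
print(root.size)   # 32 dimensions at the root, instead of tens of thousands
```

Compared with the naive huge vector, the model making the final prediction now only sees a 32-dimensional summary, and each part of the input was handled by a network suited to its structure.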
So in a sense, we are still combining everything into one vector as in the naive approach, but now this vector will not just be the huge concatenation of all the inputs that we have. Instead, we will reason about different parts of the input individually and then get a vector representation of a manageable size. And then once we have this vector representation, we can eventually make our prediction based on all the summarized information that arrives at the root of our hierarchical neural network. So let me give you some more information about these sub-models. So the idea of these hierarchical neural networks is to decompose a larger model into different sub-models, and each of these sub-models essentially encodes one part of the input. Now the cool thing about this is that we can have different kinds of sub-models that are well-suited for the different kinds of inputs that we have. For example, if we have some input for which a simple feed forward neural network is just a good way or the best way to handle it, then let's just use that. But if we have some input that is naturally a sequence, for example, a sequence of words, then we may want to use a recurrent neural network to reason about this part of the input. So basically for every part of the input, we can have a different kind of neural network which just summarizes the information that is in this part of the input. And then at the end, all of this gets composed further so that at the end of this hierarchical neural network, all the information is summarized. So let's get back to the example that we just looked at, this prediction or this classification of documents, but now let's use a hierarchical neural network. So this will be a hierarchical model to classify documents. So again, the input and the output are the same as before. So we have some text that we find in these documents. We may have some images and we may also have some plots.
And at the end, again, we want to make a prediction whether this document corresponds to a term paper, for example, for which we want to pass or fail the student. Now, what's different is the way we use these different parts of the input. So again, we know that a text consists of multiple words that we can each represent as a vector. But now instead of just concatenating these words, let's assume we feed all of those into a recurrent neural network, which we have seen in the previous module. And then what this recurrent neural network will give us is one vector that summarizes all the text or all the words that are in this text. So using an RNN here has two advantages. One is that the RNN gives us a shorter vector. So this is a relatively short vector that summarizes all the text. And the other advantage is that the recurrent neural network, as we've seen in the last module, is able to forget about some words in this text but pay attention to some other words. So it's essentially taking the important bits out of this text and putting only those important bits into the summary vector. Now, for the images that we have in our document, we will do something different. And the reason is that images are just not text, so it makes sense to use a different kind of neural network. So again, let's say we have all the pixels in this image, but now instead of just taking them as they are and looking at them as a huge vector, we will use a convolutional neural network, which is a kind of neural network that we will look at in some more detail in one of the later modules. Essentially, what it does is to recognize some of the different elements that are in this image. For example, it may be able to see that the image is composed of different parts and then combine this information at the end into one vector. How exactly this works doesn't really matter for now. What matters is that there is a specialized kind of network for this kind of input.
So then what we get here is a short vector that represents the content of this image. And now for the plots, we can essentially do something very similar. So we again specialize and come up with some sub-model that is good at reasoning about these plots. And for example, we could say that we again have this sequence of pixels that are in this plot, but in addition, we maybe have the caption of the plot, which actually is a sequence of words. And for these pixels that correspond to the actual figure of the plot, let's say we again use convolutional neural networks, but for these words in the caption of the plot, we actually want to use a recurrent neural network. And both of them will give us again some relatively short vector. And now what we can do, having all these short summaries of the different parts of this input, is to combine these parts into one longer vector. And one simple way of doing this combination is to actually do a concatenation, so to just concatenate them. So basically this vector that we have here at the end will be composed of the different vectors. So this will be the blue part, this will be the green part. Then we have this part here and maybe that one here. So this is just a concatenation. And then this is used to make our final prediction, which again at the end should just be a vector of size one, which we will interpret as the probability that the student should pass or fail the course. And to do this, we may again just have a feedforward neural network that takes this colorful input vector that summarizes all the input and then makes a decision based on this summary. And what's important here is that this colorful vector in the middle is really a summary of all the input. One interesting question is how we can actually train such a hierarchical neural network. Just as a reminder, training essentially means that we take a set of input-output examples.
So for example, a set of documents and corresponding booleans that tell us whether the student passes or fails the course. And then we use these input-output examples to adapt the weights that control how our neural network computes the output based on the input. Now the question is how to actually do this training, and there are two interesting options to consider. So option one is that we could train each of these sub-models separately. So for the example of classifying documents, we could train one sub-model to summarize the text, one sub-model to summarize images, and so on. The advantage of doing this separate training would be that the training could focus on each specific model and each specific kind of input that this model receives. The two big disadvantages are the following. One is that we would need training data for each of these sub-models. Basically, we would need to have some data that tells us how to summarize the text in a document and some other data that tells us how to summarize the images that may occur in documents. But what we actually have is only training data for the entire document, which tells us whether the student should pass or fail. The other disadvantage of this separate training approach is that the sub-models are not aware of what the actual task is that we want to achieve. So in our case, the task is to predict whether the student who has submitted a document should pass or fail. But if our task were something else, like for example, finding out whether the text is written in English or French or German, then maybe we would use a different kind of model and our training would actually look very different. So only by being aware of the final task that the entire model is supposed to solve can we effectively train the sub-models to help with this overall task.
The second option for training these hierarchical neural networks, and this is the option that is actually used in practice when training this kind of model, is to train the entire model jointly. So what this essentially means is that all the weights of all the sub-models are adapted together, guided by the input-output examples that we have for the overall task. So this is different from optimizing one sub-model at a time, because now we optimize everything together such that overall the accuracy of the model is as good as we can get it. This joint training has a couple of interesting advantages. So one is that we do not need training data for all the individual sub-models; we just need training data for the overall task. So for our example of deciding whether a document submitted by a student should lead to the student passing or failing a course, we just need these end-to-end examples of documents and pass-fail values, and then we can train all the sub-models based on this data. The other big advantage is that the sub-models get optimized for the overall task, which means that we do not just optimize a sub-model to summarize text, but we optimize it to summarize text for the task of predicting whether this text is part of a passing or failing student's document. One potential disadvantage, which however in practice doesn't really matter that much, fortunately, is that if you have a large hierarchical model, it may be that the feedback that you have for the final prediction gets lost on the way while propagating it back through the network. This is called the vanishing gradient problem, and you can look up more information on this if you're interested, but in practice, fortunately, at least for the models that we are discussing here, this isn't really a problem. Let me illustrate this joint training with a little example.
So for this example, let's assume we have some input given that again consists of multiple parts. So let's say we have input part one and input part two, and again, you can map this to the example that we've seen earlier or to any other example. And at the end, we want to make some prediction, and again, this could, for example, be classifying whether the input belongs to one class or another. Now, for each of these inputs, we have some sub-model, which I'm leaving kind of abstract here. So I'm not saying what exactly it is. It could be an RNN, it could be a CNN, it could be a feed forward network, or any other kind of model that is suitable for the specific part of the input. And then all of this gets combined, for example concatenated, at some point, and then we have some model that makes the actual prediction. In practice, of course, we could have more of these hierarchical steps. So here we have just one, but in practice, there could be even more. Now, what's important for the joint training is the following. Each of these sub-models is controlled by some set of weights and biases. So for example, here we may have a matrix of weights that we may call W1. Here we have a matrix of weights called W2, and here we may have a matrix of weights called W3. Of course, in practice, there may be more than one matrix of weights in each of these sub-models, but to simplify things, let's just assume that there's one. And now what we do when we train this jointly is that we have some input-output examples here. So let's say we have some input one and some output one, and then input two and output two, and we give all of these inputs to the network. And once it has seen these inputs and once it has received some feedback in the form of a loss function, as we have explained in the first module of this course, we will be able to back propagate this feedback through the modules, and then update all of these weight matrices together.
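This joint update can be sketched in numpy, assuming for simplicity that all three sub-models are single linear layers so the backward pass fits in a few lines. The names W1, W2, W3 match the matrices from the example; all sizes and values are made up, and a real implementation would use a framework's automatic differentiation instead of these hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two input parts and one binary label (made-up sizes).
x1, x2, y = rng.normal(size=8), rng.normal(size=12), 1.0

W1 = rng.normal(size=(4, 8)) * 0.1    # sub-model for input part one
W2 = rng.normal(size=(4, 12)) * 0.1   # sub-model for input part two
W3 = rng.normal(size=(1, 8)) * 0.1    # final model on the concatenation

def forward():
    h1, h2 = W1 @ x1, W2 @ x2                    # each sub-model summarizes its part
    h = np.concatenate([h1, h2])                  # combine at the root
    p = 1.0 / (1.0 + np.exp(-(W3 @ h)[0]))        # predicted probability
    return h, p

def loss(p):                                      # binary cross-entropy
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

h, p = forward()
before = loss(p)

# Backpropagate: the single loss signal flows back and yields gradients
# for ALL weight matrices, which are then updated together.
dz  = p - y                       # gradient of the loss w.r.t. the logit
dW3 = dz * h[None, :]
dh  = dz * W3[0]                  # feedback arriving at the concatenation
dW1 = np.outer(dh[:4], x1)        # ... flowing into sub-model one
dW2 = np.outer(dh[4:], x2)        # ... and into sub-model two

lr = 0.1
W1 -= lr * dW1; W2 -= lr * dW2; W3 -= lr * dW3

after = loss(forward()[1])
print(before > after)   # the joint update lowered the loss
```

Note that no sub-model was ever trained on its own data: the only supervision is the label y for the overall prediction, and backpropagation distributes that one feedback signal across W1, W2, and W3.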
So we are updating all of them in order to have better weights, so that the next time we get a set of inputs and produce a set of outputs, we actually get closer to the expected prediction. All right, so this is already all I want to say about hierarchical neural networks in general. So this is the end of the first part in this module on how to use hierarchical neural networks for analyzing software. I hope you now have an idea of when and why to use such a model, and also an idea of how it gets trained. And then in the later parts of this module, we will see two applications: using hierarchical neural networks for type prediction and for reasoning about code changes. Thank you very much for listening and see you next time.