Hi, welcome back to Analyzing Software Using Deep Learning, a course at the University of Stuttgart. In this part of the course, we will look into how to use convolutional neural networks for classifying programs, or more generally, for reasoning about programs. Convolutional neural networks have been quite successful in other areas where you have to reason about some kind of hierarchical data. And because programs are naturally represented as hierarchical data as well, it makes a lot of sense to also apply convolutional neural networks to programs. In this lecture, we will see how to actually do this. As for most of these lectures, I'll start by giving an overview of the kind of neural network architecture we want to talk about here, which in this part of the lecture is convolutional networks. I'll introduce what they actually are, why they make sense, and what their properties are. Then, in the second part of this lecture, we look into one specific application of convolutional neural networks for program classification, which is based on a paper that was published a couple of years ago and has had some impact since then. As usual, if you want to know more about this specific approach, feel free to look into the paper, because there are many details that we cannot cover here but that are covered in the corresponding paper. Let's start with some historical background. As with many interesting results in neural networks, the idea of convolutional neural networks was biologically inspired. Specifically, the inspiration came from experiments done on animals to find out how animals see and how their brains process what they see. In this set of experiments, which were done with cats and monkeys, the researchers found that some neurons in the visual cortex of these animals, the part of the brain responsible for seeing, correspond to small regions of the visual field. Specifically, it turned out that there are two kinds of visual cell types. On the one hand, there are what they call simple cells, which respond to specific parts of the visual field; for example, they respond to straight edges with particular orientations that appear in the animal's visual field. On the other hand, there are complex cells, which are not sensitive to these particular small features but are sensitive to larger receptive fields, and can therefore respond to larger objects or larger parts of the visual field. Interestingly, these complex cells are insensitive to the exact location of what they are seeing. They can see an object here or there, and no matter where the object is, the complex cells are able to recognize that it is this kind of object. These findings in neuroscience have inspired artificial neural networks that do something similar. Initially, this work was proposed for image recognition, so we'll first talk a little bit about how it generally works on images and then see how it applies to software. Let's start with an intuition of how these convolutional networks work. The basic idea is that you have some input data and that this input data is hierarchically organized. Now, what does this mean?
On an image such as the one you see here, for example, there are different objects, and they may be nested into each other. On this image, you see a house; the house has windows; and maybe the windows have window shades. These different objects are organized in some kind of hierarchy. The same may be true of source code, where, for example, you may have a project that consists of different packages, each package consists of different files, and each file contains some code structure that may be represented as an AST. Before looking into code, let's have a look at how we can use this hierarchical organization on images. Essentially, there are three steps. The first step is to look at the raw data and identify some kind of primitive features. These primitive features could, for example, be oriented edges in the image. For our little example, this first step could recognize that there are some edges here that correspond to the roof of the house, some other edges that correspond to one of the windows, some edges down here that correspond to the car, and, overlapping with those, other edges that correspond to some of the wheels of the car. Given this, we can recognize some of the parts of the objects in the picture, and this is the next step, where a model recognizes object parts. For example, we may be able to recognize that this picture contains some wheels, or that there are some windows here. As you see, this lifts the primitive features identified in the first step to a higher level of abstraction by combining different oriented edges into parts of an object. The last and final step is to look at another level in this hierarchy, where we can now identify entire objects based on the object parts identified in the previous step. For our little image, we may be able to identify objects such as a car or an entire house: these two wheels and some other object parts that I haven't detailed on this slide will be part of the car, and some other parts of the image correspond to the house. This final task is one of the tasks typically done in image processing, and it is called object recognition. Now, one way to do this hierarchical reasoning, and the way we focus on in this lecture, is to use a convolutional network. Similar to some other networks we've seen in previous lectures, this is a feedforward neural network: information moves from the input towards the output without any kind of recursion or loops. But in contrast to some of the networks we've seen previously, this network exploits the hierarchical structure of the input data by using specific connectivity patterns in the feedforward architecture, and we'll see how exactly this works. What is important here is that a convolutional neural network is not a fully connected network. Not all neurons in one layer are connected to all neurons in the next layer; instead, only some neurons are connected with each other, and we'll see why this makes sense.
In order to summarize specific parts of the input, and to do this again and again in a hierarchical manner, there is something called a convolution function, which is essentially a mathematical approximation of how a receptive field responds to stimuli, as observed in the animal experiments mentioned earlier. Basically, what the convolution function does is take some part of the input, for example some part of an image or some part of a program, and summarize it in a specific way. These convolutional networks have a lot of applications. We look a little bit into image and video recognition, because that's a very intuitive domain and also the domain in which these networks were originally proposed. They are also used in natural language processing, because natural language also has a hierarchy: a document may consist of sentences, which consist of words, which may be compound words, and so on. And then, of course, and this is the main focus of this course, they can be used for reasoning about programs, for example to classify programs or to do some other kind of reasoning on top of programs. To understand why we do not want a fully connected neural network, but instead use this idea of convolution, let's start by reasoning about the complexity of a fully connected neural network. The reasoning we'll do now is not specific to a particular input domain, so it's not specific to images, for example, but we'll use an image as the example. Suppose that we have some input of length n; for our example, this could be an image. We also have just one single output; for example, this could be some numeric value that predicts what is on the image, or maybe just a binary classification that tells us whether the image is, let's say, an image of a cat or not. Within the network, we have three hidden layers, so it's a moderately sized neural network, and these layers are fully connected: every neuron in one layer is connected to all the neurons in the next layer. Now, looking at this network, it looks as follows. Somewhere we have our input neurons, which I draw down here at the bottom; they go from 1 to n. Then we have the first hidden layer, and here let's make the simplifying assumption that the hidden layers are just as large as the input, so again n neurons. Because the network is fully connected, each input neuron is connected to all of the neurons in this first hidden layer. So what we have here is the input, and up here some fully connected layers, of which I've just drawn the first. We said we'll have three of those, so there are two more that look very similar and are again fully connected. And then we said we have a single output, which means there is one neuron up here that serves as our output, and in order to receive information that we can output at the end, it needs to be connected to the final layer.
And because it's a fully connected network, there is a connection from each neuron in the final hidden layer to this output. Now the question is: how many weights does this network have? That's a measure of how complex the network actually is. Let's make some assumptions, namely about the size of n. Say our input is an image, represented as pixels, and it's a very small image, or maybe a downscaled image, of just 32 times 32 pixels, so n = 1,024. Or it could be any other data that you can represent using 1,024 input values. Now, how many weights do we have in this network? Because it's fully connected, and because each connection is controlled by one weight, we have n times n, so n squared, weights connecting the input layer to the first hidden layer, another n squared connecting the first hidden layer to the second, yet another n squared for the second to the third, and finally another n connections from the last hidden layer to the output. If you do the math, that is 3n² + n = 3 · 1,048,576 + 1,024, which is roughly 3.1 million weights. As you can see, this is a surprisingly large number, given that the input is relatively small, our output is just a single neuron, and three hidden layers doesn't sound like much. But because this is a fully connected neural network, we have a lot of weights. In practice, each of these weights needs to be stored in memory, and each of them needs to be optimized individually, because we want to train the neural network on some data. So the takeaway is that a fully connected neural network involves quite some complexity, and the question is whether we can do better based on the knowledge that our input data is hierarchically structured. The answer is, of course, yes: the idea of convolutional neural networks is to reduce this complexity based on the idea of convolution. So what is the idea? The idea is to not have a fully connected computation graph like the one we've just seen, but instead to have two properties. One is that we want each input to only influence k other neurons. This limits the number of connections we need in our computation graph, because if every input influences at most k neurons, we only need k outgoing connections for each input. To illustrate this, let's assume that k equals 3 and that our network, or part of it, looks like this: we have five neurons in one layer and five neurons in another layer. This is the input here, and this is what we'll call the convolved output. Instead of connecting every input neuron to every neuron in the next layer, each input neuron influences only three of the neurons in the next layer. For example, the input neuron here in the middle influences just the three neurons indicated up here, and the same holds for all the others. This one influences only this one and this one, because there is no third neuron further to the right, so it's actually only two, a boundary effect. This one influences this, this, and this.
The one in the middle we've already covered, and on the right basically the same happens. What you see is that we do not have all the connections; for example, the input neuron on the right does not influence the leftmost neuron in the convolution layer. So this is one of the two ideas: it limits how many neurons are influenced by a specific input neuron. Conversely, what you also get from this idea of convolution is that each neuron in the convolution layer is activated by at most k neurons, so looking at it from the other side, there is also less complexity. Let's again illustrate this with a little picture. These are again the neurons in our convolution layer, and down here we have our input neurons. Looking at the same connections I've just drawn, but from the other perspective: how many neurons are responsible for activating a neuron in the convolution layer? If we look at this one here, we see that it is activated by just three neurons instead of all neurons from the input layer, and similarly for the others. This one up on the left is activated by only these two, similar here, and the same on the other side. Now, reducing the number of connections in our neural network is one part of the trick to reduce complexity through convolution. The other part of the trick is to not learn a separate weight for each of these connections, but to use a so-called kernel, a set of parameters or weights that controls each of these individual summarizations from one layer to the next. Let's again look at our little example, where we have this input I, an input vector, and what we get at the end is the convolution S. The idea is that we want to control how, for example, these three input neurons influence this one output neuron; this step of mapping the inputs at the bottom to the one neuron at the top is what is called convolution. It is controlled by the kernel, which is essentially a function over, in this case, three inputs that gives us one value, which is then stored in the upper pink neuron as its output. Let's call the kernel K; it works over a window of length k, and in our example, k is three. So our entire input goes from j = 0 to j = 4, but k is only three. Now let me also add the connections here; it's the same as what we had before. These three connections are what is controlled by the kernel, and by moving the kernel around, we can do the same for the other parts of the input. The trick is that we use exactly the same kernel for all the different parts of the input. So the convolution S at a given position is the dot product of a window of the input I with the kernel K; for our example with k = 3, S(i) is the sum over j from 0 to k − 1 of I(i + j − 1) times K(j). And when I say this, I'm ignoring the boundaries here: at the beginning and at the end of the input we need some special handling. Let me make a concrete example. Assume that our kernel is the vector (2, 5, 3); in practice, these numbers are learned, because they are the parameters or weights of the kernel. And assume that our full input vector is (3, 1, 2, 3, 4). Then, to compute the convolution at index 2, the one marked in pink at the top, we compute 1 · 2 + 2 · 5 + 3 · 3 = 21.
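To see this in code, here is a minimal sketch of such a one-dimensional convolution in Python. It is only meant as an illustration, assuming fixed kernel weights (in practice they are learned) and sliding the kernel only over positions where it fully fits:

```python
def conv1d(inputs, kernel):
    """Slide the kernel over the input; each window position
    yields one weighted sum (a dot product)."""
    k = len(kernel)
    # Only positions where the kernel fully fits; real
    # implementations handle the boundaries, e.g. via padding.
    return [
        sum(inputs[i + j] * kernel[j] for j in range(k))
        for i in range(len(inputs) - k + 1)
    ]

I = [3, 1, 2, 3, 4]  # the input vector from the example
K = [2, 5, 3]        # kernel weights; learned in practice

print(conv1d(I, K))  # [17, 21, 31]
```

Because this sketch starts each window at index i rather than centering it, the lecture's value at index 2 shows up as the middle output, 21, which is exactly the 1 · 2 + 2 · 5 + 3 · 3 we just computed. All three output values come from the same three kernel weights.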
So the kernel tells us how to summarize the three values in the window we look at into just one value, and then we move the same kernel to the left and to the right to do the same for the other windows that we can put over this input. Now I've shown you how to do this in a one-dimensional scenario, where the input is just a sequence of neurons. But the motivation came from images, so let's also have a look at how to do this in a two-dimensional scenario, which could, for example, be the pixels of an image. Suppose the image looks like this: we have just a few pixels for this example, and I call them A, B, C, and so on, to be able to refer to them. What we want to do is summarize these pixels into a smaller dimension. Let's say we want to convolve the input so that at the end we have a 2 × 3 matrix instead of the 3 × 4 matrix we had before. To do this, we can have a kernel, which is basically a function that tells us how to take four pixels and summarize them into a single value. Using the same terms as on the previous slide: down here is our input I, what we want to get at the end is a convolution called S, and we do this using a kernel called K. In this kernel, we have some weights or parameters; let's call them w, x, y, and z. These parameters tell us how to convolve four pixels at a time of the input image into our convolution matrix. For example, we can take the kernel and put it on top of these four pixels here, and it summarizes all of them into a single value at index (0, 0) of our convolution matrix. Then we move the window of pixels we look at, for example to those four, and this gives us another value, which is put here. Similarly, we do this for all possible windows of four pixels in the input matrix, until at the end we arrive here and have filled all six values of our convolution matrix. Now, how do we compute these values? The idea is the same as before: we multiply the input values with the kernel, just now in two dimensions. In general, to compute S(i, j), we sum over both dimensions of the kernel: S(i, j) is the sum over m and n of I(i + m, j + n) times K(m, n). For the concrete example, if we are interested in S at indices (0, 0), we take A · w + B · x + E · y + F · z; these four values added together give us S(0, 0). Similarly for the others: S(0, 1) is B · w + C · x + F · y + G · z, and finally S(1, 2) is G · w + H · x + K · y + L · z. The key idea is that each of these computations uses the same set of kernel parameters, so we do not have to learn that many parameters: we just need these four parameters to go over the entire image, one window at a time, and summarize each of these windows into a single value. So now you've seen this idea of convolution in a one-dimensional and in a two-dimensional scenario.
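For the two-dimensional case, a similar minimal sketch could look as follows. The numeric pixel values stand in for A through L and the kernel numbers stand in for w, x, y, z, all assumed purely for illustration:

```python
def conv2d(image, kernel):
    """Slide a 2D kernel over a 2D input: each window of
    pixels is summarized into one value, following
    S(i, j) = sum over m, n of I(i+m, j+n) * K(m, n)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 3x4 input, standing in for the pixels A..L on the slide
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
# 2x2 kernel, standing in for the weights w, x, y, z
kernel = [
    [1, 2],
    [3, 4],
]
print(conv2d(image, kernel))  # [[44, 54, 64], [84, 94, 104]]
```

Note how the 3 × 4 input becomes a 2 × 3 convolution matrix, and how all six output values are produced by reusing the same four kernel parameters.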
Let's now reason a little bit more about the properties of this idea of convolution that help when learning on hierarchical data. There are three of these properties, namely sparse interactions, parameter sharing, and equivariant representations, and we'll go through each of them now and see what they actually mean and why they are useful for learning about hierarchical data. The first property is that we have sparse interactions, which essentially means that we have fewer connections than we would have in a fully connected neural network. What does this mean for the training process? It means we have fewer weights that we need to store, and this not only saves memory but also saves time, because we have fewer weights to optimize. Now, the nice thing is that despite these sparse interactions, if you have a deep network, that is, many layers, then the neurons in the deeper layers may still be indirectly influenced by many of the inputs. This is an important property, because otherwise we would only get very local reasoning and no reasoning across different parts of the input, which we actually want to have; with a deep enough network, this is still possible. So if we have such a deep network, the neurons in the deeper layers, the layers further away from the input, may indirectly interact with many inputs. Let's illustrate this with a simple picture. Say we have five input neurons and not just one hidden layer, but two, and we have a convolution with k = 3, so three values of one layer are always summarized into a single value in the next layer. Then we summarize these three neurons into this one, these three into that one, and these three into this neuron here. In the next layer, we again have a convolution with k = 3 and summarize all of this information into the one neuron in the middle at the top. That means all of the input neurons down here have an influence on this one neuron at the top, because we have multiple layers. So we can still get non-local reasoning, reasoning that relates, say, a pixel in one corner of the image and a pixel in the other corner, despite the fact that the network is not fully connected. The second interesting property of convolutional neural networks is parameter sharing. What does this mean? Essentially, it means that the same parameters are used for multiple connections in the network, instead of having one parameter or weight for each individual connection. Let's contrast the convolutional setup with a fully connected network. In a fully connected network, the parameters we need to optimize are all the weights in the weight matrix, which typically means as many weights as there are connections in the network. For a convolutional network, we have fewer parameters to optimize, simply because we are reusing them: what we need to optimize are just the weights in the kernel matrix.
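To put rough numbers on this difference, here is a small back-of-the-envelope sketch, reusing the 1,024-value input from the earlier example. The single three-weight kernel is a simplifying assumption; real convolutional layers typically learn several kernels, but the parameter count still stays far below the fully connected one:

```python
n = 32 * 32  # 1,024 input values, as in the earlier example
k = 3        # kernel size

# Fully connected: one separate weight per connection
# between two layers of n neurons each.
fully_connected = n * n   # 1,048,576 weights per layer

# Convolutional with one shared kernel: the same k weights
# are reused at every position of the input.
convolutional = k         # 3 weights per layer

print(fully_connected)                   # 1048576
print(convolutional)                     # 3
print(fully_connected // convolutional)  # 349525, so ~350,000x fewer
```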
And because we move this window over the input and reuse the same kernel matrix over and over again, there are fewer parameters to optimize: we are essentially sharing them across different places in the input. So what this means is that in a fully connected network, each weight is used for exactly one connection, whereas in a convolutional network, the same weights in the kernel matrix are used for many parts of the input. The big difference here is one versus many. Let's look at this with a little picture again. If we have a few input neurons down here, a layer that convolves these inputs, and some kernel matrix, then this connection shares the same parameter with that connection, and that one, and that one. The same is of course true if k is larger than one: say k is three, then we also have connections like this, and they likewise share the same parameters, and the same for the third connection that each neuron has. The third interesting property of convolutional neural networks is that we get equivariant representations. What this essentially means is that if the input changes in some way, then the output changes in a corresponding way, independent of where exactly in the input the change happens. The change we see in the output for a given input change is the same no matter where the input is. To illustrate this with a little example, suppose we have some inputs down here and compute a convolution of them up here, again using our kernel with k = 3, so we get something like this. Now say we have some specific input down here: these two pixels really matter, and they influence the value of this neuron up here. This will be exactly the same kind of influence as in a different scenario, where we have the same number of inputs and neurons in the convolution layer, but where this input happens to be at a different location. In the second scenario, the input is not in the second and third element of the input but in the third and fourth element, and because we use the same kernel to convolve the input, we get exactly the same output and the same influence on the corresponding neuron up here as on the left, simply because it's the same input at a different location. Now, what does this mean for the different applications? Let's look at two applications that are relevant here. One is images: for an image, this means that if an object moves, its representation moves by the same amount, but the representation itself stays the same. And what does it mean for source code, the application we are most interested in in this course? It means that if you have, for example, a statement that has a particular property, let's say a statement that is buggy, or a statement that uses a particular API, then no matter where this statement occurs, its representation stays the same, and the network can reason about the statement whether it occurs at the beginning or at the end of a source code file.
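Here is a small sketch that illustrates this translation equivariance with the one-dimensional convolution from before; the concrete input values and kernel weights are made up for illustration:

```python
def conv1d(inputs, kernel):
    # Same helper as in the earlier sketch.
    k = len(kernel)
    return [
        sum(inputs[i + j] * kernel[j] for j in range(k))
        for i in range(len(inputs) - k + 1)
    ]

kernel = [1, 2, 1]  # assumed weights; learned in practice

# The same two-pixel "feature" (9, 9) at two locations,
# shifted by two positions:
a = [0, 0, 9, 9, 0, 0, 0, 0]
b = [0, 0, 0, 0, 9, 9, 0, 0]

print(conv1d(a, kernel))  # [9, 27, 27, 9, 0, 0]
print(conv1d(b, kernel))  # [0, 0, 9, 27, 27, 9]
```

The same response pattern (9, 27, 27, 9) appears in both outputs, shifted by exactly the two positions by which the feature moved in the input. In the code analogy, a statement would produce the same representation wherever it appears in the file.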
So if you have a statement with a particular property, its representation stays the same no matter where exactly the statement occurs. All right, now you know something about convolution. Let's look at the bigger picture and see how this idea of convolution is typically used within a more complex neural network to reason about hierarchical input data, for example an image or a program. Typically, convolution is used in combination with some other steps. One typical way of combining them, which we'll also see in the second part of this lecture where we use convolution for reasoning about programs, is the following. We have our input somewhere, and this input is given to a convolution layer, which works exactly the way we've just seen, or maybe multiple convolution layers. Given the output of the convolution layer, which has summarized different parts of the input into smaller entities, there is some kind of detection layer, which for example identifies objects or specific statements in the input. Then there is something called pooling, which essentially focuses the attention on the most interesting of the features detected by the previous layer, and then there may be some more parts. We look into pooling in a second; just to give you an idea of the detection layer, what typically happens there is that a nonlinear activation function turns whatever the convolution has summarized into some higher-level piece of information, for example the fact that there is a statement of a particular kind. Now I've just mentioned pooling, so let's have a look at what pooling actually is. It's essentially a form of downsampling. We have a vector representation of some size and would like to compress it into a smaller representation, while focusing on the most interesting or most relevant bits of information in the input. The basic idea of the different pooling approaches is that the output at a particular location is replaced by a summary of nearby outputs; we compress things that are close to each other into a smaller representation. The intuition behind pooling is that it doesn't really matter exactly where a feature is; what matters more is that the feature is present at all, and how it is located with respect to other features. For programs, for example, this could mean that it doesn't matter exactly where a particular statement is, but it matters that the statement is somewhere, and that it is maybe before or after or around another statement that is also present in the code. One way of implementing this idea of pooling is called max pooling, and this is the only one we look at here, because it also happens to be used in the approach we'll see in the second part of this lecture. The idea of max pooling is to summarize the values in a specific region of our vector representation by taking the maximum value within this region. Let's look at a concrete example. Say our input is a 2D matrix that looks like this, with some values computed as the result of our convolution.
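Before we walk through the numbers, here is a minimal Python sketch of 2 × 2 max pooling; the concrete matrix values are assumed for illustration, chosen so that the regional maxima match the walk-through below:

```python
def max_pool_2x2(matrix):
    """Downsample by replacing each non-overlapping
    2x2 region with its maximum value."""
    return [
        [
            max(matrix[i][j], matrix[i][j + 1],
                matrix[i + 1][j], matrix[i + 1][j + 1])
            for j in range(0, len(matrix[0]), 2)
        ]
        for i in range(0, len(matrix), 2)
    ]

# 4x4 result of some convolution layers; values assumed here
conv_output = [
    [1, 6, 2, 8],
    [5, 3, 7, 4],
    [0, 3, 1, 2],
    [2, 1, 4, 0],
]
print(max_pool_2x2(conv_output))  # [[6, 8], [3, 4]]
```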
So the original input would have been an even larger 2D matrix, but we applied convolution, maybe multiple times with multiple convolution layers, and at the end we have summarized the input into what you see here. If we now want to use max pooling to further downsample this information, say into a 2 × 2 matrix, we look at the maximum value in the different regions of the given input. We look at the maximum value in here and see that it is six, so we put a six here. For this region, we again look for the maximum value and find eight. The same here: in this region the maximum is three, and finally, in this last region, we find four. The idea is that these maximum values represent the presence of certain features, and because the arrangement of the regions is preserved, the blue region is still to the left of the pinkish region, we still see both that these features exist and how they are located with respect to each other, but everything is summarized into a smaller representation. All right, I hope this gave you some idea of convolutional neural networks, how they work, and the intuition behind them. We've already mentioned programs and source code a little bit in this first part of the lecture, but in the second part, which follows next, we look into how to actually use this idea of convolutional neural networks to reason about source code. Thank you very much for listening, and see you in the next part.