Hi, welcome back to analyzing software using deep learning. This is the second part of this module of the course, where we look at convolutional neural networks and how to use them to analyze programs. From the first part, you already know how convolutional neural networks work in general. What we do now is look at one specific application in the area of programs, where we'll see how they can be used for classifying programs. What we look at here is based on a paper from 2016 by Mou et al., so if you're interested in more details, please have a look at that paper. The overall idea is that we can apply the idea of convolution to programs because programs naturally have a hierarchical structure. You can think of programs as hierarchical data at a very coarse-grained level, where you say there are projects, projects may have packages, packages may have classes, classes may have methods, and so on. This also works at a more fine-grained level, for example at the level of code, where you can represent code as an abstract syntax tree. Being a tree, this is naturally a hierarchy that pretty well represents the way the code is structured. And this is exactly what is done in the approach that we are discussing here. The idea is to look at source code through its abstract syntax tree and to summarize this abstract syntax tree using a convolutional neural network, so that in the end the network is able to identify important features in the code and then use them for some kind of prediction. In principle, there are many different applications of this idea: you can basically predict whatever you want about a piece of code once you have summarized it in some suitable way, for example using convolutional neural networks. Today, we'll focus on classification, so on putting a piece of code into one out of multiple possible classes. These classes can be many things.
For example, you could predict where some code comes from. Let's say you have a piece of code and you do not really know whether it's from this project or from that project; then you may want to predict this using a convolutional neural network that serves as a classifier. You could also predict who has written the code. This may, for example, be interesting if you're not sure whether someone has copied code from some source that you may not want to allow copying from, or maybe you just want to know who has written this code but don't know it for whatever reason. Another kind of classification that can be very interesting is to look for specific instances of bug patterns. If you know specific bug patterns that people sometimes have in their code, and you have a way to train a model to distinguish between instances of these bug patterns and other code, then this is a classifier that you can use for finding bugs. In a similar way, you may also want to classify whether a piece of code is malicious or benign: if you have enough training data for that, you can try to learn a classifier that distinguishes between these two kinds of code. All of these applications are basically about classifying what code you have. One more specific kind of classification, and the one we will look at in this lecture, is identifying what functionality a piece of code is implementing. Maybe you just have this code but do not really know what exactly it's doing, and this approach can help you figure out what kind of functionality the code is actually trying to implement. So let's start with a rough overview of how this approach works before we look into more details.
The idea is that as the input we have a program, or maybe part of a program, and this program is represented as an abstract syntax tree, an AST, of its source code. What we do is feed this AST into a convolutional neural network. We'll look at what exactly happens inside this magic box of a convolutional neural network in a second, but what we get at the end is a single vector that represents a probability distribution over the different categories that we consider in our classification task. In order to get such a probability distribution, we use the softmax function that you've already seen in one of the earlier lectures. So what we get at the end are probabilities, and these probabilities indicate how likely it is that a given program is in each of the different categories that we care about in our classification task. For example, if we want to classify whether a piece of code is written by this author or that author or yet another author, we would have all the possible authors in this output vector, and each element of the vector would indicate the probability that a specific person is actually the author of the given program. Now, in order to look at source code as a hierarchical piece of data, we need some hierarchical representation of the code. As I already mentioned, in this piece of work this is a tree representation. It is based on the abstract syntax tree of the code, the AST, but the AST has been modified a little bit so that it's essentially a binary tree, which means that each node has at most two children. Once you have modified the tree in this way, it is called a continuous binary tree. How exactly this tree is constructed goes a little bit beyond what we'll cover here; if you want to know the details, please have a look at the paper.
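As a reminder, the softmax step mentioned above turns the network's raw output scores into the probability distribution over categories. A minimal sketch; the scores and the author-classification framing are made up purely for illustration:

```python
import numpy as np

def softmax(scores):
    """Turn raw network scores into a probability distribution."""
    # Subtract the maximum score before exponentiating, for numerical stability.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical raw scores for three candidate authors of a program.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)           # three probabilities that sum to 1
print(probs.argmax())  # index of the most likely author: 0
```

The element-wise probabilities always sum to one, which is exactly what makes the output vector interpretable as "how likely is each category".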
But the basic idea is to take the normal AST and, whenever a node has more than two children, create some artificial nodes in order to make the tree a little deeper while having at most two children per node. Let's have a look at an example of what such a tree looks like, so that you can get a better intuition. Say we have a piece of code where an integer variable a is declared and then initialized with a statement saying that a gets the value of b plus three. This would be represented as a tree that says there is a declaration, and this declaration consists of a type declaration and some code used to initialize the value, which in this case happens to be a binary operation. The type declaration consists, for example, of the type itself, in this case int, and an identifier, in this case a. Similarly, for the binary operation, we have an identifier, in this case b, and a constant, in this case three. In practice these ASTs can of course contain more details. How exactly the AST is constructed here doesn't really matter; what matters is that we have a tree representation that captures the hierarchical structure of the code, and that this tree is binary, so at most two children per node. Now that you've seen the rough overview and know a little bit more about what our input looks like, let's have a slightly more detailed overview of the different steps of this approach. The input, as we've seen, is some kind of binary tree, which for example may look like this, and then there are actually two steps. The first is a step called representation learning, and the goal of this step is essentially to represent each of the nodes in our input tree as a single vector. So what we'll get is something like this.
This is the vector that represents the root node, then we have a vector here that represents one of the children, another vector like this, and the same on the right-hand side. The overall structure of the tree we get is the same as in the input, but in contrast to the input, which was the AST itself, we now have vector representations of each AST node. Based on this representation of each AST node as a vector, we feed the tree into the actual convolution, which consists of something called tree convolution, which essentially just means that we apply the idea of convolution to a tree, so to the hierarchy described by the tree, followed by some pooling in order to downsample the vectors that we get. Out of this comes a fixed-size vector which, no matter how large the tree is, will always have the same size. This is then given to another layer, a hidden layer, which at the end produces our output based on the softmax function, giving us a probability distribution across the classes into which we want to classify. In contrast to what is done in the tree convolution itself, the connections that you see here, from the fixed-size vector to the hidden layer and from the hidden layer to the final output layer, are fully connected: every neuron is connected with every neuron. This is fine in terms of computational complexity simply because the possibly very complex AST has already been summarized into a fixed-size vector. Let's now look into these two boxes in some more detail, starting with the first box, which is about representation learning. How does this representation learning work, and what is actually the purpose of doing it? That's what we want to look at now. The idea is to represent every AST node as a fixed-size vector, and this vector is called an embedding.
In a later part of the course we look more into embeddings, how they are actually trained, and what they are good for. For now, all you need to know is that the structure of the tree stays exactly the same. For example, if the input tree looks like this, then the output tree will have exactly the same structure, so the exact same number of nodes and edges and the same overall shape, because each of the input AST nodes corresponds to one of the vectors in the output. The property ensured by this representation learning, and in general by vector representations called embeddings, is that nodes that are similar, and similar here means in a semantic way, should have similar vector representations. For example, if we have nodes that represent, let's say, while and for statements, then these should probably have similar vector representations, because both are structures in the code that represent a loop. But, for example, the representations of a while statement and of some constant should not be similar, because a while statement and a constant are not really the same kind of thing. This embedding is learned in a separate pre-training step, and because we look into embeddings in a different part of this course, we do not look at it in detail here. All you need to know for now is that these representations are learned in a separate pre-training step that is independent of the convolutional neural network. Let's now look at the main part of this approach, which is the tree-based convolution. The input given to this step is the tree of vectors that we've just obtained from the representation learning, where every node corresponds to one AST node.
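The representation learning step described above can be pictured as a simple lookup that replaces each node type by its learned vector while leaving the tree shape untouched. A minimal sketch, where the node types, the two-dimensional vectors, and the tuple-based tree encoding are all made up for illustration; the real embeddings are learned in the pre-training step and have many more dimensions:

```python
import numpy as np

# Made-up embedding table; in the approach these vectors are learned,
# and semantically similar node types end up with similar vectors.
EMBEDDING = {
    "while": np.array([0.9, 0.1]),
    "for":   np.array([0.8, 0.2]),   # close to "while": both are loops
    "const": np.array([0.0, 0.7]),   # far from the loop nodes
}

def embed(node):
    """Replace each (type, children) node by (vector, children).
    The number of nodes, edges, and the overall shape stay the same."""
    node_type, children = node
    return EMBEDDING[node_type], [embed(child) for child in children]

tree = ("while", [("for", []), ("const", [])])
embedded = embed(tree)
```

After `embed`, the tree still has one root with two children; only the payload of each node has changed from a type label to a vector.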
What we want to get out of this tree-based convolution is another tree, but in this output tree we would not like to have these local representations of individual AST nodes; instead, nodes should summarize the features of their children, so that in the end the root node basically summarizes the entire piece of code. So again, this will be a tree of vectors, but now the nodes do not just focus on a single AST node; they summarize the features of their children. Let's have a look at how this works and illustrate it with a little example. Say our input tree looks like this: we have a binary tree of vectors, and each of these vectors has the same size. Our output tree will have the same structure, so we again have a tree of vectors that looks more or less the same. Actually, I don't really need arrows here, so let me just remove those. The difference is what these vectors represent, because in the output tree they do not just represent a single node but basically everything that is under the node, or specifically the children of the node, and, because we apply this idea recursively, everything under the node. The idea for computing this output tree is to move a fixed-depth feature detector over the tree. Essentially, we have a kind of triangular feature detector that looks like this, and it is moved over the tree: we move it here and summarize this subtree into something that is then stored here, then we move the feature detector, for example, here, where it summarizes everything it sees, and we put the result into this vector here. So, to write this down, we are moving a fixed-depth and actually also fixed-width feature detector, which is basically what is called the kernel in convolutional networks, over the tree.
This feature detector sees a triangle that looks like this, with something at the top, a left part, and a right part, and because we have enforced that this is always a binary tree, we know that there is exactly one left part and one right part. What the convolution does is summarize the left part and the right part together with the top and write the result into the top part, and this works as follows. The top part, here called y, is computed using some activation function, which in this approach is the hyperbolic tangent, and it summarizes what we see inside this triangle. Specifically, what is summarized is the value at the top, so just the single node at the top of this triangular subtree, multiplied with a weight matrix used for the convolution; then we do the same with what we see on the left, which is also multiplied with a weight matrix, but a different one, and similarly for the right part. Finally, some bias is added at the end. These four components are all this kernel has to learn: the weight matrix for the top part, the weight matrix for the left part, the weight matrix for the right part, and the bias vector. These are all the weights and biases of the kernel. And as always with convolution, the same weights and biases are used for all the different places the feature detector is moved to. So for triangle one and triangle two, the same weights are used, eventually learning how to summarize these three nodes into a single node. Once we have summarized the tree using this idea of convolution, we still have a problem, because at the end we would like to get one fixed-size vector that represents the entire tree.
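The per-triangle kernel computation just described can be sketched as follows. The vector dimension and the random weight values are made up for illustration; what matters is the structure: one weight matrix each for the top, left, and right node, plus a bias, combined through the hyperbolic tangent:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # size of the node vectors (made up for illustration)

# The only parameters the kernel has to learn:
# three weight matrices and one bias vector.
W_top, W_left, W_right = (rng.standard_normal((dim, dim)) for _ in range(3))
bias = rng.standard_normal(dim)

def convolve(top, left, right):
    """Summarize one triangle (a node and its two children) into a
    single vector y, using tanh as the activation function."""
    return np.tanh(W_top @ top + W_left @ left + W_right @ right + bias)

# The same weights are reused wherever the feature detector is moved.
y = convolve(rng.standard_normal(dim),
             rng.standard_normal(dim),
             rng.standard_normal(dim))
```

Because tanh squashes its input, each output component lies strictly between -1 and 1, and the output has the same dimension as the node vectors, so the result can again sit in a tree of vectors.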
But what we now have is a tree of arbitrary size, because we do not know in advance how large the AST is, and different ASTs will have different sizes. In order to summarize all of this into a fixed-size vector, the approach uses pooling. Before this pooling step, what we have is a tree of fixed-size vectors, where each vector represents what we've just seen, that is, what we get out of the convolution: each vector represents a subtree, so that at the end the top node represents the entire tree. But this tree has a varying number of nodes and a varying depth, and what we would like to have instead is a single fixed-size vector, because that is what we want to feed into the rest of the network, the hidden layer and the classification at the end. So how do we get this? The approach used here is max pooling, a form of pooling very similar to what we've already seen in the first part of this lecture, where for each dimension of our fixed-size vector we simply use the maximum value observed anywhere in the tree. Let's have a look at a concrete example. Say the tree that we get out of the convolution looks like this. At the top, let's say, we have three elements, so our vectors are all of length three, with values eight, five, and one. Let's say this child node has values three, five, and two, and its child node has values one, two, and one. Down here we also have some values, all vectors of the same length, and here at the end we have one, one, and nine. What we want to get is one fixed-size vector that will also have length three, because we pick the maximum value for each dimension.
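This dimension-wise max pooling can be sketched as follows, using the vectors from the example that have concrete values (the remaining nodes of the example tree are omitted). Note that for pooling the tree structure no longer matters, so a flat list of node vectors is enough:

```python
import numpy as np

# Vectors produced by the tree convolution, one per tree node.
node_vectors = np.array([
    [8, 5, 1],  # root
    [3, 5, 2],  # one child
    [1, 2, 1],  # its child
    [1, 1, 9],  # a node further down
])

# For each dimension, keep the maximum value seen anywhere in the tree.
pooled = node_vectors.max(axis=0)
print(pooled)  # [8 5 9]
```

The result always has the length of one node vector, no matter how many nodes the tree contains, which is exactly what gives us the fixed-size summary.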
So we look at what the maximum value is among all the first elements in this tree, and looking at the values we see that the highest first element is this one, with a value of eight. For dimension two, we look at all the second elements and see that here and here we have a five, and there's no higher value. And for the third dimension, we see that here there's a value nine, which is higher than any other value. So in the end our fixed-size vector would be eight, five, and nine. All right, so now you've seen the overall architecture. The question is what you can actually do with it, and what the authors do in the work we talk about here is to identify the functionality of a given program. They look at this in two different scenarios. One of them is a one-out-of-n scenario, where they know that there are 104 different programming tasks that have been solved in the given pieces of code, but do not know which out of these 104 tasks a given piece of code solves. So you want to find out whether it's task one or task two or task three, up to task 104. In scenario two, the problem is formulated as a binary classification: they pick out one of these tasks, for example a task about implementing bubble sort, and the classifier tries to find out whether a given piece of code is actually an implementation of, say, bubble sort, or something else. Let's have a look at how the authors evaluate these two scenarios and what kind of results they get. For the first scenario, the one-out-of-n scenario, they take the solutions that have been submitted to an online programming education platform, where people try to solve these 104 problems and many different people have submitted different solutions. In total they have 500 solutions for each of these problems.
They then split this data into training, validation, and testing data using a three-to-one-to-one ratio. So they use three out of five parts of the data points for training, one out of five for validating and tuning the hyperparameters, and finally report results on the remaining 20% of the data. What they get here is an overall 94% accuracy, so the trained model is pretty accurate in predicting which of the 104 programming problems a solution actually addresses. To evaluate the second scenario, the binary classification scenario, the authors do the following. They want to find out whether a piece of code is actually an implementation of bubble sort. One reason why you may want to do this in practice is that bubble sort is known to be pretty inefficient and should probably be avoided, so why not try to automatically find instances of bubble sort in a given piece of code. To train a model that can identify whether a piece of code implements bubble sort, they construct a balanced training set consisting of 109 programs that do implement bubble sort and another 109 programs that implement something else, but certainly not bubble sort. For the evaluation, they inject code snippets known to be bubble sort into 4,000 other programs, and also take 4,000 programs into which no bubble sort code has been injected, under the assumption that these programs do not implement bubble sort but something else. The question then is: can the network, trained to distinguish bubble sort code from other code, identify which of these 8,000 programs are or contain bubble sort and which do not? The answer is that it can: the authors report 89% accuracy on this binary classification task, so on this task, too, the convolutional neural network is pretty successful.
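The three-to-one-to-one split used in the first scenario can be sketched as follows; the data here is just a list of placeholder indices (in the paper, the data points are the submitted solutions), and the seed is arbitrary:

```python
import random

# Placeholder data: one index per solution.
data = list(range(100))

random.seed(42)
random.shuffle(data)  # shuffle before splitting so the parts are unbiased

# 3:1:1 ratio -> 60% training, 20% validation, 20% testing.
n = len(data)
train = data[: int(0.6 * n)]
val = data[int(0.6 * n): int(0.8 * n)]
test = data[int(0.8 * n):]
```

The validation part is what the authors would use to tune hyperparameters, while the held-out test part yields the reported accuracy numbers.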
Now, whether this really means 89% success in finding bubble sort in practice is a slightly different question, because of course not half of all programs contain bubble sort, so the data you would see in practice would not be as balanced as it is here. But this is still a pretty interesting result. All right, this is already the end of the second and last part of this module of the lecture, where we have looked into convolutional neural networks and how to use them to summarize code in order to, for example, make classifications about the code. I hope you found it interesting. Thank you very much for listening, and see you next time.