 Hi, welcome back to Analyzing Software Using Deep Learning. We are now in part three of three in this module on token vocabulary and code embeddings. And what we'll do in this third part is to look not only into how to learn embeddings for tokens in a programming language, but how to learn a joint embedding that covers both natural language information and programming language information. This third part will be mostly based on this paper that you see down here, which is called Deep Code Search. So if you're interested in more details, please have a look at this paper. The line of work that we are talking about here is motivated by the observation that software is not just code. In practice, there are a lot of other artifacts associated with code. And in particular, there are many natural language artifacts. So think of documentation that comes with code or comments that are written into code or maybe some requirement documents that are describing what the code is actually supposed to do. Now, if you want to analyze software in its entirety, so not just the code, but also this natural language information, we need to reason about both the programming language information that is given in the code and the natural language information that is given in some other form. For example, we can do this for an application that we'll also look at here in this lecture, which is to predict some code snippet from a given natural language query. So it's essentially a search engine for source code that takes natural language queries as an input. Another possible application where it is useful to reason about both PL and NL information is if you want to predict or maybe check comments against code. So in the prediction case, you would take a piece of code and then try to automatically predict a suitable comment for it. 
In the checking case, you would take a piece of code and a comment and check if the two actually match, or if maybe the comment is outdated or just not describing completely what the code is doing. A third possible application is to learn from API documentation. So for example, you could try to learn from a given piece of API documentation how an API should be used, and then capture this in a learned model. So now the question is, how can we reason about both the programming language information and the natural language information? And here we will look into doing this on the level of tokens and words. So the program is basically represented as a sequence of tokens, as in the rest of this module. And the natural language information is represented as a sequence of words. The key idea here is to learn a joint embedding space. That basically means that we are not just learning one embedding for programming language tokens and another embedding for natural language words. Instead, we're trying to embed both programming language tokens and natural language words into the same single vector space. The goal here is that any token and word that are related to each other will be close by in this space. So we will not only have related tokens close to each other and related words close to each other, but we will also have tokens and words that are related to each other close by in this vector space. As one application that is built on this idea of a joint vector space, let's have a look at so-called deep code search. What this approach is essentially doing is to take a natural language query that describes some piece of code that you're looking for. For example, it could say "read an object from XML", because you want to find some code that shows you how to read an object from a given piece of XML. And then there is this learning-based code search engine, which will give you some code snippet that matches the given natural language query.
For example, for this given query here, it might return this piece of Java code, which, if you look closely, is actually about reading an object from a piece of XML. Now what is interesting here is that even though this code actually matches the natural language query, there are not many words or terms or tokens that are shared between the query and the code snippet. There are terms that are similar: un-marshalling, for example, means to read something from a serialized form, such as a file, but it's not the same word as read. So if you just compared the words with each other, it would be difficult to find this piece of code given the query. But because deep code search uses a joint vector space of natural language words and programming language tokens, it is able to see that reading and un-marshalling are somewhat related, and therefore it can return this code snippet. So let's look a little deeper into this green box that shows the search engine from natural language query to code, and let's have a look at how it uses this idea of a joint space where we embed both code and natural language information. What I'll give you is an overview of the approach, and then in a minute we'll also look more into the neural network models that are actually used here. What we get as the input is two things. One is the source code, and the other is some natural language description of the source code. You can think of this description as a query that someone might type into the search engine, but it could also just be some kind of documentation or comment associated with the code. And now for each of those, there's some neural network that computes an embedding of the given piece of information. So here we have a code embedding network, and we'll see in a second how exactly that works. What it computes at the end is some vector that represents the given piece of code, so a code vector. And on the other side, we also have a description embedding network.
So another neural model that takes the given natural language description and also summarizes it in a single vector, which is here called the description vector. And now given these two vectors, the model compares them by computing the cosine similarity between them. The overall goal of this architecture is for the two vectors to be very similar to each other if the description actually describes the code, and not similar at all if the code is doing something very different from what the description is saying. Let's now have a more detailed look at what these two embedding networks look like by examining the neural model used in this deep code search approach. On the one hand, we have the code. And the way this code is represented is a little bit unusual, in the sense that they do look at the tokens, but only at a set of tokens. So they do not really keep the order in which the tokens appear in the method, but just look at the set of tokens. Let's say we have these tokens T1, T2, and T3, and then some others. What happens next is that each of these tokens is fed through a fully connected layer. And these layers are all the same, so you can basically think of them as an embedding that is trained jointly with the rest of the model. What comes out of this layer is a vector. And because we have multiple vectors, one for each token in the set of tokens, we need to combine these vectors in some way, and here max pooling is used. What comes out of this is finally the one code vector that summarizes all the information about the set of tokens in this piece of code. In addition to this, the actual approach as described in the paper is using two other pieces of information, namely information about the method name and information about the APIs that are called inside the method.
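To make the token side concrete, here is a minimal NumPy sketch of the code embedding just described: each token in the set is mapped through one shared layer, and the resulting vectors are combined by max pooling. The vocabulary, dimensions, and random weights are made up purely for illustration and are not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token vocabulary and embedding size, for illustration only.
vocab = {"read": 0, "object": 1, "xml": 2, "parse": 3}
embed_dim = 8

# One shared layer (here just an embedding matrix) applied to every token;
# in the real model it would be trained jointly with the rest of the network.
W = rng.normal(size=(len(vocab), embed_dim))

def embed_code(tokens):
    """Embed a *set* of code tokens: look up each token's vector,
    then combine them with element-wise max pooling."""
    vecs = np.stack([W[vocab[t]] for t in set(tokens)])
    return vecs.max(axis=0)  # max pooling -> one code vector

code_vec = embed_code(["read", "object", "xml"])
print(code_vec.shape)  # one fixed-size vector per method
```

Note that because the tokens are treated as a set and max pooling is order-invariant, shuffling the tokens gives exactly the same code vector.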
But I'm just focusing here on the tokens because that's the main theme of this module of the course. On the other side, there also is the description. And this description is modeled as a sequence of words. The way these words are then reasoned about by the model is as follows. We have this sequence of words, W1, W2, W3, and so on. And all of these words are fed into one recurrent neural network, which essentially works as we've described in an earlier module of this course that was specifically about RNNs. And then what we get out of this RNN is the hidden state of the recurrent neural network after each of these words has been fed into the network. So we could just take the hidden state at the very end, which summarizes the entire sequence. What they do instead is to take all of these hidden states, one after feeding each of these words into the network. And then again, these are combined using max pooling. This max pooling results in one vector, and this is then the description vector that summarizes the entire description. Given these two vectors, it works just as in the overview figure: the cosine similarity is computed so that, hopefully after training, if the description describes the piece of code, then the cosine similarity is high, and otherwise it is low. So let's now have a look at how this model is actually trained. The idea for training the model is that the approach looks at triples of a code snippet, a matching natural language description or query that describes what this code snippet is about, and a non-matching description that describes something else, but not what is really in the code. So we have these triples of the code C, the matching description D plus, and a non-matching description D minus.
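The description side can be sketched in the same spirit: a recurrent network reads the word vectors one by one, and instead of keeping only the final hidden state, all hidden states are max-pooled into a single description vector. The plain RNN cell, the sizes, and the random weights below are my own simplifying assumptions; the actual model uses trained weights and a more sophisticated recurrent unit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes and randomly initialized (untrained) weights.
word_dim, hidden_dim = 8, 8
W_in = rng.normal(size=(word_dim, hidden_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1

def embed_description(word_vecs):
    """Run a simple RNN over the word vectors and max-pool over
    all hidden states (not just the last one)."""
    h = np.zeros(hidden_dim)
    states = []
    for w in word_vecs:                  # one step per word
        h = np.tanh(w @ W_in + h @ W_h)  # update the hidden state
        states.append(h)
    return np.stack(states).max(axis=0)  # description vector

# Stand-in word vectors for a four-word description,
# e.g. "read object from XML".
words = rng.normal(size=(4, word_dim))
desc_vec = embed_description(words)
print(desc_vec.shape)
```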
And then what we want to have is that for the matching pair, so for the pair C and D plus, the predicted cosine similarity should be high, because this description is actually describing what the code is about. Whereas for the non-matching pair, so C and D minus, this cosine similarity should be low, because this code and this description are just not supposed to be similar. And this is what is expressed down here in this loss function. Ideally, the loss should go very close to zero; so if the model works well, the loss will be very small. This loss is computed as the sum over all of these triples that are created from the training data. For each triple, there are two calls of the cosine similarity. The similarity between C and D minus contributes positively, so a high similarity for the non-matching pair incurs additional loss. The similarity between C and D plus contributes negatively, because what we want at the end is that for C and D plus, this cosine similarity is pretty high. To make sure that the loss per triple is never negative, there's also this max function here. And then all these losses from each of the triples are added up to get the overall loss. Finally, let's have a look at the results reported in this paper. They have trained their model on 18 million Java methods and corresponding natural language descriptions, which they extracted from comments associated with these methods. These comments are not exactly what a query given to a code search engine might look like, but they basically take them as a surrogate for such queries, because it's difficult to get millions of queries associated with code, but it's relatively easy to get millions of comments associated with methods.
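Written out in code, the per-triple loss described above could look as follows. The margin value is an assumption I've made for illustration, not necessarily the constant used in the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triple_loss(c, d_pos, d_neg, margin=0.05):
    """Loss for one (code, matching desc, non-matching desc) triple:
    max(0, margin - cos(c, d+) + cos(c, d-)).
    High cos(c, d-) adds loss; high cos(c, d+) reduces it; the max
    keeps the per-triple loss non-negative. Margin is an assumption."""
    return max(0.0, margin - cosine(c, d_pos) + cosine(c, d_neg))

c     = np.array([1.0, 0.0])   # code vector
d_pos = np.array([0.9, 0.1])   # similar to c -> high cosine
d_neg = np.array([-1.0, 0.2])  # dissimilar to c -> low cosine
print(triple_loss(c, d_pos, d_neg))  # -> 0.0, the model gets this triple right
```

Swapping the matching and non-matching descriptions makes the loss large, which is exactly the training signal that pushes matching pairs together in the joint space.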
And once they've trained the model on this data, they evaluate whether the model actually works in the intended way by picking 50 questions that people have asked on Stack Overflow and basically checking whether feeding the text of the question into the search engine actually retrieves a code snippet that also showed up in the answers to this question. And what they found is that the correct code snippet is actually predicted at position one or two for most of these queries. The way these positions are computed is by basically querying the model that we've seen multiple times, each time asking: is this code similar to the given query, or is this other code similar to the given query? And then the code that is most similar to the given query is ranked at position one, the code that is second most similar is ranked at position two, and so on. All right, and that's already the end of the third part in this module on how to model the token vocabulary of a piece of code. What we've seen specifically in this third part is that sometimes it's not only useful to model the tokens that you have in code, but you also want to model natural language information, and in this case, even embed both of these pieces of information into the same vector space. Thank you very much for listening and see you next time.
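The ranking step itself is then just a sort by cosine similarity. Here is a small sketch with made-up query and code vectors; in the real system these would come from the two trained embedding networks.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_snippets(query_vec, code_vecs):
    """Rank candidate code vectors by cosine similarity to the query
    vector; the first index is the position-one (most similar) snippet."""
    sims = [cosine(query_vec, c) for c in code_vecs]
    return sorted(range(len(code_vecs)), key=lambda i: sims[i], reverse=True)

query = np.array([1.0, 0.0, 0.0])          # embedded query
candidates = [np.array([0.0, 1.0, 0.0]),   # unrelated snippet
              np.array([0.9, 0.1, 0.0]),   # closely matching snippet
              np.array([0.5, 0.5, 0.0])]   # partially matching snippet
print(rank_snippets(query, candidates))    # -> [1, 2, 0]
```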