Hi, welcome everybody to Analyzing Software Using Deep Learning. We are in summer 2020 and this is a course at the University of Stuttgart. What we do in this module of the course is to look at sequence-to-sequence models. So this is a particular kind of neural network architecture, and we will see how to use it for analyzing software. So let me just give you a brief overview of what we want to do in this module. As usual, there are three parts. The first one is about the architecture itself. So we will look into how sequence-to-sequence networks actually work, why they are useful, and when they should be used. And then we'll have a look at two applications that use them to analyze software. The first one is about using them to find API usage sequences based on a natural language query. So basically you're asking for a particular API and how to use it by just expressing yourself in natural language, and then a model will give you an example, a code snippet that may be what you're looking for. And then in the third part, we will look at interpreting Python programs. So it's kind of a crazy idea where we are training a model to interpret a program, so to basically execute a program and tell us what results it will produce. All of this is based on sequence-to-sequence networks. So we'll start by looking at what these networks are actually about. All right, so what are these sequence-to-sequence models? The basic idea is that you have a sequence of items that you want to translate into another sequence of items. These items can be many different things, so you can use this overall architecture for many, many different purposes. But the basic idea is always the same: you have a sequence of something and you want to translate it into another sequence of something. So just to give you an idea of some of the applications of this kind of model, beyond what we will discuss in the rest of this module of the course.
So here are a few of the applications that sequence-to-sequence models are used for. One, and it's maybe the most popular of all, is to translate between different natural languages. In that case, the input sequence would be, say, a sequence of words in French and the output sequence would be a sequence of words in, say, English. Another example is to generate image captions. In this case, the input would be some representation of an image, for example, a sequence of pixels, and the output would be a caption, so a sequence of words that describes what is on the image. Yet another application is to summarize videos into text. In this case, every item in our input sequence is a frame of the video, so basically one image that is part of this video. And then what the model does is to summarize what you see in this sequence of images into some text, which is then emitted as a sequence of words. And finally, a fourth application that sequence-to-sequence models are also used for is to answer natural language questions. Here the input is a sequence of words and the output is also a sequence of words that hopefully answers the question that is asked in the input sequence. So let's have a look at what this sequence-to-sequence architecture typically looks like. I'll start by giving an overview of the architecture, and after that, we will look into some of the components of this architecture in more detail. At a high level, this architecture consists of two sub-networks, if you want, or two neural models that are part of this overall model. One is the encoder, which essentially takes the input sequence and encodes it into a vector representation. Typically, this is implemented as a recurrent neural network, an RNN, which we have already seen in a previous module of this course. And what the encoder does, given the input sequence, is to encode it into a single vector, which is called the context vector.
Let me first add the input. So the input is a sequence of some items, and let's say the length of this input is m. What the encoder does is to summarize this input into this context vector, which may have any size. Typically, it's a relatively short vector of maybe 100 or 200 numbers. And then, given this context vector, the second main component of this sequence-to-sequence architecture is a decoder, which typically is also implemented as a recurrent neural network. What this decoder does is to take the context vector and to emit a sequence, which is then the output sequence of this whole model. And this sequence may have length n. What you notice here is that m and n are different letters, and the reason is that they may actually also be different lengths. So m does not have to be the same as n; you can have an input length that is different from the output length. To illustrate this, let me just give you a simple example, which is not related to software, but is related to one of the other applications of sequence-to-sequence models, namely translating natural language from one language into another. So let's say our input sequence is a sentence in German, and let's say it's this extremely important German sentence: Staubsauger sind laut. And let's say we want to translate this into English; then the output sequence could be something like: vacuum cleaners are noisy. What you see in this simple example is, first of all, that both the input and the output are sequences, namely sequences of words in this case. And you also see that the lengths of these sequences may differ, because the input has a length of three and the output has a length of four. Now, given this overall architecture, one question you might ask is how all of this is trained. One option would be to train the encoder and train the decoder separately and then to stitch everything together.
But what we do here instead is to train both networks jointly, which has the big advantage that the encoder and the decoder basically know what the other one is doing. So the encoder will encode the input sequence into a context vector that the decoder is able to understand. And to get this, the key idea here is that both networks are trained jointly. The whole goal of this training, and the reason why we have this context vector in the middle, is that the context vector essentially summarizes all the input. And it does not just summarize it in some way, but it summarizes it in a way that is suitable to generate the output sequence. That also means that, for example, some details of the input that are not needed to produce the output may not be part of the summary, because the context vector only needs to provide the information that helps the decoder produce the right output sequence. So the summary must be suitable to generate the output sequence. All right, so this is an overview of the sequence-to-sequence architecture. Next, let's look into some more detail of how this encoder and decoder really work. Let's start with the encoder RNN and how this component of the sequence-to-sequence model works. The encoder is a recurrent neural network, so this will be a bit similar to something we have already seen in another module of this course. And similar to what I did previously, I'll now again show you the time-unfolded representation of this network, where we basically see the different steps that this network is taking given the different items in the input sequence. So the input is split into different items. We have some item x at time t minus 1, and then we have another input that comes in at time t, so this is x of t. And this goes on and on until some time point tau, where t equals tau just means that this is the final time step.
And when I say time, I do not really mean time as in wall-clock time; this is really about the sequence of items that we get as an input. So tau means this is the last item in the input sequence. Now, with each of these time steps, we have a hidden layer inside our RNN, and the input is combined with the state of the hidden layer that we get from the previous input that we've seen. So at t minus 1, this means we are basically producing h of t minus 1 given what we get from the previous step. And this is done based on two matrices that basically control how the inputs are handled in order to get the next hidden layer: we have this matrix w here and the matrix u here. We'll see in a second how this really is put together. Then at time step t, the same happens basically again. We have this h of t, which now takes the h of t minus 1 and combines it with w, and the same happens down here with the input x, which is then combined with u. This goes on and on and on until we reach the last time step, where we have this h of tau. The final output of this h of tau is then again multiplied with another matrix called v, and this gives us a vector called y of tau. And this is the vector that we will eventually use as the context vector of our network. So the idea is that this output vector here is the fixed-size vector that represents the entire input sequence. I'll just mark this with some color, because then we can use it again on the next slide, where this vector will be used as the input to the decoder RNN. Now, I've basically given you a nice figure to show how this RNN works. Let me also just briefly give you the equations to really see how these different values are computed. For example, we take the hidden vector h at time step t. We take some activation function, for example tanh, and apply it to the different inputs that are given into this step. And this is the h, so the hidden vector from t minus 1, from the previous time step, multiplied with w.
We also have u multiplied with our input at this time step, so x of t, and then there also is some bias added to all of this. Then eventually, to get y of t, or actually it's y of tau because we do this at the very end, we take this other matrix v, multiply it with h of tau, so the final hidden state, plus some other bias, which is called c here. Of course this doesn't have to be tanh; this function here could be any other activation function as well. So now you've seen the encoder RNN, and what it produces is this little vector that I've marked with pink color: that is the context vector, which summarizes all we know about, or all we need to know about, the input sequence. Now, the second component of our sequence-to-sequence architecture is the decoder RNN, and let's have a look at how this works. What it gets as an input is this vector, which was previously called y tau, but here it's called x because now it's the input to this decoder model. Just to show that this is actually the same vector as what we had before, let me just mark it again here with the same color. So just going back: this is exactly the same pink vector that we have obtained as the output of the encoder RNN, which now serves as the input to the decoder. What we eventually want to get as the output of the decoder is a sequence of output items that each are represented as one vector. So we will have a time step t minus 1 where some vector y is produced, same for t, so we have y of t, and then same for y of t plus 1. And this goes on and on and on until we are at the end of the output sequence. Now, how are these outputs produced? Well, again, there's a hidden state in our recurrent neural network, and for every time step, you basically have one state of this hidden vector h. So we have an h of t minus 1, and it takes our input x as one input. It takes the previous hidden state as another input and then uses this to produce the output at this time step.
All of this is controlled by different weight matrices, as usual. So there's a weight matrix that is multiplied with the input x; this is called r. There is a weight matrix that is multiplied with the previous hidden state; this is w prime. And there's a weight matrix that is then multiplied with the hidden vector at this time step in order to get the output vector of this time step; this is called v prime. And then the same basically happens for every time step. So we also have a hidden vector at time t, which takes the previous state of the hidden vector and the same input x. So this input x is basically fed into the decoder at every time step, again and again and again. And this then produces our output at time t. All of this is controlled by the same matrices again, and then the same also for t plus 1, and of course also for any other time steps that follow before and afterwards. So the crucial link between the encoder and the decoder is this vector down here, because this is the fixed-size vector that not only summarizes the input, but is also used to generate the entire output sequence. Let me also give you the formulas for computing the values of h and also of y at every time step. If you want to compute the hidden state at some time step t, then we again take the hidden vector of the previous time step and multiply it with w prime. We also take our input, which is multiplied with r. And then we also have some bias here, called b prime. As usual, all of this goes through some activation function, for example tanh. And then to compute the output, we now look at the hidden state at time step t, multiply it with this matrix v prime, and also add some bias c prime. Now, because what we want to predict here is an item in a sequence, we basically need to give a probability distribution over a set of possible items. So the idea is we have some output vocabulary, for example, all the words we have in the output language.
And in order to predict which of these words is the next in our sequence, we need to basically give a probability to each of the possible output words. From one of the previous modules of the course, you already know how to do this: it's by applying a softmax function here, so that we basically get a probability distribution at each of these output vectors y, which will then be interpreted, for example, by always taking the output item with the highest probability. So now you've seen the architecture. Let's now have a look at how this sequence-to-sequence model is trained. As usual for training, we need some input data, and as usual we here look at supervised learning. So we basically need pairs of input and output sequences. Let's say here we have N such pairs, and each pair is called (xi, yi), for i between 1 and N, where xi is the input sequence and yi is the output sequence. In order to tell the model when these sequences are over, we need to mark the end of each of these sequences with a special symbol, and we just call this special symbol EOS, which stands for end of sequence. So why do we need this EOS? The main reason is that we need to give the decoder a way to tell us when the sequence is over. Otherwise it would basically continue predicting more and more words, and we wouldn't really know when to stop. By giving this special symbol for the end of the sequence, we allow the decoder to basically indicate that now it's enough: I've predicted all the words that I think should be in the output sequence, and now you should stop. So let me illustrate this with the example of German to English translation that we have already seen earlier.
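To make the pieces we have seen so far more concrete, here is a minimal NumPy sketch of the encoder-decoder forward pass: the encoder equations, the decoder with its softmax output, and the negative log probability of a target sequence ending in EOS, which is the quantity the training objective is built from. All sizes, weights, and vocabulary ids here are made-up stand-ins for illustration, not values from the lecture.

```python
import numpy as np

# Illustrative sizes (assumptions): input item size, hidden size, context size, vocabulary size.
d_in, d_h, d_ctx, vocab = 4, 8, 8, 6
EOS = vocab - 1                          # reserve the last id as the end-of-sequence symbol

rng = np.random.default_rng(0)
# Encoder parameters: h_t = tanh(w h_{t-1} + u x_t + b), context = v h_tau + c
w = rng.standard_normal((d_h, d_h)) * 0.1
u = rng.standard_normal((d_h, d_in)) * 0.1
b = np.zeros(d_h)
v = rng.standard_normal((d_ctx, d_h)) * 0.1
c = np.zeros(d_ctx)
# Decoder parameters: h_t = tanh(w' h_{t-1} + r x + b'), y_t = softmax(v' h_t + c')
w_p = rng.standard_normal((d_h, d_h)) * 0.1
r = rng.standard_normal((d_h, d_ctx)) * 0.1
b_p = np.zeros(d_h)
v_p = rng.standard_normal((vocab, d_h)) * 0.1
c_p = np.zeros(vocab)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs):
    """Run the encoder RNN over the input items and return the context vector."""
    h = np.zeros(d_h)
    for x in xs:                         # one step per input item; length m may vary
        h = np.tanh(w @ h + u @ x + b)
    return v @ h + c                     # fixed-size summary of the whole input

def decode_loss(context, targets):
    """Sum of negative log probabilities of the target items for one
    input-output pair; targets ends with the EOS id."""
    h = np.zeros(d_h)
    loss = 0.0
    for t in targets:                    # the context vector is fed in again at every step
        h = np.tanh(w_p @ h + r @ context + b_p)
        p = softmax(v_p @ h + c_p)       # probability distribution over the vocabulary
        loss += -np.log(p[t])            # -log P(y_t | x)
    return loss

xs = [rng.standard_normal(d_in) for _ in range(3)]   # input sequence, m = 3
ys = [0, 2, 1, EOS]                                   # output sequence, n = 4, ending in EOS
print(decode_loss(encode(xs), ys))
```

Averaging this loss over all N training pairs gives the objective that is minimized when encoder and decoder are trained jointly.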
So in this case, one possible input sequence to train the model would be this sentence: Staubsauger sind laut, where each of the words is one item in our sequence, and at the end we have the special symbol for end of sequence. And then the corresponding y, so the corresponding output, would be: vacuum cleaners are noisy. These are the individual words in the English output sequence, and here again, in order to mark the end of the sequence, we have this EOS at the end. Now, for the model to learn how to translate the input sequences into the corresponding output sequences, we need to adapt the weights and biases in our model, and we do this by optimizing a given function. This function is the following. We have a specific goal for our training, where we intuitively want to minimize the mistakes done during this translation from input sequence to output sequence. So we want to have a minimal number of incorrectly predicted output items in the output sequence. What we want to do here is to minimize the following function. This function looks at the probability of predicting, given a particular input sequence, a particular output item, and this is yit, where i means this is the i-th sequence of outputs that is produced. So xi and yi are one input-output pair, and then yit is the t-th item in the output sequence yi. And we look at this probability, or more specifically, at the negative log probability, for each of these. So this is basically for one item in one output sequence. Now what we want to do is to sum this negative log probability over all the items in one output sequence, so T is the length of each output sequence. And we do not just want to do this for one sequence; we actually want to do this for all our input-output sequences, from i equals 1 to N, and then average over all of those, so we just divide by N. This formula basically is the objective function here, because this is what we want to minimize: the lower the result of this function, the
better our prediction is. So just to have on the slides what these different symbols mean here: T is the length of the output sequences, and this probability of yit given xi is the probability of a particular word yit given the input sequence xi. Now, once we have trained this kind of model, what we want to do next is to predict some output sequences for some given input sequences. So essentially, we want to use it for translation. In many practical applications, we do not just want to have one translation, let's say the most likely translation according to the model, but we want to see the K most likely translations. For example, this could be useful in case the model is wrong sometimes, because then you can look at the K most likely translations and hopefully one of them is the one that you are really interested in. Now, how can we do this? Because by default, what a model does is to predict one output item after another for a given input sequence. So we just get one sequence, and this is the most likely one, but we do not have a way to get the K most likely translations. The answer to this question is to use something called beam search. The basic idea is that we have, like, a light beam that tells us what are the most likely tokens or items in the output sequence, and we do not really just focus like a laser on the one most likely token, but we make the beam a little wider and look at different possible items in the output sequence. Specifically, for every item or word in the output sequence, we consider the K most likely alternatives. So let's say we are predicting the first item; then we look at the K most likely items for the first one, and then extend the partial sequences that we get this way in K ways at every time step. And at every time step, we keep only the K most likely partial sequences, by basically pruning away all the other possible sequences that we have already seen, so that at each time step we have K possible sequences. And this means we also have K possible
sequences at the very last time step, which then gives us the K most likely translations. So let me illustrate this idea using our running example of the translation from German to English. For the example, let's assume that K is equal to 2, which means we are interested in the two most likely translations. These different possible sequences that the model is considering, and that we are considering here in this beam search, can be represented as a tree. We start this tree at the very beginning, where we haven't predicted any output word yet, but basically ask the decoder for the first time: hey, what is the first output item that you want to predict for the given input sequence? And let's say for this example we get a couple of predictions. One is that the first output should be "vacuum", and let's say the probability for this prediction of "vacuum" is 20%. And let's say there's another prediction that says the first word should be "cleaners", and let's say the probability for this prediction is 18%. And then, of course, there will be many other predictions with lower probability. Now, because K is equal to 2, we do not really care about the others; we just focus on the top two predictions. Now, for each of those, we are asking the model: hey, given this first word, what do you think should be the next word? Let's assume that in the case of "vacuum" as the first word, we again get a couple of different predictions here, and again we do not care about those beyond the top two, because K is equal to 2. So we only look at the most likely predictions for the second word, and let's assume for the example that the most likely one happens to be "cleaners", predicted with a probability of 25%. And let's say the second most likely second word following "vacuum" is again "vacuum", with a probability of 5%. Now, this basically gives us two possible ways to continue the sequence if it started with "vacuum"; let's now also look at ways of
continuing the sequence if the sequence started with "cleaners". Here we would, for example, see that "are" is the most likely second word if the sequence starts with "cleaners"; let's assume that this is predicted with a probability of 15%. And let's say the second most likely option for the second word, if the sequence starts with "cleaners", is "vacuum", with a probability of 7%. Now, because at the end we only care about the K equals 2 most likely sequences, we will now prune this search space that we are basically expanding here, by multiplying the different probabilities with each other. So we will take the 20% of "vacuum" and multiply it with the 25% of "cleaners"; this gives us the probability of this being the right path in our beam search. And we do the same for all the other paths. Then what we find is that, out of the four options that we have on the table right now, two are less likely than the other two. So basically, "vacuum cleaners" and "cleaners are" are the two most likely sequences, given the tree of options that we have explored so far. What we will do now is to keep only these two and then keep expanding them further. So for "vacuum cleaners", we will ask the model: hey, what could be the third word if the first two words are "vacuum cleaners"? And then maybe it tells us "are" with some probability and "clean" with some probability, and some other words with lower probabilities. We also ask the model: hey, what would be the most likely third word if the sequence starts with "cleaners are"? And maybe here it says "noisy" would be the most likely way to finish this sequence, "clean" would be the second most likely, and then there are other options as well. And then again we do this pruning that we have just seen. So we would basically compute the different probabilities, which I have not given now for the third step, but I guess you get the idea. And let's say this would remove these second options in each case, which basically leaves us with "vacuum cleaners are" as one of the two most likely ways to
predict the output sequence, and "cleaners are noisy" as the second most likely. And then this of course goes on and on: for each of the remaining options, we always look at the two most likely ways to continue the sequence, and this goes on and on until eventually we reach the end-of-sequence symbol, which tells us that now the model thinks we are done. So what this basically gives us is a way to get the K most likely output sequences without really querying the model for all possible output sequences and computing the probability for each of them, because that would not really scale. Instead, we are pruning the search at each time step in order to just focus on the K most likely sequences at a given point. All right, so this brings us almost to the end of this first part of this module on sequence-to-sequence architectures. Just to make sure that everybody understood something, I have a little quiz, which basically gives you four sentences, some of which are correct and some of which are wrong. The task for you is to now stop the video for a moment, carefully read these sentences, and think about which of them are actually correct and which of them are wrong. All right, so let me show you the solution. The first sentence is correct: this context vector can indeed be a potential bottleneck that may prevent a network from effective learning. As an extreme case, suppose we just make this context vector of size one, which basically means there is one real-valued number that needs to summarize the entire input sequence, and this is unlikely to work for any complex translation task. The second sentence: the length of the input sequence must be the same across all instances of the training set. This is wrong, because the encoder is a recurrent neural network, and because recurrent neural networks can handle differently sized input sequences, the input sequences may have different sizes, and we use this end-of-sequence token to
tell the model when the end of the sequence is reached. The same is true for the output sequence: also here, the sequence length may vary across the different data points that we see, because we have this end-of-sequence symbol to allow the model to tell us when we have reached the end of the sequence. And then finally, the last sentence is true again: each instance in the training set must indeed contain two sequences, namely the input and the output. The reason simply is that we here do supervised learning, so we need to have examples of inputs and outputs that come in pairs, so that we can train our model. Cool, so this is the end of the first part of this module. I hope you now know what sequence-to-sequence models are, and what we'll do in the remaining two parts is to look at their applications for analyzing software. So thank you very much for listening, and see you in the next part.
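As a closing sketch, the beam search procedure described in this part can be written in a few lines of Python. The toy_model function below is a hand-written stand-in for querying a trained decoder: its probabilities mirror the vacuum-cleaner example from the slides, and the token "EOS" plays the role of the end-of-sequence symbol.

```python
import math

def beam_search(step_probs, k=2):
    """Keep only the k most likely partial sequences at every time step.

    step_probs(prefix) returns a dict {token: probability} for the next
    position; it stands in for querying the trained decoder. A sequence is
    finished once it ends with the "EOS" token."""
    beams = [([], 0.0)]                     # (partial sequence, log probability)
    finished = []
    while beams:
        candidates = []
        for seq, logp in beams:
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], logp + math.log(p)))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for seq, logp in candidates[:k]:    # prune: keep only the k best extensions
            (finished if seq[-1] == "EOS" else beams).append((seq, logp))
    return sorted(finished, key=lambda cand: -cand[1])

def toy_model(prefix):
    """Hypothetical next-token probabilities, mirroring the lecture's example."""
    table = {
        (): {"vacuum": 0.20, "cleaners": 0.18},
        ("vacuum",): {"cleaners": 0.25, "vacuum": 0.05},
        ("cleaners",): {"are": 0.15, "vacuum": 0.07},
        ("vacuum", "cleaners"): {"are": 0.90, "EOS": 0.10},
        ("cleaners", "are"): {"noisy": 0.60, "EOS": 0.40},
        ("vacuum", "cleaners", "are"): {"noisy": 0.90, "EOS": 0.10},
        ("cleaners", "are", "noisy"): {"EOS": 1.00},
        ("vacuum", "cleaners", "are", "noisy"): {"EOS": 1.00},
    }
    return table[tuple(prefix)]

for seq, logp in beam_search(toy_model, k=2):
    print(" ".join(seq), round(math.exp(logp), 4))
```

Running this prints the two most likely sequences: "vacuum cleaners are noisy EOS" with probability about 0.0405 (0.20 x 0.25 x 0.90 x 0.90) and "cleaners are noisy EOS" with about 0.0162, matching the pruning walked through in the example.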