Hi everybody, and welcome to Analyzing Software Using Deep Learning, a course at the University of Stuttgart in summer 2020. In the second module of the course we will look at recurrent neural networks, or RNNs for short, and how to use them for analyzing software. This module has three parts. The first focuses on recurrent neural networks and what they actually are; it is not really specific to analyzing software but gives the machine learning background we need to understand the rest. The second and third parts then show concrete applications of these recurrent neural networks: one is about code completion, that is, how to complete code that is missing some parts, and the other is about repairing a specific kind of error, namely syntax errors, with recurrent neural networks.

In the previous module we introduced some basics of neural networks and reasoned about them at the level of individual neurons. The first thing we will do today, and also in the rest of this course, is to abstract away from individual neurons and instead look at layers of neurons. So we go from neurons to layers. To see how this works, let's look one more time at a neural network in terms of individual neurons. For example, take a network with an input layer of three neurons, a hidden layer of two neurons, and an output layer of three neurons, where everything is connected to everything. The computation done at every individual neuron is the following: the output of the neuron is some function f, which could be any of the activation functions we have seen, applied to the weights of the incoming connections times the incoming values, plus some bias, i.e., output = f(w · x + b). Here the output and the bias b are scalars, for example real numbers, while w is the vector of weights of the incoming connections and x is the vector of incoming values; both could be vectors of n real numbers if there are n inputs coming into the neuron.

Now, instead of looking at individual neurons, we abstract them into layers of neurons, which I will represent using little rectangles. The same network we just looked at then consists of three layers: an input layer, a hidden layer, and an output layer. We can express the computation in terms of layers as well: for each layer, the output is f(W x + b), where f may again be an activation function. To reflect the fact that we no longer talk about individual neurons but about layers, the input x, the output, and the bias b are now all vectors, for example vectors in R^n, and the capital W is now a matrix of weights, for example in R^(m×n), where n is the number of inputs coming into the layer and m is the number of neurons in the layer.
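To make the layer view concrete, here is a minimal sketch in Python (not from the lecture; the names, shapes, and values are made up for illustration) of the computation f(W x + b) for a single layer, using numpy and a sigmoid activation:

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid activation: squashes each value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, x, b, f=sigmoid):
    # Output of a whole layer at once: f(W x + b).
    # W has shape (m, n): m neurons, each with n incoming weights.
    # x has shape (n,): the n inputs to the layer.
    # b has shape (m,): one bias per neuron.
    return f(W @ x + b)

# Example: a layer with m = 2 neurons and n = 3 inputs (made-up values).
W = np.array([[0.1, -0.2, 0.3],
              [0.4, 0.5, -0.6]])
x = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 0.1])
print(layer_forward(W, x, b))  # two outputs, one per neuron in the layer
```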
Based on this notion of layers, we can now have a look at recurrent neural networks, which are the focus of this module. What we have seen so far are feedforward networks, and feedforward networks are not recurrent networks. What we mostly want to look at in this module are recurrent networks, so what is the main difference between them?

A feedforward network basically looks like this: we have some input layer, some hidden layer, maybe more hidden layers in between, and eventually some output layer. Let's call the input layer x, the hidden layer h, and the output layer y, and let's also name the matrices that control how the outputs of one layer are transformed to give the outputs of the next layer: U between x and h, and V between h and y (a third matrix, W, will be introduced in a second). The feedforward network just processes one input at a time.

In contrast, a recurrent network looks as follows. It also has an input layer, a hidden layer, maybe more of them, and eventually an output layer, again called x, h, and y, and the matrices U and V between the layers play the same role as before. What is new is a connection from the hidden layer to the next time step of this same hidden layer. I'll use a little rectangle on the arrow as a symbol to denote a connection that goes over time; we'll see in a second what "over time" really means. This connection over time is controlled by yet another weight matrix, W, which tells us how the value of h at some time step t depends on the value of this same layer h at the previous time step t-1. To give my arrows a little more meaning: a normal arrow is just a normal function, and an arrow with the little black rectangle on top is a function that also has a delay of a single time step.

Why do we care about this delay? It is important for two reasons. One is that it is very useful for representing sequences of inputs, and also sequences of outputs: the basic idea is that every element of such a sequence is represented as one time step, and by going through time we can process the entire sequence. The second reason is that the hidden layer, which takes some information from the past, is able to store information about previous inputs: given a sequence of elements, the hidden layer essentially stores information about the past elements we have already seen, so that all of this information can be used to determine the output.
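As a minimal sketch of this difference (again not from the lecture; the names, shapes, and the choice of tanh as the activation are assumptions for illustration), the feedforward hidden layer computes its output from the current input alone, while the recurrent hidden layer additionally receives its own previous output through W:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                      # made-up dimensions
U = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights (the delayed connection)
b = np.zeros(n_hidden)

def feedforward_hidden(x):
    # Feedforward: h depends only on the current input x.
    return np.tanh(U @ x + b)

def recurrent_hidden(x_t, h_prev):
    # Recurrent: h at time step t additionally depends on h at t-1, via W.
    return np.tanh(U @ x_t + W @ h_prev + b)

x = np.ones(n_in)
print(feedforward_hidden(x))                    # same input, always the same output
print(recurrent_hidden(x, np.zeros(n_hidden)))  # output also depends on the past
```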
Let's make all of this a little more concrete with an example: predicting the next word in a sentence. As a concrete sentence, say we have one that starts with "ASDL", the name of this course, followed by "is the best", and then the missing word comes. As you may be able to guess, the word we would like to see here is "course". Now let's look at how we could get this prediction from a feedforward network and from a recurrent network.

With a feedforward network, we could feed one word at a time into the network, which is then supposed to predict the next word for us. For example, we would start with "ASDL" and hope that the network predicts "is", and then we do this for all following words; at the end we query it with "best" and hope it gives us the answer "course". This is really hard for the model, because it sees only one word at a time: given the word "best", it is supposed to answer "course", but it does not know the rest of the sentence, so it is very hard to be right here.

In contrast, a recurrent network sees all the words that have already occurred, i.e., the beginning of the sentence, and uses this to predict the rest of the sentence. At time step 1 we provide the first word to the model, which looks very similar to what we do for the feedforward network. The difference is that at later time steps we still have the information from the words seen at the beginning, because of the recurrent layer; that is the important feature of a recurrent network. When we later, at time step 4, query the model with the word "best", it also sees the information from the past and is hopefully able to predict "course", simply because the hidden layer stores information about the beginning of the sentence. The recurrent connection remembers the beginning of the sentence.

Now, what does it really mean to remember something? Let's take a deeper look by unfolding the computational graph, which means looking at what happens in the model during the different time steps, instead of just having this recurrent little arrow that does not show us what is going on under the hood. The not-yet-unfolded version of the model, drawn slightly differently, looks like this: we have the input layer, the hidden layer, and the output layer, connected just as before, only turned on the side, with the matrices that control how the layers are connected. The unfolding now looks at individual time steps. The past, which I abbreviate with "...", gives us something (we'll see what in a second), and then for every time step we have some input, some hidden state, and some output. At time step t-1, the input x(t-1) is given to the hidden layer at t-1, which produces the output y(t-1). At the next time step, the output of h(t-1) is taken as an input to the hidden layer at time step t, which of course also gets the next input x(t), and produces the output y(t). This goes on and on for the different time steps, basically until we reach the end of the input sequence.

Back to our concrete example: at t-1 the input might be "is", and the model may be able to predict that the output should be "the". We take this output predicted by the model and feed it into the next time step. Now the model knows that "the" is the next input, but it also knows something about the past because of the connection from h(t-1), and combining these two pieces of information it may be able to predict "best". Then we do the same again: we take the predicted output "best" and feed it into the model once more.
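To make the unfolding tangible, here is a minimal, hypothetical sketch of the loop over time steps; step and readout are made-up placeholders for the hidden-layer update and the output computation (a concrete form with tanh and softmax follows in a moment), and the point is only how the hidden state h is carried from one time step to the next:

```python
import numpy as np

def step(h_prev, x_t):
    # Placeholder for the hidden-layer update; a concrete version
    # (with tanh and weight matrices) is shown below.
    return 0.5 * h_prev + 0.5 * x_t

def readout(h_t):
    # Placeholder for computing the output y(t) from h(t).
    return h_t.sum()

# One made-up input vector per time step ("one word per time step").
sequence = [np.array([1.0, 0.0]),
            np.array([0.0, 1.0]),
            np.array([1.0, 1.0])]

h = np.zeros(2)            # initial hidden state: no past yet
for t, x_t in enumerate(sequence):
    h = step(h, x_t)       # h now summarizes the inputs up to time t
    y = readout(h)         # prediction based on everything seen so far
    print(f"t={t}: y={y:.3f}")
```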
Through the connection from h(t), the model again receives all the information from the past, and is hopefully able to predict "course" as the missing word.

Let's also have a look at how this works mathematically. The output of h at some time step t is a function of the state of h at t-1 and of the input x at time step t, i.e., h(t) = f(h(t-1), x(t)). The output of the whole model at time step t is then a function of h(t), i.e., y(t) = f(h(t)). To make this more concrete, let's look at how this is typically implemented. For example, h(t) could be computed as the hyperbolic tangent (tanh) of the state h(t-1) multiplied by the weight matrix W that we also see in the folded picture on the left, plus the input x(t) multiplied by the other weight matrix U, plus some bias: h(t) = tanh(W · h(t-1) + U · x(t) + b). The hyperbolic tangent is just yet another activation function. It looks similar to the sigmoid function we have seen earlier and has the same s-like shape, but it passes through the origin and can take negative as well as positive values; its outputs range from -1 to 1.

To give a concrete example of how the function for y(t) could look: one option people might use to predict words out of a given vocabulary, for example to predict the next word in a sentence, is y(t) = softmax(V · h(t) + c), where V is the remaining weight matrix from the picture and c is some other bias, named differently to disambiguate it from the bias b of the hidden layer.

So what is this softmax function, and why do we want to use it to compute the output y(t)? What softmax does is essentially give us a vector that we can interpret as a probability distribution. It takes a vector of k real-valued numbers and squashes them such that they sum up to one and such that each value is in the range between zero and one, so the result looks like a probability distribution. If we interpret the output of the softmax function as a probability distribution, we can, for example, read off how likely each of the different words that might come next is, according to the model. Softmax is defined as follows: for each element y_i of the vector, we compute e^(y_i) divided by the sum of all the e^(y_j) values, i.e., softmax(y)_i = e^(y_i) / Σ_j e^(y_j). For example, given a vector that does not look like a probability distribution at all, softmax turns it into a vector that preserves the relative importance of the values (the value 4 in the middle of the slide's example is the largest before, and the corresponding output is still the largest afterwards), but that now does look like a probability distribution: the sum of all the values is one, and each value is between zero and one.
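Putting the pieces together, here is a minimal sketch of one time step of such an RNN, directly following the two formulas above; the dimensions and the random weights are made up for illustration, and in practice the weights would of course be learned during training:

```python
import numpy as np

def softmax(z):
    # Turns a vector of real numbers into a probability distribution:
    # softmax(z)_i = e^(z_i) / sum_j e^(z_j).
    e = np.exp(z - z.max())  # subtracting the max is a common numerical-stability trick
    return e / e.sum()

def rnn_step(x_t, h_prev, U, W, V, b, c):
    # h(t) = tanh(W h(t-1) + U x(t) + b): the new hidden state,
    # combining the past (h_prev) with the current input (x_t).
    h_t = np.tanh(W @ h_prev + U @ x_t + b)
    # y(t) = softmax(V h(t) + c): a probability distribution over the vocabulary.
    y_t = softmax(V @ h_t + c)
    return h_t, y_t

# Worked softmax example with made-up numbers: the middle value (4) is the
# largest before and after, and the outputs sum to one.
print(softmax(np.array([1.0, 4.0, 2.0])))  # approx. [0.042, 0.844, 0.114]

# Tiny usage example with made-up dimensions (a "vocabulary" of 5 words).
rng = np.random.default_rng(1)
n_in, n_hidden, n_vocab = 3, 4, 5
U = rng.normal(size=(n_hidden, n_in))
W = rng.normal(size=(n_hidden, n_hidden))
V = rng.normal(size=(n_vocab, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_vocab)
h, y = rnn_step(np.ones(n_in), np.zeros(n_hidden), U, W, V, b, c)
print(y, y.sum())  # a distribution over the 5 words; the sum is 1.0
```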
To double-check that this idea is clear, let's have a little quiz: you are given a few vectors, and the question is which of these vectors could be the output of the softmax function. I invite you to pause the video and think independently about these four vectors; once you think you know which of them could be the outcome of the softmax function, resume the video.

Here is the solution. The first one is clearly not a possible output of the softmax function: it is all zeros, so it does not look like a probability distribution, simply because its values do not sum up to one. The second one looks fine: all the values are in the zero-to-one range, and their sum is one. The same holds for the third one, which is a somewhat strange probability distribution, because it gives all the probability to the second element, but its sum is one, so that is fine. The fourth one looks fine at first, because all the values are between zero and one, but the sum is not equal to one; therefore it is not a legal probability distribution and hence not a possible outcome of the softmax function.

You now have an idea of what these RNNs, these recurrent neural networks, are. Before we look into applications for analyzing software, let's look at some applications in other domains where RNNs are useful. Essentially, you can use them for any task where the input, and maybe also the output, is a sequence of something. To give a few examples: one concrete application is unsegmented connected handwriting recognition, where the handwriting is not yet split into individual characters; the model gets a sequence of pixels that represent the different chunks of the handwriting and is supposed to predict the actual characters or digits. Another example is machine translation of natural languages, where a sequence of words in one language is given as input and the model should predict a sequence of words in another language, for example to translate from English to German. Yet another example is video classification by frames, where the sequence is the sequence of frames that make up the video and we want to classify the video, for example to say whether it is about cats and dogs or about something else. Another example is speech recognition, because speech naturally is a sequence: it is a sequence of audio that comes in, and this sequence can also be fed into an RNN to recognize what a person is actually saying. As a final example, you can also use RNNs for sentiment analysis of Twitter messages, where you want to find out whether the person writing a message is angry or happy or whatever; here the sequence would be the sequence of words, or maybe characters, that make up the Twitter message.

This already brings us to the end of the first part of this module on using recurrent networks for analyzing software. You now hopefully know, or at least have an idea of, what a recurrent neural network is, and in the second and third parts we will see how to use these recurrent neural networks for two applications in software, namely code completion and automated program repair. Thanks for listening, and see you next time.