Yeah, so we talked about pseudo labeling a couple of weeks ago. It's a way of dealing with semi-supervised learning. Remember how in the State Farm competition we had far more unlabeled images in the test set than we had in the training set? The question was: how do we take advantage of knowing something about the structure of that data even though we don't have labels for it? And we learnt about this crazy technique called pseudo labeling — or a combination of pseudo labeling and knowledge distillation — which is where you predict the outputs of the test set and then act as if those outputs were true labels, adding them in to your training. The reason I wasn't able to actually implement that and see how it works was that we needed a way of combining two different sets of batches. In particular, the advice I saw from Geoffrey Hinton when he wrote about pseudo labeling is that you want something like one in three or one in four of your training data to come from the pseudo-labeled data, and the rest to come from your real data.

The good news is I built that thing, and it was ridiculously easy — this is the entire code. I called it the MixIterator, and it will be in our utils module from tomorrow. All it does is this: you create whatever generators of batches you like, then you pass an array of those iterators to this constructor, and every time the Keras system calls next on it, it grabs the next batch from each of those sets of batches and concatenates them all together.

What that means in practice is that I tried doing pseudo labeling on MNIST, for example, because remember on MNIST we already had a pretty close to state-of-the-art result — just to remind ourselves, 99.69%. So I thought, okay, can we improve it any more if we use pseudo labeling on the test set? To do so, you grab your training batches as usual, using data augmentation if you want, whatever else — so here are my training batches — and then you create your pseudo batches by saying: my data is my test set, and my labels are the predictions I calculated back up here. So now this is a second set of batches, my pseudo batches, and passing an array of those two things to the MixIterator creates a new batch generator which is going to give us a few images from here and a few images from here. How many? However many you asked for. In this case I was getting 64 from my training set and 64 divided by 4 from my test set — come to think of it, that's probably less than recommended; it probably should have been divided by 3.

Now I can use that just like any other generator: I call model.fit_generator and pass in the thing I just created. What it's going to do is create a bunch of batches, each of which will be 64 items from my regular training set and a quarter of that number from my pseudo-labeled set. And lo and behold, it gave me a slightly better score. There's only so much better we can do at this point, but it took us up to 99.72%. It's worth mentioning that every 0.01% at this point is just one image, so we're really on the edges here, but this is getting even closer to the state of the art despite the fact that we're not doing any handwriting-specific techniques.
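To give a concrete picture of the idea, here is a minimal sketch of what a MixIterator along these lines can look like — this is illustrative rather than the exact implementation in the course utils, and it assumes each component generator yields (x, y) batch tuples:

```python
import numpy as np

class MixIterator(object):
    """Draw one batch from each of several generators and concatenate them
    into a single combined batch (illustrative sketch, not the exact
    implementation from the course utils)."""
    def __init__(self, iters):
        self.iters = iters

    def __iter__(self):
        return self

    def __next__(self):
        batches = [next(it) for it in self.iters]
        xs = np.concatenate([b[0] for b in batches])
        ys = np.concatenate([b[1] for b in batches])
        return xs, ys

    next = __next__   # so it also behaves as a Python 2 style iterator
```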
I also trained on the fish dataset, and I realized at that point that this allows us to do something else which is pretty neat. Normally, when we train on the training set and set aside a validation set, if we then want to submit to Kaggle we've only trained on a subset of the data they gave us — we didn't train on the validation set as well — which is not great. So what you can actually do is send three sets of batches to the MixIterator: your regular training batches, your pseudo-labeled test batches, and — if you think about it — some validation batches using the true labels from the validation set. This is something you'd do right at the end, once you've said, okay, this is a model I'm happy with: you can fine-tune it a bit using some of the real validation data. You can see here that out of my batch size of 64 I'm taking 44 from the training set, 4 from the validation set, and 16 from the pseudo-labeled test set. Again this worked pretty well: it got me from about 110th to about 60th on the leaderboard.

Question: couldn't you do this with sample weights? Yes, you can use sample weights, but you would still have to manually construct the consolidated dataset, so this is a more convenient way where you don't have to append it all together and deal with all of that. I will mention that the way I'm doing it seems a little slow. There are some obvious ways I could speed it up — I'm not quite sure why it's slow, but it might be because this concatenation each time has to allocate new memory, and that takes a while. But it's good enough and seems to do the job, so I'm pleased that we now have a convenient way to do pseudo labeling in Keras, and it seems to do a pretty good job.
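As a hedged sketch of that end-of-competition trick (the generator and variable names here are hypothetical), the usage looks something like this:

```python
# assumed names: trn_batches, val_batches and pseudo_batches are generators
# yielding (x, y) batch tuples, sized e.g. 44, 4 and 16 items per batch
mi = MixIterator([trn_batches, val_batches, pseudo_batches])

# n_per_epoch is however many combined samples you want to see per epoch
model.fit_generator(mi, samples_per_epoch=n_per_epoch, nb_epoch=8,
                    validation_data=(val_data, val_labels))
```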
So the other thing I wanted to talk about before we move on to the new material today is embeddings, because I've had lots of questions about them, and I think it's pretty clear that for at least some of you some additional explanation would be helpful. I want to start by reminding you that when I introduced embeddings, the data we looked at was in this cross-tab form, and in cross-tab form it's very easy to visualize what embeddings look like: for movie number 27 and user ID number 14, here is that movie ID's embedding right here, here is that user ID's embedding right here, and here is the product of the two right here. So that was all pretty straightforward. And then all we had to do to optimize our embeddings was use the gradient descent solver that's built into Microsoft Excel, called Solver: we told it what our objective is — this cell — said to minimize it, and said to do so by changing these sets of cells.

The data we're given in the MovieLens dataset, however, requires some manipulation to get into cross-tab form. We're actually given it in this other form, and we wouldn't want to create a cross-tab with all of this data, because it would be far too big — every single user times every single movie — and it would also be very inconvenient. So that's not how Keras works; Keras uses the data in exactly this format. Let me show you how that works and what an embedding is really doing.

Here is the exact same thing, but shown using the data in the format Keras uses. This is our input data: every rating is a row, with a user ID, a movie ID and a rating. And this is what an embedding matrix looks like for 15 users: these are the user IDs, and for each one — here is user ID 14's embedding, this is 29's, this is 72's — at this stage they're just random numbers, because I initialized them randomly. This thing is called an embedding matrix. And here is the movie embedding matrix: the embedding for movie 27 is these five numbers.

So what happens when we look at the row "user ID 14, movie ID 417, rating 2"? The first thing is that we have to find user ID 14. Here it is — user ID 14 is the first thing in this array, so its index is 1 — and so here is the first row from the user embedding matrix. Similarly for movie ID 417: you have to scroll down, here is movie ID 417, and it's the 14th row of the table, so we index into the table and grab the 14th row. Then to calculate the dot product, we simply take the dot product of the user embedding with the movie embedding; to calculate the loss we take the rating, subtract the prediction, and square it; and to get the total loss we add all of those up and take the square root. The cells with the orange background are the ones we want our SGD solver to change in order to minimize this cell here, and the other bold cells are all calculated from them.

So when I said last week that an embedding is simply looking up an array by an index, you can see why: it literally takes an index, looks it up in an array, and returns that row. That's literally all it's doing. You might want to convince yourself during the week, if you haven't already, that this is identical to taking a one-hot encoded matrix and multiplying it by an embedding matrix — that's identical to doing this kind of lookup. And we can do exactly the same optimization here: Data → Solver, set this cell to a minimum by changing these cells, and if I press Solve, Excel will go away and try to improve our objective. You can see the error decreasing — it gets to about two and a half.
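If you want to convince yourself of that equivalence in code, here is a tiny NumPy check (all the names and numbers here are made up for illustration):

```python
import numpy as np

n_users, n_factors = 15, 5
user_emb = np.random.randn(n_users, n_factors)   # random embedding matrix

user_pos = 3                          # position of some user ID in our user list
by_lookup = user_emb[user_pos]        # "look up the row by index"

one_hot = np.zeros(n_users)
one_hot[user_pos] = 1
by_matmul = np.dot(one_hot, user_emb) # one-hot vector times embedding matrix

assert np.allclose(by_lookup, by_matmul)   # both give exactly the same row
```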
What it's doing is using gradient descent to find ways to increase or decrease all of these numbers so that the RMSE becomes as low as possible. And that's literally all that's going on in our Keras example with the dot product. This line, where we create an embedding for a user, is just saying: create something where I can look up a user ID and find its row. This does the same for a movie: look up the movie ID and find its row. This one says take the dot product once you've found the two. And this one says train a model that takes in a user ID and a movie ID, tries to predict the rating, and uses SGD to make it better and better. Excel is still chugging along — I'll cancel it — and you can see it's got the root mean squared error down to about 0.4. So, for example, the first one it predicts as 3 when it's actually 2; 4.5 when it's actually 4; 4.6 when it's actually 5; and so forth. So you get the idea of how it works.

Word embeddings work exactly the same way. Inspired by one of the students who talked about this during the week, I grabbed the text of Green Eggs and Ham — "I am Sam. I am Sam. Sam I am…" and so forth — and I turned the poem into a matrix. The way I did that was to take every unique word in the poem; here is the ID of each of those words, which is just its position in the list. Then I randomly generated an embedding matrix — I could equally well have used downloaded GloVe embeddings instead. Then, for each word, I just look it up in the list to find what number it is — "I" is number 8 — and here is the eighth row of the embedding matrix. So you can see that we started with a poem and turned it into a matrix of floats, and the reason we do this is that our machine learning tools want a matrix of floats, not a poem. All of the questions about whether it matters what the word IDs are — you can see it doesn't matter at all. All we do is look them up in this matrix and return the floats, and once we've done that we never use the IDs again; we just use the matrix of floats. So that's what embeddings are. I hope that's helpful — feel free to ask questions now or at any other time, because we're going to be using embeddings throughout this class.
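A quick sketch of that poem-to-matrix step (the variable names are illustrative, and the embedding here is random rather than pre-trained):

```python
import numpy as np

poem = "I am Sam. I am Sam. Sam I am."          # stand-in for the full text
words = poem.split()

vocab = sorted(set(words))
word_ids = {w: i for i, w in enumerate(vocab)}  # word -> integer ID

n_fac = 5
emb = np.random.randn(len(vocab), n_fac)        # random embedding matrix

# the poem as a matrix of floats: one embedding row looked up per word
poem_vecs = np.stack([emb[word_ids[w]] for w in words])
print(poem_vecs.shape)                          # (number of words, 5)
```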
Okay, so let's get back to recurrent neural networks. To remind you, we talked about the purpose of recurrent neural networks as being really all about memory. If we're going to handle something like recognizing a comment start and a comment end, and keep track of the fact that we're inside a comment for all of that time, so we can model this kind of structured language data, we really need memory: it lets us handle long-term dependencies, and it provides a stateful representation. So in general we'll be looking at things that particularly need those three things, and it's also somewhat helpful when you have variable-length sequences.

A couple of questions about embeddings. First: how does the size of my embedding depend on the number of unique words? Mapping Green Eggs and Ham to five real numbers seems sufficient, but it wouldn't be for all of J.R.R. Tolkien. Your choice of how big to make your embedding matrix — that is, how many latent factors to create — is one of those architectural decisions we don't really have a formula for. My best suggestions: one is to read the Word2vec paper, which introduced a lot of this, or at least took it a lot further than it had gone, and look at the difference between 50-, 100-, 200-, 300- and 600-dimensional embeddings, and see what levels of accuracy those different sizes produced — the authors of that paper provide that information. That's a quick shortcut, because other people have already experimented and shared the results. The other is to do your own experiments and try a few different sizes. It's not really about the length of the word list; it's about the complexity of the language, or other problem, that you're trying to solve. That's problem-dependent and requires both intuition — developed from reading and experimenting — and your own experiments.

Second: what would be the range of root mean squared error values at which you'd say a model is good? That's another model-specific issue. Root mean squared error is very interpretable: it's basically how far out you are on average. We were finding we were getting ratings within about 0.4 — obviously this mini Excel dataset is too small to make intelligent comments about, but say it were bigger — and being within 0.4 on average sounds like it's probably good enough to be useful for helping people find movies they might like. But there's really no one answer. I actually wrote a whole paper about this — let's see if I can find it. If you look up "Designing Great Data Products" along with my name, you'll find it. It's based mainly on ten years of work I did at a company I created called Optimal Decisions Group, which was all about how to use predictive modeling not just to make predictions but to optimize actions, and this whole paper is about that. In the end it's really about coming up with a way to measure the benefit to your organization or project of getting that extra 0.1% of accuracy, and there are some suggestions in the paper on how to do that.

Okay, so we looked at a visual vocabulary that we developed for writing down neural nets, where any colored box represents a matrix of activations. That's a really important point to remember: a colored box represents a matrix of activations. It could be the input matrix, the output matrix, or a matrix that comes from taking an input and putting it through something like a matrix product. The rectangular boxes represent inputs, the circular ones represent hidden (intermediate) activations, and the triangles represent outputs. Arrows, very importantly, represent what we'll call layer operations: a layer operation is anything you do to one colored box to create another colored box. In general it almost always involves some kind of linear function, like a matrix product or a convolution, and it will probably also include some kind of activation function, like a relu or a softmax. The details of the activation functions are pretty unimportant, though.
So I started removing them from the pictures as we looked at more complex models, and then, since the layer operations are pretty consistent and we generally know what they are, I started removing those as well, just to keep things simple. We're simplifying these diagrams to keep only the main pieces, and as we did so we could start to create more complex diagrams. We talked about a kind of language model where we take inputs of character number one and character number two and try to predict character number three. One way to do that would be to create a deep neural network with two layers: the character-one input goes through a layer operation to create our first fully connected layer; that goes through another layer operation to create a second fully connected layer; and we also add in our second character input, going through its own fully connected layer at this point. And recall the last important piece of the vocabulary: two arrows going into a single shape means that we add the results of those two layer operations together, element-wise. So this was the little visual vocabulary we set up last week, and I've kept track of it down here in case you forget what the shapes mean.

Now I want to point out something really interesting, which is that there are three kinds of layer operations going on. Here I'm expanding this out: predicting the fourth character of a sequence using characters one, two and three, with exactly the same method as on the previous slide. There are layer operations that turn a character input into hidden activations — there's one here and one here. There are layer operations that turn one set of hidden-layer activations into a new set of hidden-layer activations. And there's an operation that takes hidden activations and turns them into output activations. You can see I've colored them in, and here's a little legend: green is input-to-hidden, blue is hidden-to-output, and orange is hidden-to-hidden.

My claim is that there are a couple of things to note. The first is that the weight matrices for each of these differently colored arrows have matching dimensions: all of the green ones have the same dimensions, because they take an input of size vocab_size and turn it into hidden activations of size n_hidden. So all the green arrows represent weight matrices of the same dimensionality, and ditto for the orange arrows. But I'd go further than that and say the green arrows represent semantically the same thing: they're all saying, how do you take a character and convert it into hidden state? The orange arrows are all saying, how do you take hidden state from a previous character and turn it into hidden state for a new character? And the blue one is saying, how do you take hidden state and turn it into an output? When you look at it that way, all of these circles are basically the same thing: they just represent this hidden state at different points in time. And I'm going to use the word "time" here in a fairly general way.
I'm not really talking about clock time; I'm just talking about the sequence in which we present additional pieces of information to this model: we first present the first character, then the second, then the third. So we could redraw this whole thing in a simpler, more general way. Before we do, I'm going to show you how to build this model in Keras, and in doing so we'll learn a bit more about the functional API, which hopefully you'll find pretty interesting and useful.

To do that we're going to use this corpus of the collected works of Nietzsche. We load in those works, find all of the unique characters — there are 86 of them, and here they are, joined up together — and then we create a mapping from each character to the index at which it appears in this list, and a mapping from the index back to the character. This is basically creating the equivalent of those lookup tables from before, except using characters rather than words. That lets us take the text of Nietzsche and convert it into a list of numbers, where each number is the position at which that character appears in the list. Here are the first ten. So we've converted our whole text into the equivalent indices — that's called idx — and at any point we can turn it back into text by taking those indices and looking them up in our index-to-character mapping; here you can see we turn it back into the start of the text again. So the data we're working with is a list of character IDs representing the collected works of Nietzsche.

We're going to build a model which attempts to predict the fourth character from the previous three. To do that, we go through our whole list of indices from zero up to the end minus three, and we create a list of the first character of each group of three, a list of the second character of each group, and a list of the third — and then the character we want to predict, the one that follows each group. We can turn these into NumPy arrays just by stacking them up, so now we have our inputs for the first, second and third characters of each piece of the collected works, and our labels, the y's, are simply the following characters. Here you can see them: if we take x1, x2, x3 and y and look at the first element of each, this is the first character of the text, the second character, the third, and the fourth. So we'll try to predict this character from these three, then the next one from the next three, and so forth. That's our data format — you can see we've got about 200,000 of these groups for each of x1 through x3 and for y. And as per usual, we're going to turn the characters into embeddings, by creating an embedding matrix.
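Here is a minimal sketch of that data preparation (illustrative names; the exact stride and bookkeeping in the lesson's notebook may differ slightly):

```python
import numpy as np

chars = sorted(set(text))                 # `text` is the Nietzsche corpus as one string
vocab_size = len(chars)                   # 86 in the lesson
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

idx = [char_indices[c] for c in text]     # the whole text as character indices

# first, second and third characters of each group of three, plus the
# following character as the label
x1 = np.stack([idx[i]   for i in range(0, len(idx)-3, 3)])
x2 = np.stack([idx[i+1] for i in range(0, len(idx)-3, 3)])
x3 = np.stack([idx[i+2] for i in range(0, len(idx)-3, 3)])
y  = np.stack([idx[i+3] for i in range(0, len(idx)-3, 3)])
```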
I'll mention that this isn't the normal approach — I haven't actually seen anybody else do it; most people just treat the characters as one-hot encodings. For example, the most widely read blog post about char-RNNs, the one that really made them popular, is Andrej Karpathy's — it's quite fantastic — and you can see that in his version he shows the characters as one-hot encoded. We're not going to do that; we're going to turn them into embeddings. I think it makes a lot of sense: capital A and lowercase a have similarities that an embedding can capture; different kinds of things that have to be opened and closed, like parentheses and quotes, have characteristics in common that could be captured in an embedding. There are all kinds of things we'd expect an embedding to capture. So my hypothesis was that an embedding is going to do a better job than a one-hot encoding, and in my experiments over the last couple of weeks that generally seems to be true.

So we take each of characters one through three and turn them into embeddings, by first creating an input layer for each, then creating an embedding layer on that input, and returning the input layer along with the flattened version of the embedding output. This gives us the input and output of each of our three embedding layers, for our three input characters. Those are our inputs.

We now have to decide how many activations we want, and that's something we just pick — I decided to go with 256, which seemed reasonable and worked okay. Then we have to construct something where each of our green arrows ends up with the same weight matrix, and it turns out Keras makes this really easy with the functional API. When you call Dense like this, it creates a layer with a specific weight matrix. Notice I haven't passed in anything to say what it's connected to, so it's not part of a model yet; it's just saying, I'm going to have a dense layer which creates 256 activations, and I'm going to call it dense_in. It doesn't actually do anything until I connect it to something, which I do here: character one's hidden state comes from taking character number one — the output of our first embedding — and putting it through this dense_in layer. So this is the thing that creates our first circle: the embedding created the output of our first rectangle, this creates our first circle, and dense_in is the green arrow. Now, to create the next set of activations we need the orange arrow, and since the orange arrow is a different weight matrix from the green arrow, we have to create a new dense layer. So here it is — a new dense layer, again with n_hidden outputs. This is a whole separate weight matrix that Keras is going to keep track of.
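In code, that looks roughly like this (a sketch using the Keras 1-style functional API the lesson uses; n_fac is the embedding size):

```python
from keras.layers import Input, Embedding, Dense, Flatten, merge
from keras.models import Model

n_fac, n_hidden = 42, 256

def embedding_input(name, n_in, n_out):
    # returns the Input layer plus the flattened embedding built on top of it
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)

# shared layers: the green (input-to-hidden) and orange (hidden-to-hidden) arrows
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu')

c1_hidden = dense_in(c1)   # hidden state after seeing character 1
```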
So now I can create my character-two hidden state: I take my character-two embedding and put it through my green arrow, dense_in; I take the output of character one's hidden state and run it through my orange arrow, which we called dense_hidden; and then we merge the two together — merge by default does a sum. So this is adding together those two layer-operation outputs, which gives us this circle. The third character's hidden state is done in exactly the same way: we take the third character's embedding, run it through our green arrow, take the result of our previous hidden activations, run them through our orange arrow, and merge the two.

Question: is the first output the size of the latent factors in the embedding? Yes — the size of the latent embeddings is what we defined when we created the embeddings up here: n_fac, which we set to 42. So c1, c2 and c3 represent the result of putting each character through this embedding and getting out 42 latent factors, and those are the things we put into our green arrow.

After doing this three times we have c3_hidden, which is this third circle here. We now need a new set of weights — another dense layer, the blue arrow — so we'll call it dense_out, and it needs to create an output of size 86, the vocab size, so that it can match against the one-hot encoded list of possible characters. Now that we've got this blue arrow, we can apply it to our final hidden state to get our output. In Keras, all we need to do now is call Model, passing in the three inputs — each time we created an embedding we returned the input layer, so c1_in, c2_in and c3_in — and passing in our output. That's our model. We can compile it, set a learning rate, fit it, and as you can see its loss gradually decreases.

We can then test it very easily by creating a little function to which we pass three letters. It takes those three letters, looks them up to find their character indices, turns each of those into a NumPy array, calls model.predict on the three arrays — that gives us 86 outputs — and then takes the argmax to find which index into those 86 is the highest; that's the character it predicts. If we pass in 'phi' it thinks 'l' is most likely next; for ' th' it thinks 'e'; for ' an' it thinks 'd'. So it seems to be doing a pretty reasonable job of taking three characters and returning a sensible fourth. It's not the world's most powerful model, but it's a good example of how we can construct pretty arbitrary architectures with Keras and then let SGD do the work.
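Continuing the sketch from above (same caveats: this follows the structure described, with illustrative details):

```python
import numpy as np

# hidden state after characters 2 and 3: green arrow on the new character
# plus orange arrow on the previous hidden state, summed (merge defaults to 'sum')
c2_hidden = merge([dense_in(c2), dense_hidden(c1_hidden)])
c3_hidden = merge([dense_in(c3), dense_hidden(c2_hidden)])

dense_out = Dense(vocab_size, activation='softmax')   # the blue arrow
c4_out = dense_out(c3_hidden)

model = Model([c1_in, c2_in, c3_in], c4_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

def get_next(inp):
    # look up the three characters, predict, and return the most likely next one
    arrs = [np.array([char_indices[c]]) for c in inp]
    p = model.predict(arrs)
    return chars[np.argmax(p)]

# get_next('phi'), get_next(' th'), get_next(' an')
```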
Question: how would this model consider the context in which we're trying to predict the next character? It knows nothing about context — all it has at any point is the previous three characters — so it's not a great model. We're going to improve it, though; we have to start somewhere. As for doing predictions on real data with proper context: we're getting there.

So, in order to answer that question, let's build this up a little further, and rather than trying to predict character four from the previous three characters, let's try to predict character n from the previous n−1 characters. Since all of these circles basically mean the same thing — the hidden state at that point — and since all of these orange arrows are literally the same thing, a dense layer with exactly the same weight matrix, let's stack all of the circles on top of each other, which means the orange arrows can become a single arrow pointing back into itself. And this is the definition of a recurrent neural network. When we draw it in this form, we say we're looking at it in its recurrent form; when we draw it the other way, we're looking at its unrolled (or unfolded) form. Both are very common. The recurrent form is obviously neater, so for quickly sketching out an RNN architecture it's much more convenient, but the unrolled form is really important too. For example, when Keras uses TensorFlow as a backend it always unrolls the network in this way in order to compute it, which takes up a lot more memory; one nice thing about using the Theano backend with Keras is that it can implement it directly as this kind of loop, and that's what we'll be doing shortly.

In general, though, it's the same idea: character one's input comes in, goes through the first green arrow and the first orange arrow, and from then on we just say: take the second character, repeat; take the third character, repeat. At each time step a new character goes through its layer operation, and the previous hidden state goes through its own layer operation, and at the very end we put the result through a different layer operation — the blue arrow — to get an output. I'm going to show you this in Keras now.

Question: does every fully connected layer have to have the same activation function? In general, no. In all the models we've seen so far, we've constructed them in a way where you can use anything you like as the activation function. That said, I haven't seen any examples of successful architectures that mix activation functions, other than at the output layer, which will pretty much always be a softmax for classification. It's not that it couldn't be a good idea; it's just not something anybody's done very successfully so far.
I will mention something important about activation functions, though, which is that you can use pretty much any non-linear function as an activation function and get pretty reasonable results. There are some pretty cool papers where people have tried all kinds of weird activation functions, and they pretty much all work, so it's not something to get hung up about. It's more that certain activation functions train more quickly and more resiliently — in particular, relu and its variations tend to work particularly well.

Okay, so let's implement this. We're going to use a very similar approach to before, and we're going to create our first RNN, from scratch, using nothing but standard Keras dense layers. In this case we can't just create c1, c2 and c3 as the inputs; we're going to create an array of inputs, and we have to decide what n to use. For this one I've decided to use eight — cs is the number of characters — so I'm going to use eight characters to predict the ninth. I create an array with eight elements, and each element contains a list: the 0th, 8th, 16th, 24th, etc. characters; the 1st, 9th, 17th, etc.; the 2nd, 10th, 18th, etc.; just like before. So we have a sequence of inputs where each one is offset by one from the previous one, and then our output is the same kind of thing, except indexed across by cs, i.e. eight — it's the character following each sequence of eight, which we'll predict from the previous ones.

Now we go through every one of those lists and turn each into a NumPy array, so here you can see we have eight inputs, and each one is a vector of length 75,000 or so. We do the same for our y's to get a NumPy array, and here we can visualize it. Looking at the first eight elements of x, and at the very first element of each: this column is the first eight characters of our text, and here is the ninth. So the first thing the model will try to do is look at these eight to predict this one, then the next eight to predict the next one, and so forth. And indeed you can see that this list here is exactly the same as that list there: the final character of each sequence is the same as the first character of the next sequence. So it's almost exactly the same as our previous data — we've just built it in a more flexible way. We'll create 42 latent factors as before, using exactly the same embedding_input function as before; and again we're going to use lists to store everything, so all of our embeddings go in a list: we go through each of the characters, create an embedding input and output for each one, and store them all in that list.
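Roughly, the data and embedding setup looks like this (a sketch with illustrative indexing; it reuses the embedding_input helper from earlier):

```python
cs = 8   # use eight characters to predict the ninth

# eight parallel lists: the 0th,8th,16th,... chars, the 1st,9th,17th,..., and so on
c_in_dat = [[idx[i+n] for i in range(0, len(idx)-cs, cs)] for n in range(cs)]
# the label: the character that follows each group of eight
c_out_dat = [idx[i+cs] for i in range(0, len(idx)-cs, cs)]

xs = [np.stack(c) for c in c_in_dat]
y = np.stack(c_out_dat)

# one embedding (input layer plus flattened output) per input position
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]
```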
Then we define all three weight matrices at once — our green arrow, orange arrow and blue arrow. Here we're basically saying there are three different weight matrices we want Keras to keep track of for us. The very first hidden state takes the first element of our list of inputs — each element is a tuple of two things, the input layer and the output of the embedding — so we take the output of the embedding for the very first character, pass it into our green arrow, and that gives us our initial hidden state. Then this next part looks exactly like what we saw before, but rather than writing it out separately for each character, we just loop through the remaining characters, one through eight, and for each we apply the green arrow, apply the orange arrow, and add the two together. Finally we take the final hidden state and put it through our blue arrow to create our output. Then we tell Keras that our model's inputs are all of the embedding input layers from the list we created, its output is the output we just built, and we can go ahead and fit the model.

We'd expect this to be more accurate, because it now has eight pieces of context with which to predict. And indeed, where previously we were getting a somewhat higher loss, this time we get down to 1.8. Still not great, but it's an improvement. We can create exactly the same kind of test as before, so now we can pass in eight characters and get a prediction of the ninth, and these all look pretty reasonable.

So that is our first RNN, which we've now built from scratch. This kind of RNN, where we take a list and predict a single thing, is most likely to be useful for things like sentiment analysis — remember our sentiment analysis example using IMDB, where we took a sequence (the list of words in a review) and predicted whether the sentiment was positive or negative. That would be an appropriate use case for this style of RNN.

(At this moment my computer crashed and we lost a little bit of the class's video, so I'm just going to fill in the bit that we missed here — sorry for the slight discontinuity.)
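To recap the part we just covered, here is a sketch of that from-scratch RNN (note the init='identity' on the hidden-to-hidden layer — more on that in a moment; as before, this is illustrative Keras 1-style code):

```python
dense_in     = Dense(n_hidden, activation='relu')                    # green arrow
dense_hidden = Dense(n_hidden, activation='relu', init='identity')   # orange arrow
dense_out    = Dense(vocab_size, activation='softmax')               # blue arrow

hidden = dense_in(c_ins[0][1])          # hidden state after the first character
for n in range(1, cs):
    c_dense = dense_in(c_ins[n][1])                   # green arrow on the new character
    hidden = merge([c_dense, dense_hidden(hidden)])   # plus orange arrow on the old state

c_out = dense_out(hidden)               # blue arrow, applied once at the very end

model = Model([inp for inp, emb in c_ins], c_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit(xs, y, batch_size=64, nb_epoch=4)
```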
Now, I want to show you something interesting you may have noticed, which is that when we created our hidden-to-hidden dense layer — our orange arrow — I did not initialize it in the default way, which is glorot initialization; instead I said init='identity'. You may also have noticed the equivalent thing in our Keras RNN, where inner_init='identity' refers to the same idea: what initialization is used for that orange arrow, i.e. how those weights are originally set.

So rather than initializing them randomly, we're going to initialize them with an identity matrix. An identity matrix, you may recall from your linear algebra at school, is a matrix which is all zeros except for ones down the diagonal. If you multiply any matrix by the identity matrix, it doesn't change the original matrix at all — you get back exactly what you started with. In other words, we're going to start off by initializing our orange arrow not with a random matrix but with a matrix that causes the hidden state to not change at all. That makes some intuitive sense: it seems reasonable to say, in the absence of other knowledge to the contrary, why not start off by having the hidden state stay the same until SGD has a chance to update it? But it also turns out to make sense based on empirical analysis. Since we always only do things that Geoffrey Hinton tells us to do, that's good news, because this is a paper by Hinton in which he points out this rather neat trick: if you initialize an RNN with the hidden weight matrix set to an identity matrix and use rectified linear units, as we are here, you get an architecture that can achieve fantastic results on some reasonably significant problems, including speech recognition and language modeling. I don't see this paper referred to or discussed very often, even though it's well over a year old now, so I'm not sure if people have forgotten about it or just haven't noticed it, but it's a good trick to remember: you can often get quite a long way with nothing but an identity-matrix initialization and rectified linear units, just as we've done here to set up our architecture.

Okay, so that's a nice little trick. The next thing we're going to do is make a couple of minor changes to this diagram. The first change concerns this rectangle here. The rectangle indicates what it is that we repeat: since in this case we're predicting character n from characters one through n−1, this whole area is what we loop through, from 2 to n−1, before generating our output once at the end. What we're going to do now is take this triangle — the output — and put it inside the loop, inside the rectangle. That means that every time we go around the loop, we generate another output. So rather than generating one output at the end, this model is going to predict characters two through n using characters one through n−1: it predicts character two using character one, character three using characters one and two, character four using characters one, two and three, and so forth. It's nearly exactly the same as the previous model, except that after every single step — after creating the hidden state at each step — we create an output.
So this is not going to create a single output, like the previous model, which predicted a single character — the character after the last character of the sequence, character n, using characters one through n−1. This one is going to predict the whole sequence of characters two through n using characters one through n−1. Okay, so that was all the material we lost when the computer crashed; let's now go back to the lesson.

Let's talk about how we'd implement this sequence model, where we predict characters two through n using characters one through n−1. Why would this be a good idea? There are a few reasons, but one obvious one is that if we only predict one output for every n inputs, then the number of opportunities our model has to back-propagate gradients and improve the weights is just once per sequence of characters. Whereas if we predict characters two through n from characters one through n−1, we get a whole lot more feedback about how the model is going: we can back-propagate n times — well, n−1 times — for every sequence. So there's a lot more learning going on for nearly the same amount of computation. The other reason this is handy is that, as you'll see in a moment, it's very helpful for creating RNNs that can handle truly long-term dependencies — or "context", as one of the earlier questions put it. We're going to start here before we look at how to do that. Really, any time you're doing a kind of sequence-to-sequence exercise, you probably want to construct something of this format, where the triangle is inside the rectangle rather than outside it.

It's going to look very similar, and I'm calling this "returning sequences": rather than returning a single character, we return a sequence. Most things are the same. Our character input data is identical to before, so I've just commented it out. But our output isn't a single character any more — it's a list of eight sequences. In fact it's exactly the same as the input, just shifted over by one, so in each sequence the first character will be used to predict the second, the first and second will predict the third, the first, second and third will predict the fourth, and so forth. There are a lot more predictions going on, and therefore a lot more opportunity for the model to learn. Then we create our y's just as we did our x's, and now our y dataset looks exactly like our x dataset, with everything just shifted across by one character.
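As a sketch, building the inputs and these shifted-by-one labels together might look like this (again, illustrative indexing):

```python
# each label position holds the character that follows the corresponding input position
c_in_dat  = [[idx[i+n]   for i in range(0, len(idx)-cs-1, cs)] for n in range(cs)]
c_out_dat = [[idx[i+n+1] for i in range(0, len(idx)-cs-1, cs)] for n in range(cs)]

xs = [np.stack(c) for c in c_in_dat]
ys = [np.stack(c) for c in c_out_dat]   # same layout as xs, shifted one character along
```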
The model is going to look almost identical as well. We've got our three dense layers as before, but we're going to do one thing differently: rather than treating the first character as special, I'm going to move it into the loop, so that instead of repeating from 2 to n−1 I repeat from 1 to n−1. The only thing we have to be careful of is that we now need to initialize the hidden state to something, and we're going to initialize it to a vector of zeros. So here we create something to hold that initial hidden state, which we'll feed with a vector of zeros shortly, and our initial hidden state is just the result of that. Then our loop is identical to before, except that at the end of every iteration we append the output — so we end up with eight outputs for every sequence rather than one. Our model therefore has two changes: first, it has an array of outputs; and second, we have to add the extra input that we'll feed our vector of zeros into.

Question: back on the diagram, could you say what the box means again? The box refers to the area that we loop over. Initially we repeated just the character-n input coming in and the hidden state feeding back into itself, from 2 to n−1; the box is the thing I loop through all those times. This time I'm looping through the whole thing — character input coming in, generating the hidden state, and creating an output — and repeating all of that each time. So now creating the output is inside the loop rather than outside it, and therefore we end up with an array of outputs.

So our model is nearly exactly the same as before, just with those two changes. When we fit it, we add an array of zeros to the start of our inputs, and our outputs are those lists of eight that have been offset by one. We can go ahead and train this, and you can see that as we train it we don't have just one loss any more — we have eight losses, because every one of the eight outputs has its own loss. How are we doing at predicting character one of each sequence, character two, three, four? As you'd expect, our ability to predict the first character using nothing but a vector of zeros is pretty limited, so that loss flattens out very quickly. Whereas our ability to predict the eighth character, which has seven characters of context, keeps on improving, and indeed within a few epochs we have a significantly better loss than before.

So this is what a sequence model looks like. When we test it, we pass in a sequence like ' this is', and after every character it returns its guess: after seeing a space, it guesses the next will be 't'; after seeing ' t', it guesses 'h'; and so forth. You can see it's predicting some pretty reasonable things, and quite often they're what actually happened: after seeing ' part' it expects that to be the end of the word, and indeed it was; after seeing 'part' it guesses the next word will be 'of', and indeed it was. So it's able to use sequences of eight characters to create context, which isn't brilliant, but it's an improvement.
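Here is what those two changes look like in the from-scratch version (a sketch continuing the earlier code; the 'zeros' input is the explicit initial hidden state):

```python
inp1 = Input(shape=(n_fac,), name='zeros')     # fed with an array of zeros at fit time
hidden = dense_in(inp1)                        # initial hidden state

outs = []
for n in range(cs):
    c_dense = dense_in(c_ins[n][1])
    hidden = merge([c_dense, dense_hidden(hidden)])
    outs.append(dense_out(hidden))             # an output at every time step

model = Model([inp1] + [inp for inp, emb in c_ins], outs)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

zeros = np.zeros((len(xs[0]), n_fac))          # the all-zeros initial state
# model.fit([zeros] + xs, ys, batch_size=64, nb_epoch=4)
#   (depending on the Keras version, the ys may need an extra trailing dimension)
```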
So how do we do the same thing with Keras? With Keras it's identical to our previous model, except for two things. First, we have to use the different input and output arrays I just showed you — the whole sequence of labels and the whole sequence of inputs. Second, we have to add one parameter: return_sequences=True. return_sequences=True simply says: rather than putting the triangle outside the loop, put the triangle inside the loop, and return an output every time you move to another time step, rather than just returning a single output at the end. It's that easy in Keras. I add return_sequences=True, I don't have to change my data at all other than some very minor dimensionality changes, and then I can go ahead and fit it. As you can see, I get a pretty similar loss to before, and I can build something that looks very much like what we had before and generate some pretty similar results. Okay, so that's how we create a sequence model with Keras.
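For reference, a minimal sketch of that Keras version (illustrative hyperparameters; Keras 1-style arguments, as used in the lesson):

```python
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, TimeDistributed, Dense

model = Sequential([
    Embedding(vocab_size, n_fac, input_length=cs),
    SimpleRNN(n_hidden, return_sequences=True,
              activation='relu', inner_init='identity'),
    # with return_sequences=True there is one output per time step, so the
    # final Dense layer is wrapped in TimeDistributed (more on this below)
    TimeDistributed(Dense(vocab_size, activation='softmax')),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```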
So then the question is: how do you create more state? How do you build a model that can handle long-term dependencies — a model that understands context? We can no longer present our pieces of data in random order. So far we've always been using the default when we call fit, which is shuffle=True, so it was passing across these sequences of eight in a random order. If we're going to build something that understands long-term dependencies, the first thing we have to do is use shuffle=False. The second thing is to stop passing in an array of zeros as the starting point every time around. Effectively, what I want to do is pass in my array of zeros right at the very start, when I first start training, but then at the end of each sequence of eight, rather than resetting to zeros, I want to keep the hidden state exactly where it is and start the next sequence of eight from there. That's what allows it to build up arbitrarily long dependencies.

In Keras that's as simple as adding one additional parameter, called stateful. When you say stateful=True, you're telling Keras: at the end of each sequence, don't reset the hidden activations to zero — leave them as they are. That means we have to make sure we pass shuffle=False when we train, so it processes the first eight characters of the book, then the second eight, then the third eight, leaving the hidden state untouched between each one, and therefore allowing it to build up as much state as it wants.

Training these stateful models is a lot harder than training the models we've seen so far, and the reason is this: in a stateful model, this orange arrow — this single weight matrix — is being applied to the hidden state not eight times but 100,000 times or more, depending on how big your text is. Just imagine if that weight matrix were even slightly poorly scaled — if there were one number in it that was just a bit too high. Effectively that number gets raised to the power of 100,000, because it's being multiplied again and again and again. So what can happen is the problem they call exploding gradients — or really, in some ways it's better described as exploding activations — because we're multiplying the hidden state by almost the same weight matrix each time, and if that weight matrix is anything less than perfectly scaled, the hidden matrix disappears off into infinity. So we have to be very careful about how we train these. Indeed, these kinds of long-term dependency models were thought to be effectively impossible to train for a while, until some folks — in the mid-90s, I think — came up with a model called the LSTM, or long short-term memory. We'll learn more about it next week, and we're actually going to implement it ourselves from scratch. In the LSTM, we replace this loop with a loop where there is actually a little neural network inside, one that decides how much of the state matrix to keep and how much to use at each activation. By having a neural network that controls how much state is kept and how much is used, it can learn to avoid those gradient explosions and to create an effective sequence model.

We'll look at that a lot more next week, but for now I'll tell you that when I tried to run this as a simple RNN — even with an identity-matrix initialization and relus — I had no luck at all. I had to replace it with an LSTM. Even that wasn't enough: I had to have well-scaled inputs, so I added a batch normalization layer after my embeddings. After I did those things, I could fit it. It still ran pretty slowly — before, I was getting four seconds per epoch; now it's 13 seconds per epoch — and the reason is that this is much harder to parallelize, since it has to process each sequence in order. So it's slower, but over time it does eventually get a substantially better loss than before, because it's able to keep track of, and use, this state.

Question: doesn't it make sense to use batch norm inside the loop as well? Good question — definitely maybe. There's been a lot of discussion and some recent papers about this. There's something called layer normalization, which is a method explicitly designed to work well with RNNs, where standard batch norm doesn't. It turns out it's very easy to do layer normalization in Keras using a couple of simple parameters you can provide to the normal batch norm constructor; in my experiments it hasn't worked so well, though, and I'll show you more about that in a few minutes.

Okay, so stateful models are great — we're going to look at some very successful stateful models in just a moment — but be aware that they're more challenging to train. You'll see another thing I had to do here: I had to reduce the learning rate in the middle of training, again because you have to be so careful about these exploding-gradient problems.
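Putting those pieces together, a stateful version might look roughly like this (a sketch with assumed hyperparameters; with stateful=True Keras needs a fixed batch size via batch_input_shape, and you train with shuffle=False):

```python
from keras.models import Sequential
from keras.layers import Embedding, BatchNormalization, LSTM, TimeDistributed, Dense
from keras.optimizers import Adam

bs = 64   # fixed batch size required for a stateful model

model = Sequential([
    Embedding(vocab_size, n_fac, input_length=cs, batch_input_shape=(bs, cs)),
    BatchNormalization(),    # keep the embedding outputs well scaled
    LSTM(n_hidden, return_sequences=True, stateful=True),
    TimeDistributed(Dense(vocab_size, activation='softmax')),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(1e-3))

# shuffle=False so sequences arrive in order and the hidden state carries over;
# x_rnn / y_rnn stand for whatever arrays hold the sequence inputs and labels
# model.fit(x_rnn, y_rnn, batch_size=bs, nb_epoch=1, shuffle=False)
```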
So let me show you what I did with this. I tried to create a stateful model that worked as well as I could. I took the same Nietzsche data as before, but I tried splitting it into chunks of 40 rather than eight, so that each sequence could do more work — here are some examples of those chunks of 40. And I built a model that was slightly more sophisticated than the previous one, in two ways.

The first is that it has an RNN feeding into an RNN. That might sound like a crazy idea, so I've drawn a picture. An RNN feeding into an RNN means that the output of the first RNN is no longer going to the final output; it's becoming the input to the second RNN. The character input goes into the first RNN and its state updates as usual, and each time we step through the sequence it feeds its result into the state of the second RNN. Why is this useful? Because it means the output is now coming not from a single dense matrix here and a single dense matrix there, but through one, two, three dense matrices and activation functions. I now have a deep neural network — assuming two layers counts as deep — between my first character and my first output, and indeed between every hidden state and every output. Effectively, this lets us give the model a little deep neural net for all of its activations, and that turns out to work really well, because the structure of language is pretty complex, so it's nice to give it a more flexible function to learn. And it's this easy to create: you just copy and paste whatever your RNN line is, twice. You can also see I've now added dropout inside my RNN; as I talked about before, adding dropout inside your RNN turns out to be a really good idea, and there's a great recent paper showing that this is an effective way to regularize an RNN.

The second change is that rather than going straight from the RNN to the output, I go through a dense layer. And you might have noticed that the dense layers here have this extra word at the front: TimeDistributed. It might be easier to understand why by looking at the earlier sequence model in Keras. Note that the output of the RNN isn't just one vector of length 256; it's eight vectors of length 256, because it's actually producing eight outputs. So we can't just use a normal dense layer, because a normal dense layer needs a single dimension it can squish down. What we actually want is eight separate dense layers at the output, one for each of the outputs. What TimeDistributed does is say: whatever layer is inside, create eight copies of it — or however long this time dimension is — and every one of those copies shares the same weight matrix, which is exactly what we want. So the short version is: in Keras, any time you say return_sequences=True, any dense layers after that have to be wrapped in TimeDistributed, because we want to create not one dense layer but eight. In this case, since we're saying return_sequences=True, we have a TimeDistributed dense layer, some dropout, and another TimeDistributed dense layer.
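A sketch of that fancier architecture (sizes and dropout values are illustrative; dropout_W and dropout_U are the Keras 1 arguments for dropout inside the recurrent layer; the stateful and batch-norm details from above would be added in the same way as before):

```python
from keras.layers import Dropout

maxlen = 40   # the longer chunks

model = Sequential([
    Embedding(vocab_size, n_fac, input_length=maxlen),
    # two stacked LSTMs: the first returns its whole sequence of hidden
    # states, which becomes the input sequence of the second
    LSTM(512, return_sequences=True, dropout_W=0.1, dropout_U=0.1),
    LSTM(512, return_sequences=True, dropout_W=0.1, dropout_U=0.1),
    # a small deep net applied to every time step's output
    TimeDistributed(Dense(512, activation='relu')),
    Dropout(0.1),
    TimeDistributed(Dense(vocab_size, activation='softmax')),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```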
So everything is just pushed through. Probably the best way to think about this is to draw it in the unrolled form; once you do, you'll realize there's nothing magical about it at all. In its unrolled form it just looks like a pretty standard deep neural net.

"What's dropout_U and dropout_W?" We'll talk about that more next week. In an LSTM, I mentioned there are little neural nets that control how the state updates work, and these parameters control how dropout is applied inside those little neural nets.

"When stateful is false, can you explain again what is reset after each training example?" Sure. The best way to describe that is to show us doing it. Remember that the RNNs we built are identical to what Keras does, or close enough, so let's go and look at our version of return_sequences. You can see that what we did was create a matrix of zeros that we stuck onto the front of our inputs, so every set of eight characters now starts with a vector of zeros. In other words, this hidden state gets initialized to zero at the start of every sequence, and it's this hidden state which is where all of those dependencies and all of that state are kept. So doing that resets the state every time we look at a new sequence. When we say stateful=True, on the other hand, it only does this initialize-to-zero step once at the very start, or when we explicitly ask it to. So when I actually run this model, I wrote a little thing called run_epochs which calls model.reset_states() and then fits for one epoch, which is what you really want: at the end of the entire works of Nietzsche you want to reset the state, because you're about to go back to the very start and begin again.

So with this multi-layer LSTM feeding into a multi-layer neural net, I then tried seeing how it goes, and remember that with our simpler versions, a loss of about 1.6 was the best we could do. After one epoch it's awful. And now, rather than just printing out one letter, I'm starting with a whole sequence of letters, which is this seed here, and asking it to generate a sequence, and you can see it starts out by generating a pretty rubbishy one.

One more question: "In the double-LSTM model, what is the input to the second LSTM in addition to the output of the first LSTM?" In addition to the output of the first LSTM, it gets its own previous hidden state.

Okay, so after a few more epochs it's starting to create some actual proper English words, although they aren't necessarily making a lot of sense. So I keep running epochs. At this point it's learned how to start chapters: in this book the chapters always start with a number and then an equals sign. It hasn't learned how to close quotes, apparently, and it's not really saying anything useful. So anyway, I ran this overnight, and I then seeded it with a larger amount of data, so I seeded it with all of this text, and I started getting some pretty reasonable results. "Shreds into one's own suffering" sounds exactly like the kind of thing you might see; "religions have acts done by man"... I mean, it's not all perfect, but it's not bad. Interestingly, this sequence here, when I looked it up, actually appears verbatim in his book, and that makes sense, right?
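Here is a minimal sketch of that run_epochs idea, assuming xs and ys are the training arrays and that the model was built stateful with a fixed batch size. The point is simply that with stateful=True nothing resets the hidden state for you, so you reset it yourself at the start of each pass over the full text, and keep shuffling off so the sequences stay in order.

```python
def run_epochs(model, xs, ys, n_epochs, batch_size):
    for i in range(n_epochs):
        model.reset_states()   # back to the start of the corpus
        model.fit(xs, ys, batch_size=batch_size, nb_epoch=1, shuffle=False)
```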
It's a kind of overfitting, in a sense. He loves writing in all caps, but he only does it from time to time, and it so happened that once the model started writing something in all caps that looked like this phrase, which only appears once and is very distinctive, there was essentially no other way it could have finished it. So sometimes you get these little rare phrases that are basically plagiarized directly from Nietzsche.

Now, I didn't stop there, because I thought: how can we improve this? It was at this point that I started thinking about batch normalization. I started fiddling around with a lot of different kinds of batch normalization and layer normalization and discovered an interesting insight, which is that, at least in this case, the very best approach was simply to apply batch normalization to the embedding layer.

So I want to show you what happened when I applied batch normalization to the embedding layer. This is the training curve I got: over epochs, this is my loss with no batch normalization on the embedding layer, and this was my loss with it. You can see this one was starting to flatten out, this one really wasn't, and this one was training a lot quicker. So then I tried training it overnight with batch norm on the embedding layer, and I was pretty stunned by the results. This was my seeding text, and after a thousand epochs this is what it came up with, and it has all kinds of actually pretty interesting little things in it. "Perhaps some morality equals self-glorification": that's really cool. "For there are holy eyes to Schopenhauer's blind": that's interesting. "In reality, we must step above it." You can see that it has learned to close quotes even when those quotes were opened a long time ago; if we weren't using a stateful model it would never have learned to do this. And I've looked up these words in the original text, and pretty much none of these phrases appear: this is genuinely novel text it has produced. It's not perfect by any means, but considering that this is working character by character, using nothing but a 42-long embedding for each character, with no pre-trained vectors or anything, and just a fairly short corpus of about 600,000 characters per epoch, I think it's done a pretty amazing job of creating a pretty good model.

And so there's all kinds of things you could do with a model like this.
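Since all of these results come from seeding the model with some text and letting it write the rest, here is a hedged sketch of what that generation loop can look like. It assumes a character-level stateful model built for inference with batch size 1 and sequence length 1 (in practice you would typically copy the trained weights into such a model), and assumes char_indices / indices_char lookup dicts; none of this is the exact notebook code.

```python
import numpy as np

def generate_text(model, seed, n_chars):
    model.reset_states()
    result = seed
    # warm up the hidden state on every seed character except the last
    for c in seed[:-1]:
        model.predict(np.array([[char_indices[c]]]))
    x = np.array([[char_indices[seed[-1]]]])
    for _ in range(n_chars):
        preds = model.predict(x)[0][-1]     # distribution over the next character
        preds = preds / preds.sum()         # guard against tiny float drift
        nxt = np.random.choice(len(preds), p=preds)
        result += indices_char[nxt]
        x = np.array([[nxt]])               # feed the sampled character back in
    return result
```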
I mean, the most obvious one would be if you were producing a software keyboard for a mobile phone, for example: you could use this to make a pretty accurate guess about what the user is going to type next and correct it for them. You could do something similar on a word basis. But more generally, you could do something like anomaly detection: you could generate a sequence predicting what the rest of the sequence is going to look like over the next hour, and then recognize when something falls outside that prediction, so you know there's been some kind of anomaly. There are all kinds of things you can do with these kinds of models, and I think that's pretty fun.

But I want to show you something else which is pretty fun, which is to build an RNN from scratch in Theano. What we're going to do is work up to next week, where we're going to build an RNN from scratch in numpy, and we're also going to build an LSTM from scratch in Theano. The reason we're doing this is that next week is our last class in this part of the course, and I want us to leave feeling like we really understand the details of what's going on behind the scenes. The main thing I wanted to teach in this class is the applied stuff, the practical tips about how you build a sequence model: use return_sequences=True, put batch norm on the embedding layer, add TimeDistributed to the dense layer, and so on. But I also know that to really debug your models and to build your own architectures, it helps a lot to understand what's going on underneath, particularly right now, when the tools and libraries available are not that mature and still require a whole lot of manual work. So I do want to try to explain a bit more about what's going on behind the scenes.

In order to build an RNN in Theano, I'm first going to make a small change to our Keras model, which is that I'm going to use one-hot encoding. I don't know if you noticed, but we did something pretty cool in all of our models so far, which is that we never actually one-hot encoded our output.

"Does TimeDistributed dense take longer to train than a plain dense layer, and is it really that important to use it?" If you don't add TimeDistributed dense to a model where return_sequences=True, it literally won't work: it won't compile, because you're trying to predict eight things and a plain dense layer is going to squash that all into one thing, so it will complain about a mismatch in your dimensions. But no, it doesn't really add much time, because it's something that can be parallelized very easily, and since a lot of things in RNNs can't be easily parallelized, there's generally plenty of spare room on your GPU to do that extra work. So that should be fine.
The short answer, though, is that you have to use it; otherwise the model won't even compile.

Okay, I wanted to point out something, which is that in all of our models so far we did not one-hot encode our outputs. Our outputs, remember, looked like this: they were sequences of integers. Previously we've always had to one-hot encode our targets to use them, but it turns out that Keras has a very handy loss function called sparse categorical cross-entropy. It's identical to categorical cross-entropy, but rather than taking a one-hot encoded target it takes an integer target, and it acts as if you had one-hot encoded it; it basically does the indexing directly. This is a really helpful thing to know about, because when you have a lot of output categories, for example in a word-level model where you could have a hundred thousand output categories, there's no way you want to create a vector a hundred thousand long, nearly all zeros, for every single word in your output. By using sparse categorical cross-entropy you can skip the whole one-hot encoding step; Keras effectively does it for you, but without ever materializing it, by doing a direct lookup into the matrix.

However, because I want to keep things simple for us to understand, I'm going to go ahead and recreate our Keras model using one-hot encoding. So I'm going to take exactly the same model we had before, with return_sequences=True, but this time I'm going to use normal categorical cross-entropy. The other thing I'm changing is that I don't have an embedding layer, and since I don't have an embedding layer, I also have to one-hot encode my inputs. So you can see I'm calling to_categorical on all of my inputs and on all of my outputs, and now the shape is 75,000 by eight, as before, by 86; that last dimension is the one-hot encoding, with 85 zeros and a single one. We fit this in exactly the same way and get exactly the same answer. The only reason I did that is that I want to use one-hot encoding for the version we're going to create from scratch.

Okay, so we haven't really looked at Theano before. But particularly if you come back next year, as we start to add more and more stuff on top of Keras, or into Keras, you'll increasingly find yourself wanting to use Theano, because Theano is the language, if you like, that Keras is using behind the scenes, and therefore it's the language you can use to extend it. Of course you can use TensorFlow as well, but we're using Theano in this course because I think it's much easier for this kind of application. So let's learn to use Theano, and in the process we're going to have to force ourselves to think through a lot more of the details than we have before, because Theano doesn't have any of the conveniences that Keras has. There's no such thing as a layer; we have to think about all of the weight matrices and activation functions and everything ourselves.

So let me show you how it works. In Theano there's this concept of a variable, and a variable is something we define like so: we can say there is a variable which is a matrix that I'll call the input, there is a variable which is a matrix that I'll call the output, and there is a variable which is a vector that we'll call h0. What these are all saying is that these are things we will give values to later.
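Here is a hedged sketch of those two pieces: one-hot encoding the inputs and targets with Keras's to_categorical, and declaring the Theano variables that will receive data later. xs and ys are assumed to be the integer-encoded input and target arrays, and the exact names and shapes are assumptions rather than the notebook's code.

```python
import numpy as np
import theano
import theano.tensor as T
from keras.utils.np_utils import to_categorical

vocab_size = 86
# one-hot encode: (n_sequences, 8) of ints -> (n_sequences, 8, 86) of zeros and ones
oh_x = to_categorical(xs.flatten(), vocab_size).reshape(xs.shape + (vocab_size,))
oh_y = to_categorical(ys.flatten(), vocab_size).reshape(ys.shape + (vocab_size,))

# Theano variables: symbolic placeholders with no data attached yet
t_inp  = T.matrix('inp')    # one sequence of one-hot characters
t_outp = T.matrix('outp')   # the corresponding one-hot targets
t_h0   = T.vector('h0')     # initial hidden state
```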
Programming in Theano is very different from programming in normal Python, and the reason is that Theano's job in life is to give you a way to describe a computation you want to do, then compile it for the GPU, and then run it on the GPU. So it's a little more complex to work in Theano, because it isn't something where we immediately say do this, then do this, then do this. Instead we build up what's called a computation graph: a series of steps where we say, in the future I'm going to give you some data, and when I do, I want you to perform these steps. So rather than starting off by giving it data, we start off by describing the types of data we will eventually give it. Eventually we're going to give it some input data, some output data, and some way of initializing the first hidden state. Oh, and we'll also give it a learning rate, because we might want to change it later. That's all these lines do: they create Theano variables. We can then put them into a list, and that list is all of the arguments we're going to have to provide to Theano later on. There's no data here, and nothing is being computed; we're just telling Theano that these things will be used in the future.

The next thing we need to do, because we're going to build this structure here, is create all the pieces for these layer operations. Specifically, we have to create the weight matrix and bias vector for the orange arrow, the weight matrix and bias vector for the green arrow, and the weight matrix and bias vector for the blue arrow, because that's what these layer operations are: a matrix multiply followed by a nonlinear activation function.

I've created some functions to do that. w_h is what I'm going to call the weights and bias for my hidden layer, w_x will be the weights and bias for my input, and w_y will be the weights and bias for my output. To create them, I've written a little function called weights_and_bias, to which I pass the size of the matrix I want: the matrix that goes from input to hidden therefore has n_input rows and n_hidden columns. weights_and_bias returns a tuple: the weights and the bias. So how do we create the weights? First we calculate the magic Glorot number, the square root of two over the fan-in; that's the scale of the random numbers we're going to use. We then create those random numbers using numpy's normal random number function, and then we wrap them in a special Theano construct called shared. What shared does is say to Theano: this data is something I'm going to want you to pass off to the GPU later and keep track of; as soon as you wrap something in shared, it kind of belongs to Theano. So here is a weight matrix that belongs to Theano, and here is a vector of zeros that belongs to Theano, and that's our initial bias.
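As a concrete sketch, under the same naming assumptions as before (n_hidden is an assumed hidden-layer size, and this is not the exact notebook code), the argument list and the weight/bias helper look something like this: Glorot-scaled random weights and zero biases, each wrapped in theano.shared so Theano owns them and can move them to the GPU.

```python
lr = T.scalar('lr')                      # learning rate, supplied at call time
all_args = [t_h0, t_inp, t_outp, lr]     # everything we'll pass to Theano later

def shared(x):
    # hand a numpy array over to Theano so it can manage (and GPU-host) it
    return theano.shared(np.asarray(x, dtype=theano.config.floatX))

def weights_and_bias(n_in, n_out):
    scale = np.sqrt(2.0 / n_in)          # the "magic Glorot number"
    w = shared(np.random.normal(scale=scale, size=(n_in, n_out)))
    b = shared(np.zeros(n_out))
    return w, b

w_x = weights_and_bias(vocab_size, n_hidden)   # green arrow: input -> hidden
w_y = weights_and_bias(n_hidden, vocab_size)   # blue arrow: hidden -> output
```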
Okay, so we've initialized our weights and our bias, and we can do that for our inputs and for our outputs. Then for our hidden weights, which is the orange arrow, remember, we're going to do something slightly different, which is to initialize them with an identity matrix; rather amusingly, the numpy function for that is called eye, as in i for identity. So this is an identity matrix, believe it or not, of size n by n, and that's our initial hidden weight matrix; the initial bias is exactly as before, a vector of zeros.

So you can see we've had to manually construct each of these three weight matrices and bias vectors, and it's convenient to now stick them all into a single list. Python has this thing called chain.from_iterable, which basically takes all of these tuples and dumps them together into one flat list, so this now has all six pieces, the three weight matrices and the three bias vectors, in a single list.

Okay, now we have defined the initial contents of each of these arrows, and we've also defined, symbolically, the idea that we're going to have something to initialize the hidden state with, something to feed in as input, and some target to compare against. So the next thing we have to do is tell Theano what happens each time we take a single step of this RNN.

On the GPU you can't use a for loop. The reason is that a GPU wants to parallelize things, to do lots of work at the same time, and a for loop by definition can't do the second iteration until it has finished the first. I don't know if we'll get time to cover it in this course, but there's a very neat result which shows that there's something very similar to a for loop that you can parallelize, and it's called a scan operation. A scan operation is defined in a very particular way: you call some function for every element of some sequence, and at every point the function returns some output, and the next time the function is called it receives the output of the previous call along with the next element of the sequence. In fact, I wrote a very simple example of it in pure Python, which I think is here: here is the definition of scan, and here is an example of using it. Let's start with the example. I want to do a scan where the function I'm using adds two things together, I start off with the number zero, and I pass in the range of numbers from nought to four. So what scan does is this: the first time through, it calls the function with the starting value and the first element of the sequence, so that's zero plus zero, which equals zero. The second time, it calls the function with the second element of the sequence and the result of the previous call, so zero plus one equals one. The next time through, it calls the function with the result of the previous call plus the next element of the range, so one plus two equals three, and so on.
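Here is a small, runnable pure-Python version of that idea, just to make the mechanics concrete (this is my own minimal rendering of the example described above, not necessarily the exact code shown in class):

```python
def scan(fn, start, seq):
    # apply fn step by step, feeding each call the previous result and the next element
    res = []
    prev = start
    for s in seq:
        prev = fn(prev, s)
        res.append(prev)
    return res

scan(lambda prev, x: prev + x, 0, range(5))   # -> [0, 1, 3, 6, 10]
```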
So you can see this scan operation defines a cumulative sum. And you can see the definition of scan here: we're going to return a list of results; initially we take our starting point, zero, as the previous result; then we go through everything in the sequence, which here is nought through four, apply the function, which in this case just adds things up, to the previous result and the next element, stick the result on the end of our list, set the previous result to whatever we just got, and move on to the next element of the sequence.

Now, it may be very surprising, and hopefully it is, because it's an extraordinary result, but it is possible to write a parallel version of this. So if you can turn your algorithm into a scan, you can run it quickly on a GPU, and our job is to turn this RNN into something we can express as a scan. So let's do that.

The function we're going to call on each step through is called step, and it's something that hopefully will not be very surprising to you. It takes our input x, does a dot product with that weight matrix we created earlier, w_x, and adds on the bias we created earlier. Then we do the same thing with our previous hidden state, multiplying it by the weight matrix for the hidden state and adding the hidden-state bias, add the two together, and put the whole thing through an activation function, a relu. In other words, that's calculating these pieces; in fact, let's go back to the unrolled version. We had one bit which took our previous hidden state and put it through the hidden-state weight matrix, which is the orange arrow, and another which took our next input and put it through the input weight matrix, and then we added the two together. That's exactly what we have here: the x by w_x and the h by w_h, added together along with the biases, and then put through an activation function. Once we've done that, we want to create an output at every single step, and the output is computed in exactly the same way: it takes the result of that, which we called h, our hidden state, multiplies it by the output weight matrix, adds on the bias, and this time puts it through a softmax. So this sequence of operations describes how to do one of these steps, and it therefore defines what we want to do at each step through; at the end of it we return the hidden state so far and our output.

So what's the sequence we pass in? Well, we're not going to give it any data yet, because remember, all we're doing is describing a computation; for now we're just telling it that this will be a matrix, that we will pass it a matrix. It also needs a starting point, and the starting point, again, is something we promise to provide: an initial value for the hidden state, which we haven't given yet. And finally, in Theano you have to tell it what all of the other things passed to the function are, and here we pass it that whole list of weights.
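Concretely, and still under the same naming assumptions as the earlier sketches (including the identity-initialised hidden weights and the flattened weight list described above), the step function and the theano.scan call look roughly like this. The flatten on the softmax output works around Theano's softmax returning a 1-by-n matrix even for a vector input, and relu is written with T.maximum so it works on older Theano versions too.

```python
from itertools import chain

w_h = (shared(np.eye(n_hidden)), shared(np.zeros(n_hidden)))   # orange arrow: identity init
w_all = list(chain.from_iterable([w_h, w_x, w_y]))             # W_h, b_h, W_x, b_x, W_y, b_y

def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    # combine the input and the previous hidden state, then apply the nonlinearity
    h = T.maximum(0, T.dot(x, W_x) + b_x + T.dot(h, W_h) + b_h)   # relu
    # output activations for this timestep
    y = T.nnet.softmax(T.dot(h, W_y) + b_y)
    return h, T.flatten(y, 1)

[v_h, v_y], _ = theano.scan(step,
                            sequences=t_inp,            # "it will be a matrix"
                            outputs_info=[t_h0, None],  # initial hidden state; y has no recurrence
                            non_sequences=w_all)        # all the weights and biases
```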
So that's why the step function takes the x, the hidden state, and then all of the weights and biases. That now describes to Theano how to execute a whole sequence of steps for an RNN. We haven't given it any data to run on; we've just set up the computation. When that computation is run, it's going to return two things, because step returns two things: the hidden state and the output activations.

So now we need to calculate our error. Our error will be the categorical cross-entropy, and you can see I'm using some Theano functions here. We compare the output that came out of our scan with the target, which we don't have yet but which will be a matrix, and once we've done that, we add it all up.

Now here's the amazing thing. At every step we're going to want to apply SGD, which means at every step we want to take the derivative of this whole thing with respect to all of the weights and use that, along with the learning rate, to update all of the weights. In Theano, this is how you do it: you just say, please tell me the gradient of this function with respect to these inputs, and Theano will symbolically, automatically calculate all of the derivatives for you. That's very nearly magic, and it means we don't have to worry about derivatives at all, because it calculates them all for us.

So at this point I have a function that calculates our loss, and I have a function that calculates all of the gradients we need with respect to all of the parameters we have, and we're ready to build our final function. As input it takes all of our arguments, that is, the four things we told it we were going to need later. The thing it creates as output is the error. And at each step it's going to do some updates. What are the updates? They're the result of this little function, which creates a dictionary mapping every one of our weights to that weight minus its gradient times the learning rate. So it's going to update every weight to itself minus its gradient times the learning rate. Basically, Theano has this thing called updates which says: every time you calculate a step, I want you to change your shared variables as follows, and here is our list of changes.

And that's it. So we use our one-hot encoded x's and our one-hot encoded y's, and we have to manually write our own training loop, because Theano doesn't have any of that built in for us. We go through every element of our input and call the function we just created, and we finally have to pass in values for all of those inputs: the initial hidden state, the input, the target, and the learning rate. This is where we actually give it data, when we finally call it.
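Putting those last pieces into code, still as a hedged sketch under the earlier naming assumptions (v_y and w_all from the scan sketch, oh_x and oh_y the one-hot arrays, n_hidden the hidden size), the loss, gradients, update rule, compiled function and manual loop look roughly like this:

```python
error = T.nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)          # Theano works out every derivative symbolically

def upd_dict(wgts, grads, lr):
    # map each shared weight to: weight - learning_rate * gradient
    return {w: w - g * lr for (w, g) in zip(wgts, grads)}

fn = theano.function(all_args, error,
                     updates=upd_dict(w_all, g_all, lr),
                     allow_input_downcast=True)

# the manual training loop: one sequence at a time ("online" SGD)
err = 0.0
for i in range(len(oh_x)):
    err += fn(np.zeros(n_hidden), oh_x[i], oh_y[i], 0.01)
    if i % 1000 == 999:
        print(err / 1000)
        err = 0.0
```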
The initial hidden state is just a bunch of zeros; then we pass in our input, our target, and our learning rate, which we set to 0.01. I've also added a little bit here that prints out the error every thousand examples, and as you can see, over time it learns. At the end of learning I create a new Theano function which takes some input along with an initial hidden state, and produces not the loss but the output.

"Are we using gradient descent or stochastic gradient descent here?" We're using stochastic gradient descent with a mini-batch size of one. Gradient descent without the "stochastic" actually means using a mini-batch the size of the whole dataset, so this is the opposite of that; I think this is called online gradient descent. That's a good question, thank you.

So remember that earlier on we had this thing to calculate the vector of outputs. Now, to do our testing, we create a new function that goes from our input to that vector of outputs, and our predictions come from calling that function with an initial hidden state and some input. So if we grab some sequence of text and pass it to that function, we can look at what it predicts: after 't' it expected 'h', after 'th' it expected 'e', after 'the' it expected a space, after 'then' it expected a space (it actually got a question mark), after 'then?' it expected a space and finally got one, and after 'then? ' it expected a capital T, and so forth. So you can see that we have successfully built an RNN from scratch using Theano.

That's been a very quick run-through. My goal tonight is really to get you to a point where you can look at this during the week and see all the pieces, because next week we're going to build an LSTM in Theano, which means I want you, by next week, to start to feel like you've got a good understanding of what's going on. So please ask lots of questions on the forum, look at the documentation, and so forth. And the next thing we're going to do after that is build an RNN without using Theano, in pure numpy, which means we won't be able to use T.grad; we're going to have to calculate the gradients by hand. Hopefully that will be a useful exercise in really understanding what's going on in backpropagation. So I want to make sure you feel like you've got enough information to get started looking at Theano this week. Does anybody want to ask any questions about this piece so far?

"This is maybe a bit too far from what we did today, but how would you apply an RNN to, say, images, something other than text? Is that worth doing, and if so, what changes?" Yeah, sure. The main way an RNN is applied to images is what we looked at last week: these things called attentional models, where you basically say, given which part of the image you're currently looking at, which part would make sense to look at next?
This is most useful on really big images, where you can't look at the whole thing at once because it would just eat up all of your GPU's RAM, so you can only look at a little bit at a time.

Another way RNNs are very useful for images is captioning. We'll talk a lot more about this in next year's course, but have a think about it in the meantime. If we've got an image, then a CNN can turn it into a vector representation of that image; for example, we could put it through VGG and take the penultimate layer's activations. There are all kinds of things we could do, but in some way we can turn an image into a vector representation of it. We can do the same thing to a sentence: we take a sentence consisting of a number of words, put it through an RNN, and at the end of it we get some state, and that state is also just a vector. What we could then do is learn a neural network that maps the picture's vector to the text's vector, assuming the sentence was originally a caption written for that image. If we can learn a mapping from some representation of the image that came out of a CNN to some representation of a sentence that came out of an RNN, we can basically reverse it in order to generate captions for an image. We take some new image we've never seen before, put it through the CNN to get its vector, figure out what RNN state we'd expect to be attached to it based on the mapping we've learnt, and then do sequence generation, just like we have today, to produce a sequence of words. That's roughly how the image-captioning systems I've shown you work; I'm sure you've seen those.

The only other way I've seen RNNs applied to images is for really big 3D images, for example in medical imaging. If you've got an MRI, it's basically a series of slices, and it's too big to look at the whole thing at once. Instead you can use an RNN that starts in the top corner, looks one pixel across, then another, then back, then goes down into the next slice, and gradually covers the whole volume one pixel at a time, accumulating state about what is contained in this 3D volume. This is not something that's very widely used, at least at this point, but I think it's worth thinking about, because again you could combine it with a CNN: maybe you have a CNN that looks at large chunks of the MRI at a time and generates state for each chunk, and then an RNN that goes through the chunks. There are all kinds of ways you can combine CNNs and RNNs together.

So, two questions. "Can you build a custom layer in Theano and then mix it with Keras?" Oh, for sure.
And in fact it's incredibly easy. If you search for "keras custom layer" you'll find lots of examples; they're generally in the Keras GitHub issues, where people show things like "I was trying to build this layer and I had this problem", and it's a good way to see how to build them. Here, for example, is somebody who created a merge layer of their own.

The other thing I find really useful is to look at the definition of the layers in Keras itself. One of the things I did was create this little function called pypath, which takes any Python module and returns the directory that module is defined in, so I can go and have a look at how any particular layer is defined. Let's say I want to look at pooling: here is a MaxPooling1D layer, and you can see it's defined in nine lines of code. Generally speaking, layers don't take very much code at all.

Okay, two more questions. "Could we, given a caption, create an image?" Yes, you can absolutely create an image from a caption. There's a lot of image-generation work going on at the moment. It's not at a point where it's useful for anything in practice yet; it's more of an interesting research direction. Generally speaking this is the area called generative models, and we'll be looking at generative models next year, because they're very important for unsupervised and semi-supervised learning.

"And what would get the best performance on a document-classification task: a CNN, an RNN, or both?" That's a great question. Let's go back to sentiment analysis, and remind ourselves that when we looked at sentiment analysis on IMDB, the best result we got came from a multi-size convolutional neural network, where we basically combined a bunch of convolutional networks of varying sizes, and a simple convolutional network was nearly as good. I actually tried an LSTM for this, and the accuracy I got was worse than the accuracy of the CNN. I think the reason is that when you have a whole movie review, which is a few paragraphs, the information you can get just by looking at a few words at a time is enough to tell you whether it's a positive or a negative review. If you see a sequence of five words like "this is totally shit", you can probably learn that's not a good thing, whereas if it's "this is totally awesome", you can probably learn that it is. The amount of nuance involved in reading an entire review word by word just doesn't seem to be needed in practice, so in general, once you get to a certain sized piece of text, a paragraph or two, there doesn't seem to be any sign that RNNs are helpful, at least at this stage.

Okay, so before I close off, I wanted to show you two little tricks, because I don't spend enough time showing you cool little tricks, and when I was working with Brad today there were two that we realized other people might like to learn about. The first one: if you want to learn about how a function works, what would be a quick way to find out?
If you've got a function there on your screen and you hit Shift-Tab, all of the parameters to it will pop up, and if you hit Shift-Tab twice, the documentation will pop up. So that was one little tip I wanted you to know about; I think it's pretty handy.

The second little tip you may not have been aware of is that you can actually run the Python debugger inside a Jupyter notebook. Today we were doing exactly that when we were trying to debug our pure-Python RNN, so let's see an example. Let's say we were having some problem inside our loop here. You can go import pdb, that's the Python debugger, and then you can set a breakpoint anywhere by calling pdb.set_trace(). Now if I run that, as soon as it gets to that point it pops up a little prompt, and at this point I can look at anything: for example, I can ask for the value of h at this point, I can list the lines I'm about to execute, and I can execute just the next line and see which line comes next. If you want to learn more, just google for the Python debugger. Learning to use the debugger is one of the most helpful things you can do, because it lets you step through each stage of what's going on, see the values of all of your variables, and do all kinds of cool stuff like that.

So those were two little tips I thought I would leave you with, so we can finish on a high note. And that's nine o'clock. Thanks very much, everybody.