So I wanted to start off by showing you something I'm pretty excited about. Here is the Dogs vs. Cats competition, which we all know so well, and it was interesting that the winner of this competition won by a very big margin: a 1.1% error versus a 1.7% error. It's very unusual in a Kaggle competition to see anybody win by, you know, what is that, a 50 or 60% relative margin. You can see that after the winner, people are generally clustering around about the same kind of number, a bit over 98% accuracy. So this was a pretty impressive performance; this is the person who actually created a piece of deep learning software called OverFeat.

So I want to show you something pretty interesting, which is that this week I tried something new and got 98.95% on Dogs vs. Cats. The way I did that was by using nearly only techniques I've already shown you. I created a standard model, basically a dense model; I pre-computed the output of the last convolutional layer; and then I trained the dense model lots of times. The other thing I did was to use some data augmentation. I didn't actually have time to figure out the best data augmentation parameters, so I just picked some that seemed reasonable. I should also mention this 98.95 would be easy to make a lot better: I'm not doing any pseudo-labeling here, and I'm not even using the full dataset, since I put aside 2,000 images for the validation set. With those two changes we would definitely get well over 99% accuracy.

The missing piece that I added is batch normalization for VGG. Batch norm, if you remember, I said the important takeaway is that all modern networks should use it, because you can get 10x or more improvements in training speed, and it tends to reduce overfitting. Because of the second point, it means you can use less dropout; and dropout, of course, is destroying some of your network, so you don't want to use more dropout than necessary. So why didn't VGG already have batch norm? Because it didn't exist: VGG was mid to late 2014, and batch norm was, I can't quite remember, early to mid 2015. OK, so why haven't people added batch norm to VGG already? The answer is actually interesting to think about.

To remind you what batch norm is: first of all, it normalizes every intermediate layer. It normalizes all of the activations by subtracting the mean and dividing by the standard deviation, which is always a good idea. Somebody on the forum today asked why that's a good idea, and I put a link there to some more information, so anybody who wants to know more about why we do normalization, check out the forum. But just doing that alone isn't enough, because SGD is quite bloody-minded: if it was trying to de-normalize the activations, because it thought that was a good thing to do, it would do so anyway, and every time you tried to normalize them, SGD would just undo it again. So what batch norm does is add two additional trainable parameters to each layer: one which multiplies the activations and one which is added to the activations. It basically allows SGD to undo the normalization, not by changing every single weight, but by changing just two parameters for each activation. So it makes things much more stable in practice.
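To make that concrete, here's a minimal numpy sketch of what a batch norm layer computes during training (my own illustration, not code from the lesson):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each activation over the mini-batch...
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    x_hat = (x - mu) / (sigma + eps)
    # ...then apply a learned per-activation scale (gamma) and shift (beta),
    # so SGD can cheaply undo the normalization if it wants to.
    return gamma * x_hat + beta

acts = np.random.randn(64, 10) * 5 + 3          # a mini-batch of activations
out = batchnorm_forward(acts, gamma=np.ones(10), beta=np.zeros(10))
```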
So you can't just go ahead and stick batch norm into a pre-trained network, because if you do, it's going to take that layer's incoming activations, subtract the mean, and divide by the standard deviation, which means those pre-trained weights are wrong from then on: they were created for a completely different set of activations. It's not rocket science, but I realized all we need to do is insert a batch norm layer, figure out what the mean and standard deviation of the incoming activations would be for that dataset, and create the batch norm layer such that its two trainable parameters immediately undo that normalization. That way we can insert a batch norm layer and it will not change the outputs at all.

So I grabbed the whole of ImageNet, and I created our standard dense-layer model. I pre-computed the convolutional outputs for all of ImageNet, then created two batch norm layers, and I wrote a little function which lets us insert a layer into an existing model; I inserted the layers just after the two dense layers. Then here is the key piece: I set the weights on the new batch norm layers from the mean and variance I calculated on all of ImageNet, that is, the mean and variance of each of those two layers' outputs. That allowed me to insert these batch norm layers into an existing model, and afterwards I evaluated it and checked that it was indeed giving me the same answers as before.

As well as doing that, I thought: if you train a model with batch norm from the start, you end up with weights which are designed to take advantage of the fact that the activations are being normalized. So I wondered what would happen if we fine-tuned the ImageNet network on all of ImageNet after adding these batch norm layers. I tried training it for one epoch on both the ImageNet images and the horizontally flipped ImageNet images; that's what these 2.5 million here are. You can see that with modern GPUs, and with the convolutional layers pre-computed, it takes less than an hour to run the entirety of ImageNet twice. The interesting thing was that my accuracy on the validation set went up from 63% to 67%. So adding batch norm actually improves ImageNet, which is cool, though that wasn't the main reason I did it. The main reason was so that we can now use VGG with batch norm in our models.

So I did all that, saved the weights, and edited our VGG model. If you now look at the fully connected block in our VGG model, it has batch norm in there, and I also saved to our website a new weights file called VGG16BN, for batch norm. When I did cats and dogs, I used that model. So now, if you re-download vgg16.py from platform.ai, it will automatically download the new weights, and you will have all this without any changes to your code. I'll be interested to hear during the week, if you just re-run the code you've got, whether you see improvements; hopefully you'll find it trains more quickly and you get better results. At this stage I've only added batch norm to the dense layers, not the convolutional layers. There's no reason I shouldn't add it to the convolutional layers as well; I just had other things to do this week.
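Here's a minimal sketch of that identity-initialization trick, assuming Keras 1.x, and assuming its BatchNormalization layer stores its weights in the order [gamma, beta, moving mean, moving std]; that ordering is an assumption worth checking against your own Keras version. The name dense_acts is hypothetical:

```python
import numpy as np
from keras.models import Sequential
from keras.layers.normalization import BatchNormalization

# dense_acts: pre-computed activations of one pre-trained dense layer over
# the whole dataset, shape (n_samples, 4096). Hypothetical name.
mean, std = dense_acts.mean(axis=0), dense_acts.std(axis=0)

bn_model = Sequential([BatchNormalization(input_shape=(4096,))])
# The layer computes y = gamma * (x - mean)/std + beta, so setting
# gamma = std and beta = mean makes it the identity at first: it
# normalizes the activations and then immediately undoes it.
bn_model.layers[0].set_weights([std, mean, mean, std])
```

A layer built this way can then be spliced in after the corresponding dense layer (I used a small helper that rebuilds the model with the extra layer), and predictions should be unchanged before any fine-tuning.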
But since most of us are mainly fine-tuning just the dense layers, this should impact most of us the most anyway. So that's an exciting step which everybody can now use. The other thing to mention is that, now that you'll be using batch norm by default in your VGG networks, you should find that you can increase your learning rates. Because batch norm normalizes the activations, it makes sure no activation has gone really high or really low, and that means that generally speaking you can use higher learning rates. So if you try higher learning rates in your code than before, you should find they work pretty well. You should also find that things that previously wouldn't train will now start to train, because often the reason they don't train is that one of the activations shoots off really high or really low and screws everything up, and that tends to get fixed when you use batch norm. So there are some things to try this week; I'd be interested to hear how you go.

Last week we looked at collaborative filtering, and to remind you, we had a file that basically meant something like this: we had a bunch of movies and a bunch of users, and for some subset of those combinations we had a review of that movie by that user. The way the actual file came to us, however, didn't look like this; this is a crosstab. The way the file came to us looked like this: each row was a single user rating a single movie with a single rating at a single time. I showed you in Excel how we could take the crosstab version and create a table of dot products, where each dot product was between a set of five random numbers for the movie and five random numbers for the user, and that we could then use gradient descent to optimize those sets of five random numbers for every user and every movie. If we did so, we ended up getting pretty decent guesses for the original ratings. Then we went a step further in the spreadsheet, and we learned how you could take the dot product and also add on a movie bias and a user bias. We saw all that in Excel, and we also learned that Excel comes with a gradient descent solver called, funnily enough, Solver; if we ran Solver, telling it these are our varying cells and this is our target cell, it came up with some pretty decent weight matrices.

We learned that these kinds of weight matrices are called embeddings. An embedding is basically something where we can start with an integer, like 27, and look up movie number 27's vector of weights: that vector is its embedding. In collaborative filtering, this particular kind of embedding is also known as latent factors, and we hypothesized that once trained, each of these latent factors may mean something, and I said we might come back this week and see if we can figure out what they mean. So that's what I thought I would do now. I'm going to take the bias model that we created: the one where we took a user embedding and a movie embedding, took the dot product of the two, and then added a user bias and a movie bias, where those biases are just embeddings with a single output, just like in Excel, where the bias was a single cell for each movie and a single cell for each user.
So then we tried fitting that model, and you might remember that we ended up getting an accuracy that was quite a bit higher than the previous state of the art. Actually, sorry, for that one we didn't; we broke the previous state of the art using the neural network. But I discovered something interesting during the week, which is that I can get a state-of-the-art result using just this simple bias model, and the trick was that I just had to increase my regularization.

We haven't talked much about regularization; we've briefly mentioned it a couple of times. It's a very simple thing where we basically say: add to the loss function the sum of the squares of the weights. We're trying to minimize the loss, so if you're adding the sum of the squares of the weights to the loss function, the SGD solver is going to have to try to avoid increasing the weights where it can. We can pass to most Keras layers a parameter called W_regularizer, which stands for weight regularizer, and tell it how to regularize our weights: in this case I say use an L2 norm, which means the sum of the squares, and pass in how much to weight it, which here was 1e-4. It turns out if I do that and then train for a while, it takes quite a lot longer to train, but I got down to a loss of 0.7979, which is quite a bit better than the best results that Stanford paper showed. It's not quite as good as the neural net, which got 0.7938 at best, but it's still interesting that this very simple approach gets results better than the academic state of the art as of 2012 or 2013, and I haven't been able to find more recent academic benchmarks than that.

So I took this model, and I wanted to find out what we can learn from these results. Obviously one thing you would do with this model is just make predictions with it: if you're building a website for recommending movies and a new user came along and said "I like these movies this much, what else would you recommend?", you could just go through and do a prediction for each movie for that user ID and tell them which ones had the highest numbers. That's the normal way we would use collaborative filtering. But we can do some other things. We can grab the top 2,000 most popular movies, just to make this more interesting, and say: let's just grab the bias term. I'll talk more about this particular syntax in a moment, but for now, this is a particularly simple kind of model: it takes the movie ID in and returns the movie bias out. In other words, it looks up the movie bias indexed by that movie ID and returns it; that's what these two lines do. I then combine that bias with the actual name of each movie and print out the top and bottom 15. So, according to MovieLens, the worst movie of all time is the Church of Scientology classic Battlefield Earth; sorry, John Travolta.

This is interesting, because these ratings are quite a lot more sophisticated than your average movie rating. These have been normalized for the fact that some reviewers are more positive or negative than others, and some people watch better or crappier films than others. This bias is removing all of that noise, and really telling us, after removing all that noise, these are the least good movies; and Battlefield Earth is even worse than Spice World, by a significant margin.
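As a rough sketch of that bias-extraction step, using the functional API pieces described below (the names movie_in, mb, topMovies, and movie_names are my own reconstruction, not necessarily the lesson notebook's):

```python
from keras.models import Model

# movie_in: the movie-id Input layer; mb: the Flatten()'d single-output
# bias embedding from the bias model. topMovies: ids of the 2000 most
# popular movies; movie_names: id -> title mapping.
get_movie_bias = Model(movie_in, mb)
biases = get_movie_bias.predict(topMovies).squeeze()

# Pair each bias with its movie name and look at the extremes.
ratings = sorted(zip(biases, (movie_names[i] for i in topMovies)))
print(ratings[:15])     # worst 15, with reviewer/user noise removed
print(ratings[-15:])    # best 15
```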
On the other hand, here are the best; Miyazaki fans will be pleased to see Howl's Moving Castle at number two. So that's interesting. Perhaps more interesting, though, is to try to figure out what's going on, not in the biases, but in the latent factors. The latent factors are a little harder to interpret, because for every movie we have 50 of them; in the Excel spreadsheet we had five, but in our version we have 50. So what we want to do is take those 50 latent factors and find two or three main components. The way we do this, the details aren't important, but a lot of you will already be familiar with it: there's something called PCA, or principal components analysis. PCA does exactly what I just said: it looks through a matrix, in this case one with 50 columns, and asks what combinations of columns we can add together, because they tend to move in the same direction. So in this case we start with our 50 columns and create just three columns that capture as much as possible of the information in the original 50. If you're interested in learning more about how this works, information about PCA is everywhere on the internet; but as I say, the details aren't important. The important thing to recognize is that we're just squishing our 50 latent factors down into three.

If we look at the first PCA factor and sort on it, we can see that at one end we have fairly well-regarded movies like The Godfather, Pulp Fiction, and The Usual Suspects; these are all things which I guess are classics. At the other end we have things like Ace Ventura and RoboCop 3, which are perhaps not so classic. So our first PCA factor is some kind of "classic" score. On our second one we have something similar but actually very different: at one end we've got movies that are huge Hollywood blockbusters with lots of special effects, and at the other end we have things like Annie Hall and Brokeback Mountain, which are dialogue-heavy and not big Hollywood hits. So that's another dimension: the first factor is the most important dimension along which people judge movies differently, and this is the second most important one. The third most important one is something where at one end we have a bunch of violent and scary movies, and at the other end a bunch of very happy movies. For those of you who haven't seen Babe, the Australian movie, it's the happiest movie ever: it's about a small pig, its adventures, and its path to success. So, happiest movie ever, according to MovieLens.

That's interesting, right? It's not saying these factors are good or bad or anything like that; it's just saying that these are the things that popped out of this matrix decomposition as the ways in which people differ in their ratings of different kinds of movies. One of the reasons I wanted to show you this is to say that these kinds of SGD-learned, many-parameter networks are not inscrutable. Indeed, it's not great to go in and look at every one of those 50 latent factor coefficients in detail; instead, you have to think about how to visualize them, how to look at them.
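A sketch of that squishing step with scikit-learn (again, variable names here are my own; movie_emb would be the trained embedding weights, e.g. pulled out with get_weights()):

```python
from sklearn.decomposition import PCA

# movie_emb: the (n_movies, 50) latent-factor matrix from the trained
# movie embedding layer. Hypothetical name.
movie_pca = PCA(n_components=3).fit_transform(movie_emb)

# Sort the movies along the first component and peek at both ends.
order = movie_pca[:, 0].argsort()
print([movie_names[i] for i in order[:10]])     # one end of the factor
print([movie_names[i] for i in order[-10:]])    # the other end
```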
In this case I actually went a step further: I grabbed a couple of principal components and tried drawing a picture, because with pictures you can start to see things in multiple dimensions. Here I've got the first and third principal components. You can see that on the far right-hand side we have more of the Hollywood-type movies, and at the far left some of the more classic movies; at the top, some of the more violent movies, and at the bottom, some of the happier movies, with Babe so far happy that it spreads off the bottom. So if you wanted to find a movie that was violent and classic, you would go to the top left, and yes, Kubrick's A Clockwork Orange would probably be the one most people would come up with first; or if you wanted something very Hollywood and very non-violent, you would be down here with Sleepless in Seattle.

So you can really learn a lot by looking at these kinds of models, but you don't do it by looking at the coefficients; you do it with visualizations, by interrogating the model. I think this is a big difference. For any of you that have done much statistics before, or have a background in the social sciences, you've spent most of your time doing regressions and looking at coefficients and t-tests and the like. This is a very different world: this is a world where you ask the model questions and get back the model's results, which is what we're doing here.

I mentioned I would talk briefly about this syntax. It's something we're going to be using a lot more of, and it's part of what's called the Keras functional API. The functional API is a way of doing exactly the same things you've already learned how to do, using a different API, and that is not such a dumb idea. The API you've learned so far is the sequential API: you use the word Sequential and then write, in order, the layers of your neural network. That's all very well, but what if you want to do something like what we just did, where we had two different things coming in, a user ID and a movie ID, each going through its own embedding and then getting multiplied together? How do you express that as a sequence? It's not very easy. The functional API was designed to answer this question.

The first thing to note about the functional API is that you can do everything in it that you can do in the sequential API. Here's an example of something you could do perfectly well with the sequential API, something with two dense layers, but it looks a bit different. Every functional API model starts with an input layer, which you assign to some variable; then you list each of the layers in order, and for each one, after providing the details for that layer, you immediately call the layer, passing in the output of the previous layer. So this passes in "inputs" and calls the result x; then this passes in our x and gives us a new version of x; and then this next dense layer takes that version of x and returns "predictions". So each layer is saying what its previous layer is. It's doing exactly the same thing as the sequential API, just in a different way. Now, as the docs note here, the sequential model is probably a better choice to implement this particular network, because it's easier; this is just showing that you can do it.
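Here's roughly that docs-style example, in Keras 1.x syntax:

```python
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))
x = Dense(64, activation='relu')(inputs)           # call each layer on the
x = Dense(64, activation='relu')(x)                # previous layer's output
predictions = Dense(10, activation='softmax')(x)

model = Model(input=inputs, output=predictions)    # Keras 1.x keyword names
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
```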
On the other hand, the model that we just looked at would be quite difficult, if not impossible, to do with the sequential API, but with the functional API it was very easy. We created a whole separate little model which gave an output u, for user: the result of creating an embedding, where the embedding has its own input which goes through an embedding layer, and we return both the input and the embedding layer, like so. That gave us our user input and user embedding, and our movie input and movie embedding: two separate little models. Then we did a similar thing to create two little models for our bias terms; they were both things that grabbed an embedding with a single output and then flattened it. So now we've got four separate pieces, and we can merge them.

There's this function called merge, and it's pretty confusing: there's a small-m merge and a big-M Merge. In general you will be using the small-m merge. I'm not going to go into the details of why they're both there; they are there for a reason, but if something weird happens to you with merge, try remembering to use the small-m merge. The small-m merge takes two previous outputs that you've just created using the functional API and combines them in whatever way you want, in this case the dot product. So that grabs our user and movie embeddings and takes the dot product; then we take the output of that and our user bias and take the sum, and the output of that and the movie bias and take the sum. That's the functional API way of creating this model. At the end, we use the Model function to actually create the model, saying what the inputs to the model are and what the output is. You can see this is different to usual, because we've now got multiple inputs, so when we call fit, we have to pass in an array of inputs: the user IDs and the movie IDs.

So the functional API is something we're going to be using increasingly from now on. Now that we've learned just about all the basic architectures, we're going to start building more exotic architectures for more special cases, and we'll be using the functional API more and more.

Question: is the only reason to use an embedding layer so that you can provide a list of integers as input? That's a great question, and the answer is absolutely yes. Instead of using the embedding layer, we could have one-hot encoded all of those user IDs and all of those movie IDs and created dense layers on top of them, and it would have done exactly the same thing.

Green box, please: why choose 50 latent factors and then reduce them down with principal components analysis, rather than just having three latent factors to begin with? Sure: if we only used three latent factors, our predictive model would have been less accurate. We want an accurate predictive model, so that when people come to our website we can do a good job of telling them what movie to watch; hence 50 latent factors for that. But for the purpose of our visualization, for understanding what those factors are doing, we want a small number, so that we can interpret them more easily.

OK, so one thing you might want to try during the week is taking one or two of your models and converting them to use the functional API, just as a little exercise to start to get the hang of how this API looks.
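Putting the pieces together, the whole bias model looks roughly like this in the Keras 1.x functional API; this is a reconstruction under those assumptions, not necessarily line-for-line the lesson notebook:

```python
from keras.layers import Input, Embedding, Flatten, merge
from keras.models import Model
from keras.regularizers import l2

# n_users, n_movies: counts of distinct ids; 50 latent factors each.
user_in  = Input(shape=(1,), dtype='int64')
u  = Embedding(n_users,  50, input_length=1, W_regularizer=l2(1e-4))(user_in)
movie_in = Input(shape=(1,), dtype='int64')
m  = Embedding(n_movies, 50, input_length=1, W_regularizer=l2(1e-4))(movie_in)

# Single-output embeddings for the two bias terms, flattened to scalars.
ub = Flatten()(Embedding(n_users,  1, input_length=1)(user_in))
mb = Flatten()(Embedding(n_movies, 1, input_length=1)(movie_in))

x = merge([u, m], mode='dot')    # small-m merge: dot product of the factors
x = Flatten()(x)
x = merge([x, ub], mode='sum')   # ...plus the user bias
x = merge([x, mb], mode='sum')   # ...plus the movie bias

model = Model(input=[user_in, movie_in], output=x)
model.compile(optimizer='adam', loss='mse')
# Two inputs, so fit takes a list: model.fit([users, movies], ratings, ...)
```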
Question: are these functional models how we would add additional information to images in CNNs, say driving speed or turning radius? Yes, absolutely. In general, adding additional information to, say, a CNN is basically adding metadata, and this happens in collaborative filtering a lot: you might have a collaborative filtering model where, as well as the ratings table, you also have information about what genre each movie is in, or demographic information about the users, and you can incorporate all of that by having additional inputs. With a CNN, I'll give you a good example: the new Kaggle fish recognition competition. One of the things that turns out to be a useful predictor (this is a leakage problem) is the size of the image. So you could have another input which is the height and width of the image, just as integers; have that as a separate input which is concatenated to the output of your convolutional layers, after the first flatten layer; and then your dense layers can incorporate both the convolutional outputs and your metadata. That would be a good example. That's a great question; two great questions, in fact.
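To make that metadata idea concrete, here's a hypothetical sketch of a two-input model in Keras 1.x: an image plus its (height, width), concatenated after the flatten. Everything here, including the Theano-style channel ordering and the eight output classes, is illustrative, not the actual competition code:

```python
from keras.layers import Input, Convolution2D, MaxPooling2D, Flatten, Dense, merge
from keras.models import Model

img_in  = Input(shape=(3, 224, 224))   # Theano dim ordering, as in the course
meta_in = Input(shape=(2,))            # e.g. original image height and width

x = Convolution2D(32, 3, 3, activation='relu')(img_in)
x = MaxPooling2D()(x)
x = Flatten()(x)
x = merge([x, meta_in], mode='concat')  # bolt the metadata on after flattening
x = Dense(256, activation='relu')(x)
out = Dense(8, activation='softmax')(x) # e.g. 8 fish classes

model = Model(input=[img_in, meta_in], output=out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```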
So, you might remember from last week that this whole thing about collaborative filtering was a journey to somewhere else, and the destination is NLP: natural language processing. First, though, one more question about collaborative filtering: if we need to predict the missing values, the NaNs or the 0.0s, i.e. where a user hasn't watched a movie, what would the prediction be, and how do we go about making it? Well, this is really the key purpose of creating this model: so that you can make predictions for movie-user combinations you haven't seen before. The way you do that is to simply call model.predict and pass in a movie ID / user ID pair you haven't seen before. All that's going to do is take the dot product of that movie's latent factors and that user's latent factors, add on those biases, and return you the answer. It's that easy. If this was a Kaggle competition, that's how we would generate our submission: by running the test set, which would be a bunch of movie-user pairs we hadn't seen before, through model.predict.

OK, natural language processing. Collaborative filtering is extremely useful in itself; without any doubt, it is far more commercially important right now than NLP is. Having said that, fast.ai's mission is to impact society in as positive a way as possible, and doing a better job of predicting movies is not necessarily the best way to do that, so we're maybe less excited about collaborative filtering than some people in industry are. That's why it's not our main destination. NLP, on the other hand, can be a very big deal. If you can do a good job, for example, of reading through lots of medical journal articles, or family histories and patient notes, you could be a long way towards creating a fantastic diagnostic tool to use in the developing world, to help bring medicine to people who don't currently have it. Which is almost as good as telling them not to watch Battlefield Earth; they're both important.

So let's talk a bit about NLP. In order to do this, we're going to look at a particular dataset, a really classic example of what people do with natural language processing, called sentiment analysis. Sentiment analysis means you take a piece of text, which could be a phrase, a sentence, a paragraph, or a whole document, and decide whether it expresses positive or negative sentiment. Keras actually comes with such a dataset, called the IMDB sentiment dataset. It was originally developed by the Stanford AI group, and the paper about it, I think published in 2012, is this one here. They talk about all the details of what people try to do with sentiment analysis in general. Although academic papers tend to be way more mathy than they should be, the introductory sections often do a great job of capturing why a problem is interesting and what kinds of approaches people have taken. The other reason papers are super helpful is that you can skip down to the experiment section, which pretty much every machine learning paper has, and find out what the score to beat is. So here's the score section: using this dataset they created of IMDB movie reviews along with their sentiment, their full model plus an additional model got 88.33% accuracy in predicting sentiment. They had another result where they also added in some unlabeled data; we're not going to look at that today, since that would be a semi-supervised learning problem. So today our goal is to beat 88.33, that being the academic state of the art for this dataset, at least as of that paper.

To grab the data, we can just say from keras.datasets import imdb. Keras actually fiddles around with the dataset in ways that I don't really like, so I copied and pasted from the Keras source the three lines that import it directly, without that screwing around; that's why, rather than using the Keras dataset directly, I'm using these three lines. There are 25,000 movie reviews in the training set, and here is an example of one: "Bromwell High is a cartoon comedy. It ran at the same time as some other programs...", and so on. The dataset doesn't actually come to us in that format, though; it comes as a list of IDs. We can look up these IDs in the word index, which is something they provide. The word index, as you can see, basically maps an integer to every word, and it's ordered by how frequently those words appear in this particular corpus, which is handy. I also create a reverse index, which goes from ID to word. So I can see that in the very first training example, the very first word is word number 23022, and if I look up 23022 in the index-to-word mapping, it's the word "bromwell". Then I just go through and map everything in that first review through index-to-word and join it together with spaces, and that's how we turn the data they give us back into the movie review.

As well as the reviews, they also provide labels: one is positive sentiment, zero is negative. So our goal is to take these 25,000 reviews, which look like this, and predict whether each will be positive or negative in sentiment, where the data is provided to us as a list of word IDs for each review. Is everybody clear on the problem we're trying to solve and how it's laid out?
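As a sketch, turning the first review's IDs back into text looks roughly like this; the variable name x_train is mine, standing in for whatever the three-line loader returns:

```python
from keras.datasets import imdb

word2idx = imdb.get_word_index()                   # word -> integer id
idx2word = {v: k for k, v in word2idx.items()}     # reverse index: id -> word

# x_train[0] is the first review, a list of word ids (hypothetical name)
print(' '.join(idx2word[i] for i in x_train[0]))
```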
OK, you guys are quick. So there are a couple of things we can do to make this simpler. One is to reduce the vocabulary. Currently there are some pretty unusual words, like word number 23022, "bromwell", and having to figure out the various ways in which the word "bromwell" is used probably wouldn't net us much, for a lot of computation and memory cost. So we're going to truncate the vocabulary down to 5,000 words. It's very easy to do that, because the words are already ordered by frequency: I simply go through everything in our training set and say, if the word ID is less than our vocabulary size of 5,000, leave it as it is; otherwise replace it with that sentinel ID. At the end of this, we've replaced all of our rare words with a single ID.

Here's a quick look at the sentences: the reviews are sometimes up to 2,493 words long (some people spend far too much time on IMDb), some are as short as 10 words, and on average they're 237 words. As you will see, we need to make all of our reviews the same length. Allowing for that 2,493-word review would use up a lot of memory and time, so we're going to truncate every review at 500 words. That's more than twice the mean, so we're not going to lose too much.

Question: what if the word at ID 5,000 introduces a bias? Well, let's find out what word 5,000 is: idx2word[5000] gives "1987", the year, I guess. That's fine: we're about to learn a machine learning model, and the vast majority of the time the model comes across word ID 5,000, it's actually going to mean "rare word", not specifically "1987", and it's going to learn to deal with that as best it can. The rare words don't appear too often, so hopefully this won't cause too much of a problem. And doesn't just using frequencies favor stop words? We're not really using frequencies; all we're doing is truncating our vocabulary at this point. (Can you hold that closer to your mouth?) So, the question is: for that word at 5,000, could we replace it with some neutral word to take care of that bias? There's really not going to be a bias here; we're just replacing it with a sentinel ID. The fact that occasionally the word "1987" actually pops up is totally insignificant; we could replace it with minus one, it's just a sentinel value with no meaning. Also, we're getting rid of all the words at 5,000 and beyond, right? That's right: it's not just a single word, it represents all of the less common words. It's one of those design decisions that isn't worth spending a lot of time thinking about, because it's not significant, so I just picked whatever happened to be easiest at the time; as I said, I could equally well have used minus one. It's just not important.

OK. What is important is that we have to create a rectangular matrix that we can pass to our machine learning model. Quite conveniently, Keras comes with something called pad_sequences that does that for us: it takes everything longer than our chosen length and truncates it, and everything shorter it pads with whatever we ask for, in this case zeros.
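Both steps together look roughly like this (a sketch; x_train is again the raw list-of-id-lists from the loader):

```python
from keras.preprocessing import sequence

vocab_size, seq_len = 5000, 500

# Words are already indexed by frequency, so truncating the vocabulary is
# just clipping: every rare id collapses into the sentinel id vocab_size-1.
trn = [[min(i, vocab_size - 1) for i in review] for review in x_train]

# Make a rectangular matrix: truncate long reviews, zero-pad short ones.
# Keras pads and truncates at the front by default.
trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
print(trn.shape)   # (25000, 500)
```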
So at the end of this, the shape of our training set is a numpy array of 25,000 rows by 500 columns, and as you can see, it's padded the front with zeros so that each review has 500 words in it. Other than that, it's exactly the same as before, and you can see "bromwell" has actually been replaced not with 5000 but with 4999. So this is our same movie review again, after going through that padding process. Does it matter whether we pad from the left or the right? I know there's some reason Keras decided to pad the front rather than the back; I don't recall what it is, and since it's what it does by default, I don't worry about it. I don't think it's important.

So now that we have a rectangular matrix of numbers, and we have some labels, we can use the exact techniques we've already learned to create a model. As per usual, we should try to create the simplest possible model to start with, and we know the simplest model is one with a single hidden layer in the middle; or at least, that's the simplest model that we generally think ought to be pretty useful for just about everything. Now, here is why we started with collaborative filtering: because this model starts with an embedding. If you think about it, our inputs are word IDs, and we want to convert each one into a vector, and that is exactly what an embedding does. Again, rather than one-hot encoding each ID into a huge 5,000-column input and then doing a matrix product, an embedding just says: look up that ID and grab its vector directly. It's a computational and memory shortcut for a one-hot encoding followed by a matrix product. So we create an embedding with 5,000 entries, where each embedding vector has, in this case, 32 items rather than 50. Then we flatten that, have our single dense layer, a bit of dropout, and then our output through a sigmoid.

So that's a pretty simple model. It's a good idea to go through and make sure you understand why all these parameter counts are what they are; that's something you can do during the week, double-checking that you're comfortable with the size of each of the weight matrices at each point. We can fit it, and after two epochs, in fact, sorry, it's already overfitting, so we should just do one epoch. After one epoch we have 88% accuracy on the validation set. Let's compare that to Stanford: they had 88.33, and we have 88.04, so we're not quite there yet, but we're well on the right track.
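The model itself is only a few lines. This is a close reconstruction in Keras 1.x; the 100-unit hidden layer and the 0.7 dropout are my recollection of reasonable values for this lesson, not gospel:

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout

model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),  # 5000 ids -> 32 floats each
    Flatten(),                                        # 500 * 32 = 16000 inputs
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid'),                   # single binary output
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# labels are the 0/1 sentiment labels; test/labels_test are the held-out set
model.fit(trn, labels, validation_data=(test, labels_test), nb_epoch=1)
```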
Two questions about the architecture. First, why 32? This is always the question, like why have x filters in your convolutional layer, or x outputs in your dense layer: it's just a case of trying things and seeing what works, and also getting some intuition by looking at other models. In this case, I think 32 was the first thing I tried. From my understanding of really big embedding models, which we'll learn about shortly, even 50 dimensions is enough to capture vocabularies of size 100,000 or more, so I felt 32 was likely to be more than enough for a vocabulary of 5,000. I tried it, got a pretty good result, and basically left it there. If at some point I discovered I wasn't getting great results, I would try increasing it.

Second, why a sigmoid in the final layer? You can always use a softmax instead of a sigmoid; it just means you would have to change your labels, because remember, our labels here were just ones and zeros, a single column. If I wanted to use a softmax, I would have to create two columns: one-zero, zero-one, and so on. In the past I've generally stuck to using softmax with categorical cross-entropy loss, just to be consistent, because then regardless of whether you have two classes or more than two, you can always do the same thing. In this case I wanted to show the other way you can do it, which is to have a single-column output; and remember, a sigmoid is exactly the same thing as a softmax if you have a binary output. Here, one is positive and zero is negative, and rather than categorical cross-entropy we use binary cross-entropy, which again is exactly the same thing; it just means I didn't have to worry about one-hot encoding the output, because it's a binary output.

Do we know what the inter-rater agreement is for this dataset? No, we don't; it's not something I've looked at. The important thing, as far as I'm concerned, is the benchmark the Stanford people got: they compared their technique to a range of previous benchmarks and found theirs was the best, so that's my goal here. I'm sure other techniques have come out since that are probably better, but I haven't seen them in any papers yet, so this is my target. So you can see that with one second of training, we get an accuracy that's pretty competitive, from just a simple neural net. Hopefully you're starting to get a sense that a neural net with one hidden layer is a great starting point for nearly everything. You now know how to create a pretty good sentiment analysis model, and before today you didn't; so that's a good step. Just to confirm: whenever we use binary cross-entropy, we use a sigmoid? Yes.

And could you explain embedding again: what is the actual input into the dense layer for, say, word 2345? Great; let's go over embeddings once more. I think it's particularly helpful to go back to our MovieLens recommendation dataset, and remember that the actual data coming in does not look like the crosstab; it looks like the list of individual ratings. So when we come along and say, OK, what do we predict the rating would be for user ID 1 and movie ID 1172, we actually have to go through our list of movie IDs and find, say, movie number 31; having found it, look up its latent factors; then do the same thing for user ID 1 and find its latent factors; and then multiply the two together. That step of taking an ID, finding it in a list, and returning the vector attached to it, that is what an embedding is. An embedding returns a vector, in this case of length 32. So the output of this embedding layer ("None" always means your mini-batch size) is that for each movie review, for each of the 500 words in that sequence, you get a 32-element vector; therefore you have a mini-batch-size by 500 by 32 tensor coming out of this layer. That gets flattened, so 500 times 32 gives 16,000, and that is the input into your first dense layer.
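Here's a tiny numpy illustration of my own showing that an embedding lookup and "one-hot encode then matrix-multiply" are the same operation:

```python
import numpy as np

vocab_size, n_factors = 5000, 32
emb = np.random.randn(vocab_size, n_factors)  # one 32-float row per word id

ids = np.array([23, 309, 6, 4999])            # a (tiny) review as word ids
vectors = emb[ids]                            # the whole "layer": array lookup

# Exactly equivalent, but slower and more memory-hungry:
onehot = np.eye(vocab_size)[ids]              # (4, 5000) one-hot matrix
assert np.allclose(np.dot(onehot, emb), vectors)
```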
I also think it might be helpful to show it to you not in words but as it's actually entered: as a sequence of numbers. So we look at this first review, and remember the first word's ID has been truncated to 4999, while this one here is still 309. The model is going to take 309, look up the 309th vector in the embedding, and return it, and then concatenate all of these lookups to create this tensor. That's all an embedding is: a shortcut for a one-hot encoding followed by a matrix product.

Two other questions. Can you show us words which have similar latent factors? I'd hope those words would be synonyms or semantically similar. Yes, we'll see that shortly. And: who made the labels, and why should I believe them? It seems difficult and subjective. Well, that's the whole point of sentiment analysis and these kinds of tasks: it's totally subjective. The interesting thing about NLP is that we're trying to capture something which is very subjective. In this case, you would have to read the original paper to find out how they got these particular labels. The ways people tend to get labels are these: in this case it's the IMDB dataset, and IMDb has ratings, so you could just say anything higher than eight is very positive, anything lower than two is very negative, and throw away everything in the middle. The other way people tend to label academic datasets is to send them off to Amazon Mechanical Turk and pay a few cents per label. And there are places where people don't just use Mechanical Turk but specifically hire, say, linguistics PhDs; you certainly wouldn't do that here, because the whole purpose is to capture normal people's sentiment, but for example, I know of a team at Google that does that, and when I was in medicine, we went through all these radiology reports and tried to capture which ones had critical findings and which ones didn't, and we used good radiologists rather than Mechanical Turk for that purpose.

So are we not considering any sentence construction here, just a bag of words, the literal set of words used in a comment? It's not actually just a bag of words, if you think about it: this dense layer here has 1.6 million parameters, connecting every one of those 500 positions to our output, and not only that, it's doing so for every one of the incoming factors. So it's creating a pretty complex cartesian product of all of these weights, and it is taking account of the position of a word in the overall review. It's not terribly sophisticated, and it's not taking account of a word's position relative to other words, but it does take account of roughly whereabouts each word occurs in the whole review. So it's not the dumbest kind of model I could come up with; it's just a good starting point, and we'd expect that with a little bit of thought, which we're about to apply, we could do a lot better.

So why don't we go ahead and do that? The slightly better model, and hopefully you've all predicted what it would be, is a convolutional neural network. The reason I hope you predicted that is because (a) we've already talked about how CNNs are taking over the world, and (b) specifically, they're taking over the world any time we have some kind of ordered data; and clearly a sentence is ordered, one word comes after another, it has a specific ordering. Therefore we can use a convolution.
We can't use a 2D convolution, though, because a sentence is not 2D; a sentence is 1D, so we're going to use a 1D convolution. A 1D convolution is even simpler than a 2D convolution: we grab a string of a few words, take their embeddings, multiply that string by some filter, and then move the filter along our sentence. So this is our normal next step as we gradually increase complexity: grab our simplest possible CNN, which is a convolution, dropout, max pooling, then flatten, and then our dense layer and output. This is exactly like what we did back when we were gradually improving our State Farm result, but rather than Convolution2D we have Convolution1D. The parameters are exactly the same: how many filters do you want to create, and what is the size of your convolution? Originally I tried three here; five turned out to be better, so I'm looking at five words at a time and multiplying them by each of 64 filters. We start with the same embedding as before: we take our sentences and turn each input into a 500 by 32 matrix. We then put that through our convolution, and because the convolution has a border mode of "same", we get back exactly the same shape we gave it. Then we put it through our 1D max pooling, which halves its size, and then stick it through the same dense layers as we had before. So that's a really simple convolutional neural network for words. Compile it, run it, and we get 89.47, compared to, let's go back to the video tape, 88.33 without any unlabeled data. So we have already beaten the academic state of the art as of when this paper was written; again, a simple convolutional neural network gets us a very, very long way.

I was going to point out that it's ten to eight, so maybe we should have a break, but there's also a question: Convolution2D for images is easy to understand, element-wise multiplication and addition, but what does it mean for a sequence of words? Don't think of it as a sequence of words, because remember, it's been through an embedding, so it's a sequence of 32-element vectors. It's doing exactly the same thing as a 2D convolution, but rather than having three channels of color, we have 32 channels of embedding. Remember our convolution spreadsheet example: in the second one, once we had two filters already, our filter had to be a 3 by 3 by 2 tensor in order to allow us to create the second layer. Here we don't have a 3 by 3 by 2 tensor; we have a 5 by 1 by 32, or more conveniently a 5 by 32, matrix. So each convolution is going to go through each of the five words and each of the 32 embedding channels, do an element-wise multiplication, and add it all up.

The important thing to remember is that once we've been through the embedding layer, which is always going to be the first step for every NLP model, we don't have words anymore. We now have vectors which attempt to capture the information in each word in some way, just like our latent factors captured information about a movie and a user in our collaborative filtering. We haven't yet looked at what these word vectors do; we will in a moment, just like we did with the movie vectors.
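Here's roughly what that 1D CNN looks like in Keras 1.x; a close reconstruction of the model just described, with the dropout amounts being my recollection of reasonable values:

```python
from keras.models import Sequential
from keras.layers import Embedding, Convolution1D, MaxPooling1D, Flatten, Dense, Dropout

conv = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),  # 64 filters,
    Dropout(0.2),                                                 # 5 words wide
    MaxPooling1D(),      # halves the sequence length (default pool size 2)
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid'),
])
conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```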
But we do know from our experience that SGD is going to try to fill out those 32 places with information about how each word is being used, whatever allows us to make these predictions. Just like when you first learned about 2D convolutions, it probably took you a few days of fiddling around with spreadsheets and pieces of paper and Python, checking inputs and outputs, to get a really intuitive understanding of what a 2D convolution is doing. You may find the same with a 1D convolution, except it'll probably take you a fifth of the time to get there, because you've really done all the hard work already. I think now is a great time to have a break, so let's come back here at 7:57.

There are a couple of concepts that we come across from time to time in this class where there's no way that me lecturing at you is going to be enough to build an intuitive understanding. The first, clearly, is the 2D convolution, and hopefully you've had lots of opportunities to experiment and practice and read; these are things you have to tackle from many different directions to understand. And 2D convolutions are in a sense really 3D, because if the image is in full color you've got three channels; hopefully that's something you've all played with. Once you have multiple filters later on in your image models, you still have 3D, with more than three channels: you might have 32 or 64 filters. In this lesson we've introduced one actually much simpler concept, but it's still new: the 1D convolution, which is really a 2D convolution, because just like with images we had red, green, and blue, we now have the 32 (or however many) embedding factors. So that's something you will definitely need to experiment with: create a model with just an embedding layer, look at what the output is, what its shape is, what it looks like, and then how a 1D convolution modifies that.

And then trying to understand what an embedding is, is kind of your next big task, if you're not already feeling comfortable with it; and if you hadn't seen them before today, I'm sure you won't be yet, because this is a big new concept. It's not in any way mathematically challenging: it's literally looking up an array and returning the thing at that index. An embedding, given the ID 3, goes to the third column of the matrix and returns what it finds there. That's all an embedding does; there couldn't be a mathematically simpler operation: return the thing at this index. But the intuitive understanding of what happens when you put an embedding into an SGD-trained model and learn a vector which turns out to be useful, that is kind of mind-blowing; because, as we saw from the MovieLens example, with just a dot product and this simple look-something-up-in-an-index operation, we ended up with vectors which captured all kinds of interesting features about movies, without us in any way asking it to.

So I want to make sure that you guys really feel like, after this class, you're going to go away and try to find a dozen different ways of looking at these concepts. One of those ways is to look at how other people explain them. Chris Olah has one of the very best technical blogs I've come across, and I've quite often referred to it in this class; in his "Understanding Convolutions" post, he actually has a very interesting
example of thinking about what a dropped ball does as a convolutional operation, and he shows how you can think about a 1D convolution using this dropped-ball analogy. Particularly if you have some background in electrical or mechanical engineering, I suspect you'll find it a very helpful example. There are many resources out there for thinking about convolutions, and I hope some of you will share on the forums any that you come across.

There are a few questions. One, from just before the break: essentially, are we training the input too? Yes, we are absolutely training the input, because the only input we have is 25,000 sequences of 500 integers, and we take each of those integers and replace it with a lookup into a matrix. Initially that matrix is random, just like in our Excel example: we started with a random matrix, these are all random numbers, and then we created this loss function, which was the sum of squared differences between the dot product and the rating. If we then use the gradient descent solver in Excel to solve that, it attempts to modify the two embedding matrices, and as you can see, the objective goes down as it tries to come up with the two embedding matrices which give us the best approximation of the original ratings matrix. This Excel spreadsheet is something you can play with; it does exactly what our first MovieLens example does in Python, the only difference being that our Python version also has L2 regularization. This one has just finished here, and you can see these are no longer random: we've now got two embedding matrices which have brought the loss function down from 40 to 5.6, and you can see, for example, that these predicted ratings are now very close to what they're meant to be. So this is exactly what Keras and SGD are doing in our Python example. And so the question was: do we end up with an embedding in which each word has its own vector of 32 elements? Yes, exactly: each word in our vocabulary of 5,000 is being converted into a vector of 32 elements.

Another question: what would be the equivalent dense network if we didn't use an embedding, a dense layer with input of embedding size, of half size? I actually don't know what that means, sorry. Next question: does it matter that encoded values which are close together are close in color in the case of pictures, which is not true for word IDs? For example, 254 and 255 are close as colors, but for words the IDs have no relation. The important thing to realize is that the word IDs are not used mathematically in any way at all, other than as an index for looking things up. The fact that this is movie number 27, the number 27, is not used in any way; we just take the number 27 and find its vector. What matters is the values of each latent factor, and whether those are close together. In the movie example, there was some latent factor that was something like "is it a Hollywood blockbuster", and some latent factor that was something like "is it a violent movie": it's the similarity on those factors that matters. The ID is never, ever used other than as an index into a matrix, to return the vector found there. So, as was just mentioned, in our case now, for the word embeddings, we look up in our embedding matrix to return a 32-element vector of floats that are initially random, and the model
is trying to learn, for each of our words, the 32 floats that are semantically useful. In a moment we're going to look at some visualizations of that, to try to understand what it has actually learned.

You have used dropout on the embedding as well as on the next layer: what is the significance, and what is the difference between the two? Great question. You can apply the dropout parameter to the embedding layer itself, and what that does is zero out, at random, 20% of each of those 32 embedding values for each word; so it's basically avoiding overfitting to the specifics of each word's embedding. This other dropout, on the other hand, is effectively removing some of the words at random, i.e. some of the whole vectors. The significance of which one to use where is not something I've seen anybody research in depth, so I'm not sure we have an answer that says use this amount in this place. I just tried a few different values in different places, and putting the same amount of dropout in all these different spots seemed to work pretty well in my experiments, so that's a reasonable rule of thumb. If you find you're massively overfitting, or massively underfitting, try playing around with the various values, and report back on the forum to tell us what you find; maybe you'll find some better configurations than I've come up with. I'm sure some of you will.

OK, so let's think about what's going on here. We are taking each of the 5,000 words in our vocabulary and replacing it with a 32-element-long vector, which we are training to, hopefully, capture all of the information about what the word means and how it works. Now, you might expect, intuitively, that somebody has done this before. Just like with ImageNet and VGG, you can get a pre-trained network that says: if you've got an image that looks a bit like a dog, well, we've trained a network which has seen lots of dogs, so it will probably take your dog image and return some useful predictions, because it has seen lots of dog images before. The interesting thing here is that your dog picture and the VGG authors' dog pictures are not the same; they're going to be different in all kinds of ways, and so to get pre-trained weights for images, somebody has to give you a whole pre-trained network: something like 500 megabytes of weights plus a whole architecture. Words are much easier. In a document, the word "dog" always appears the same way: it's the word "dog". It doesn't have different lighting conditions or facial expressions or whatever; it's just the word "dog". So the cool thing in NLP is that we don't have to pass around pre-trained networks; we can pass around pre-trained embeddings, or as they're commonly known, pre-trained word vectors. That is to say, other people have already trained big models on big text corpora, in which they've attempted to build a 32-element (or however long) vector that captures all of the useful information about what a word is and how it behaves. For example, if we type in "word vector download"... you can see this is not quite what we wanted; let's try "word embeddings download": that's better, lots of questions and answers and pages about where we can download pre-trained word embeddings.

So that's pretty cool. But, a student asks, what was a little unintuitive to me is that this means that if I train on a corpus of, I don't know, the works of Shakespeare, somehow that tells me something about how
Okay, so let's think about what's going on here. We are taking each of the 5,000 words in our vocabulary and replacing it with a 32-element-long vector, which we are training to hopefully capture all of the information about what this word means, what it does, and how it works. Now you might expect, intuitively, that somebody might have done this before. Just like with ImageNet and VGG: you can get a pre-trained network that says, if you've got an image that looks a bit like a dog, well, we've trained a network which has seen lots of dogs, so it will probably take your dog image and return some useful predictions, because it's done lots of dog images before. The interesting thing here is that your dog picture and the VGG authors' dog pictures are not the same; they're going to be different in all kinds of ways. So to get pre-trained weights for images, you have to give somebody a whole pre-trained network: something like 500 megabytes' worth of weights in a whole architecture. Words are much easier. In a document, the word "dog" always appears the same way: it's the word "dog". It doesn't have different lighting conditions or facial expressions or whatever; it's just the word "dog". So the cool thing is, in NLP we don't have to pass around pre-trained networks; we can pass around pre-trained embeddings or, as they're commonly known, pre-trained word vectors. That is to say, other people have already created big models on big text corpora where they've attempted to build a 32-element vector, or however long a vector, which captures all of the useful information about what each word is and how it behaves. So, for example, if we type in "word vector download"... that's not quite what we wanted; let's do "word embeddings download". That's better: lots of questions and answers and pages about where we can download pre-trained word embeddings.

"That's pretty cool, but I guess what was a little unintuitive to me is that this means that if I train on some corpus, I don't know, the works of Shakespeare, somehow that tells me something about how to understand movie reviews. I imagine that in some sense that's true about how language is structured, but the meaning of the word dog in Shakespeare is probably going to be used pretty differently." We're getting to that now. The word vectors I'm going to be using, and I don't strongly recommend but slightly recommend them, are the GloVe word vectors; the other main competition to these is the word2vec word vectors. The GloVe word vectors come from a researcher named Jeffrey Pennington at Stanford; the word2vec word vectors come from Google. I believe I've mentioned before that the TensorFlow documentation on word2vec is fantastic, so I would definitely recommend checking that out. The GloVe word vectors have been pre-trained on a number of different corpora. One of them has been pre-trained on all of Wikipedia plus a huge database of newspaper articles, a total of six billion words covering a 400,000-word vocabulary, and they provide 50-dimensional, 100-dimensional, 200-dimensional, and 300-dimensional pre-trained vectors. They have another one which has been trained on 840 billion words from a huge dump of the entire internet, and another which has been trained on two billion tweets, from which I believe all of the Donald Trump tweets have been carefully cleaned out prior to usage. In my case, I've downloaded the six-billion-token version, and I'll show you what one of these looks like.

"Are you losing context because of things like capital letters and punctuation?" We'll look at that in a moment. Sometimes these are cased: you can see, for example, this particular one includes case, and there are 2.2 million items of vocabulary in it. Sometimes they're uncased. We'll look at punctuation in a moment too. Here is the start of the GloVe 50-dimensional word vectors trained on the corpus of six billion tokens. Here is the word "the", and here are the 50 floats which attempt to capture all of the information in that word. Here is punctuation: the full stop, and here are the 50 floats that attempt to capture all of the information captured by a full stop. Here is the word "in", here is the double quote, here is apostrophe-s. So you can see that the GloVe authors have tokenized their text in a very particular way, and the idea that apostrophe-s should be treated as a thing makes a lot of sense; it certainly has that thinginess in the English language. Indeed, the way the authors of a word embedding corpus have chosen to tokenize their text definitely matters, and one of the things I quite like about GloVe is that they've been pretty smart, in my opinion, about how they've done this.
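As an aside, those GloVe files are plain text and trivial to parse yourself. Here's a minimal sketch, assuming the standard glove.6B.50d.txt download, where each line is one token followed by its 50 floats; it returns the same three things (vectors, word list, word index) as the pre-processed loader I mention in a moment:

```python
import numpy as np

def load_glove_txt(path):
    """Parse a raw GloVe text file into (vectors, words, word->index dict)."""
    words, vecs = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            words.append(parts[0])                         # token: "the", ".", "'s", ...
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    wordidx = {w: i for i, w in enumerate(words)}
    return np.stack(vecs), words, wordidx

vecs, words, wordidx = load_glove_txt('glove.6B.50d.txt')
print(vecs.shape)             # (400000, 50)
print(vecs[wordidx['dog']])   # the 50 floats for "dog"
```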
"What was the target variable?" So the question is: how does one create word vectors in general? What is the model you're building, and what are the labels you're training against? One of the things we've talked about getting to at some point is unsupervised learning, and this is a great example of it. We want to take 840 billion tokens of an internet dump and build a model of something. So what do we build a model of? This is a case of unsupervised learning: we're trying to capture some structure in this data, in this case how English looks, works, and feels.

The way this is done, at least in the word2vec example, is quite cool. They take every 11-word-long string of words that appears in the corpus, so not just every sentence, but every 11-word window. The first thing they do is create an exact copy of it, and then in the copy they delete the middle word and replace it with some random word. So we now have two strings of 11 words: one of which makes sense, because it's real, and one of which probably doesn't, because the middle word has been replaced with something random. The model task they create is: the label is one if it's a real sentence, and zero if it's a fake sentence. That's the task they give it. So you can see it's not a directly useful task in any way, unless somebody actually comes along and says "I just found this corpus in which somebody's replaced half of the middle words with random words"; but it is something where, in order to tackle it, you're going to have to know something about language. You're going to have to be able to recognize that this sentence doesn't make sense and this sentence does. So this is a great example of unsupervised learning. Generally speaking, in deep learning, unsupervised learning means coming up with a task which is as close as possible to the task you're eventually going to be interested in, but which doesn't require labels, or where the labels are really cheap to generate.
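Just to make that labeling trick concrete, here's a sketch on a toy corpus; the window length of 11 is from the description above, everything else is illustrative:

```python
import random

def make_examples(tokens, window=11):
    """Label 1 for a real window of text, 0 for a copy with a corrupted middle word."""
    examples = []
    mid = window // 2
    for i in range(len(tokens) - window + 1):
        real = tokens[i:i + window]
        fake = list(real)
        fake[mid] = random.choice(tokens)    # replace the middle word at random
        examples.append((real, 1))           # real sentence
        examples.append((fake, 0))           # fake sentence
    return examples

corpus = ("in order to tackle this task you are going to have to know "
          "something about language and how sentences are put together").split()
for sent, label in make_examples(corpus)[:4]:
    print(label, ' '.join(sent))
```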
Just behind you, Rachel. "Thinking about the language aspect: we're talking about just English here, just a stream of tokens that we're turning into vectors of floats. How does this change across languages, or for mixed-language text?" It turns out that the embeddings that get created when you look at, say, Hindi and Japanese turn out to be nearly the same, and so one way to translate language is to create a bunch of word vectors in English for various words, then create a bunch of word vectors in, say, Japanese, and then, when you want to translate a word, which might be "queen", you basically look up the nearest word in the same vector space in the Japanese corpus. And it turns out this works. It's a fascinating thing about language. In fact, Google has just announced that they've replaced Google Translate with a neural translation system, and part of what that is doing is basically this. Here are some interesting examples of word embeddings: the word embeddings for king and queen have the same distance and direction as the word embeddings for man and woman; ditto for walking versus walked and swimming versus swam; and ditto for Spain versus Madrid and Italy versus Rome. So the embeddings that have to get learned in order to solve this stupid, meaningless random-sentence task are quite amazing.

I've actually downloaded those GloVe embeddings and pre-processed them, and I'm going to upload them for you shortly in a form that's going to be really easy to use in Python. I've created this little function called load_glove which loads the pre-processed files, and it gives you three things: the word vectors, which is a 400,000 by (in this case) 50-dimensional matrix; a list of the words, and here they are, "the", comma, full stop, "of", "to"; and a dictionary of word indexes. So you can now take a word, look up its index, and get back its 50-dimensional array. Then I drew a picture. To turn a 50-dimensional vector into something two-dimensional that I can plot, we have to do something called dimensionality reduction, and there's a particular technique (the details don't really matter) called t-SNE, which attempts to take your high-dimensional information and plot it in two dimensions such that things that were close in the 50 dimensions are still close in the two dimensions. So I used t-SNE to plot the 350 most common words, and here they all are. You can see that punctuation appears close together; numerals appear close together; written versions of numbers are close together; seasons, games, leagues, and played are all close together; various things about politics; school and university; president, general, prime minister, and bush. Now, this is a great example of where a two-dimensional t-SNE projection is misleading about the level of complexity that's actually in these word vectors: in a different projection, bush would be very close to tree. The two-dimensional projection is losing a lot of information; the true detail here is a lot more complex than us mere humans can see on a page. But hopefully you get a sense of it. All I've done here is taken those 50-dimensional word vectors and plotted them in two dimensions.

So you can see that when you learn a word embedding, you end up with something we've now seen before: just as we were able to plot some movies in two dimensions and see how they relate to each other, we can do the same thing for words. In general, when you have some high-cardinality categorical variable, whether it be lots of movies or lots of reviewers or lots of words or whatever, you can turn it into a useful lower-dimensional space using this very simple technique of creating an embedding.
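Here's roughly what producing that t-SNE plot takes: a sketch assuming vecs and words from the loading snippet earlier, and that the GloVe vocabulary is ordered by frequency (so the first 350 rows are the most common tokens):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# squash the first 350 50-dimensional vectors down to two dimensions
Y = TSNE(n_components=2, random_state=0).fit_transform(vecs[:350])

plt.figure(figsize=(12, 12))
plt.scatter(Y[:, 0], Y[:, 1], s=3)
for (x, y), word in zip(Y, words[:350]):
    plt.text(x, y, word, fontsize=8)   # label each point with its token
plt.show()
```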
"The explanation of how unsupervised learning was used in word2vec was pretty smart; how was it done in GloVe?" I don't recall exactly how it was done in GloVe; I believe it was something similar. I should mention, though, that both GloVe and word2vec did not use deep learning: they actually built linear models, and the reason they did that was that they specifically wanted to create representations with these kinds of linear relationships, because they felt that would be a useful characteristic of the representations. I'm not sure anybody has tried to create a similarly useful representation using a deeper model, or whether that would turn out to be better; obviously, the linear models save a lot of computation time as well. The embeddings, however, even though they were built with linear models, can now be used as inputs to deep models, which is what we're about to do.

Just behind you, Rachel. "The Google SyntaxNet model that just came out: is that the one you were mentioning before?" No, I was mentioning word2vec, and word2vec has been around for a couple of years. SyntaxNet is a whole framework; I think the one you mean is called Parsey McParseface. That's the one where they claim 97% accuracy, and it also returns parts of speech: you feed it text, and it will tell you this is the thing and this is the action. All of these word vectors do all of those things. In that high-dimensional space, for example, you can see there's information about tense (walking versus walked), so it's very easy to take a word vector and use it to create a part-of-speech recognizer: you just need a fairly small labeled corpus, and it's actually pretty easy to download a rather large labeled corpus and build a simple model that goes from word vector to part of speech. There's a really interesting paper called "Exploring the Limits of Language Modeling". That Parsey McParseface thing got far more PR than it deserved; it was not really an advance over the state-of-the-art language models of the time. But since then there have been some much more interesting things, and one of them is this "Exploring the Limits of Language Modeling" paper, which looks at what happens when you take a very, very large dataset and spend shitloads of Google's money on lots and lots of GPUs for a very long time; and they got some genuinely massive improvements to the state of the art in language modeling. In general, when we're talking about language modeling, we're talking about everything from "is this a noun or a verb", to "is this a happy sentence or a sad sentence", to "is this formal or informal speech", and so on: all of these things that NLP researchers do, we can now do super easily with these embeddings. "And under the hood, did it use an optimizer that automatically generated all of those vectors?" This paper uses two techniques, one of which you know and one of which you're about to know: convolutional neural networks, and recurrent neural networks, specifically a type called LSTM; you can check out the paper to see how they compare. Since that time there's been an even newer paper that has furthered the state of the art in language modeling, and it uses a convolutional neural network, so right now CNNs with pre-trained word embeddings are the state of the art.

So, given that we can now download these pre-trained word embeddings, that leads to the question: why were we using randomly initialized word embeddings when we did our sentiment analysis? That doesn't seem like a very good idea, and indeed it's not remotely a good idea; you should never do that. From now on, you should always use pre-trained word embeddings any time you do NLP, and over the next few weeks we'll be gradually making this easier and easier. At this stage it requires slightly less than a screen of code. You have to load the embeddings off disk, creating your word vectors, your words, and your word indexes. The next thing you have to do is deal with the fact that the word indexes that come from GloVe are going to be different from the word indexes in your vocabulary: in our case, a particular index was the word "bromwell", but in the GloVe case that index is probably not the word "bromwell". So this little piece of code is simply mapping from one index to the other. The create-embedding function is then going to build an embedding matrix where the indexes are the indexes from the IMDB dataset and the embeddings are the embeddings from GloVe. And that's what emb now contains: this embedding matrix is the GloVe word vectors, indexed according to the IMDB dataset.
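A sketch of that index-mapping step; idx2word (IMDB index to word string) is an assumed helper, vecs and wordidx come from GloVe, and the real notebook's code may differ in details:

```python
import numpy as np

def create_emb(idx2word, vecs, wordidx, vocab_size=5000, n_factors=50):
    """Build a matrix whose row i is the GloVe vector for IMDB word index i."""
    emb = np.zeros((vocab_size, n_factors), dtype=np.float32)
    for i in range(1, vocab_size):
        word = idx2word.get(i, '')
        if word.lower() in wordidx:
            emb[i] = vecs[wordidx[word.lower()]]
        else:
            # no GloVe vector for this token (see below): small random values instead
            emb[i] = np.random.normal(scale=0.6, size=n_factors)
    return emb

emb = create_emb(idx2word, vecs, wordidx)
```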
So now I've simply copied and pasted the previous code and added this weights= parameter with my pre-trained embeddings. Since we think these embeddings are pretty good, I've set trainable to False. I won't leave it at False, because we're going to need to fine-tune them, but we'll start there. One particular reason we can't leave it at False is that sometimes I had to create a random embedding, because sometimes the word I looked up didn't exist in GloVe. For example, for anything that finishes with apostrophe-s, GloVe tokenizes the apostrophe-s and the word as separate tokens, but in IMDB they were combined into one token, so there are no GloVe vectors for those; I just randomly created embeddings for anything I couldn't find in the GloVe vocabulary. But for now, let's start with just the embeddings as given: we set them to non-trainable, and we train a convolutional neural network using those embeddings on the IMDB task. After two epochs we have 89.8%; previously, with random embeddings, we had 89.5%, and the academic state of the art was 88.3%. So we've made a significant improvement. Let's now go ahead and set the first layer's trainable to True, decrease the learning rate a bit, and do just one more epoch, and we're now up to 90.1%. So we've got way beyond the academic state of the art here. We're kind of cheating, because we're not just building a model anymore: we're using a pre-trained word embedding that somebody else has provided for us. But why would you ever not do that, now that it exists? So you can see we've had a big jump, and furthermore it's only taken us 12 seconds to train this network. To recap: we started with pre-trained word embeddings; we set them initially to non-trainable in order to train just the layers that use them; we waited until that was stable, which took two epochs; and then we set them to trainable and did one more little fine-tuning epoch. This kind of three-epoch approach is likely to work for a lot of the NLP problems you'll find in the world.

"Do I need to compile the model after resetting the input layer to trainable=True?" No, you don't, because the architecture of the model has not changed in any way; it's just changed some metadata attached to it. There's never any harm in compiling the model, though. Sometimes, if you forget to compile, it just continues to use the old model, so best to err on the side of compiling.
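Here's a sketch of that whole recipe in Keras-1-style code. The architecture and the data variables (trn, trn_labels, test, test_labels) are stand-ins rather than the lecture notebook, and assigning to model.optimizer.lr is how the course notebooks typically lower the learning rate:

```python
from keras.models import Sequential
from keras.layers import Embedding, Convolution1D, MaxPooling1D, Flatten, Dense, Dropout

model = Sequential([
    # start from the GloVe-initialized matrix, frozen
    Embedding(5000, 50, input_length=500, weights=[emb], trainable=False),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# 1) train the new layers while the embeddings stay fixed
model.fit(trn, trn_labels, validation_data=(test, test_labels), nb_epoch=2)

# 2) unfreeze the embeddings, drop the learning rate, fine-tune one more epoch
model.layers[0].trainable = True
model.optimizer.lr = 1e-4
model.fit(trn, trn_labels, validation_data=(test, test_labels), nb_epoch=1)
```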
Something I thought was pretty cool: during the week, one of our students here had an extremely popular post that appeared all over the place; I saw it on the front page of Hacker News. It talks about how his company, Quid, uses deep learning, and, very happy to see, with small data, which is what we're all about. For those of you who don't know it, Quid is a company, quite a successful startup actually, that processes millions and millions of documents, things like patents and so on, and provides enterprise customers with really cool visualizations and interactive tools that let them analyze huge datasets. This is by Ben Bowles, one of our students here, and he talked about how he compared three different approaches to a particular NLP classification task, one of which involved some pretty complex, slow-to-develop, carefully engineered features; but model three in his example was a convolutional neural network. I think this is pretty cool, and I was hoping to talk to Ben about this piece of work. Ben, could you give us a little bit of context on what you were doing in this project?

"Yeah, so the task is about detecting marketing language in company descriptions. It has the flavor of being very similar to sentiment analysis: you have two classes of things that are different in some semantic way. You can see some examples here: one was like 'our patent-pending support system is engineered to...', which was your more marketing style, I guess, and 'spatial scanning software for mobile devices' is your more informative one. The semantics of the marketing language is like, oh, this is exciting; there are certain kinds of meanings around which the marketing tends to cluster, and I sort of realized, hey, this would be kind of a nice task for deep learning." How was your dataset labeled in the first place? "Basically by a couple of us at the company. We just found some good ones and found some bad ones and then literally tried it out. It's as hacky as you could possibly imagine, super scrappy, but it actually ended up being very useful for us. I think that's kind of a nice lesson: sometimes scrappy gets you most of the way you need. If you're thinking about how to get the data for your project, well, you can actually just create it." Exactly, and I love this lesson. When I talk to big enterprise executives, they're all about their five-year metadata and data-lake repository infrastructure program, at the end of which maybe they'll actually try to get some value out of it; whereas startups are like, okay, what have we got that we can do by, like, Monday? Let's throw it together and see if it works. And the latter approach is so much better, because by Monday you know whether it looks good, you know which kinds of things are important, and you can decide how much it's worth investing in. So that's cool.

One of the things I wanted to show is that Ben's convolutional neural network did something pretty neat, and I wanted to use the same trick for our convolutional neural network: a multi-size CNN. I mentioned earlier that when I built my CNN, I tried a filter size of five and found it better than three. What Ben points out in his blog post is that there's a neat paper describing something interesting: not just using one size of convolution, but trying several sizes. And you can see here, this is a great use of the functional API. I haven't used exactly Ben's code, I've rewritten it a little, but it's the same concept: let's try size-3, size-4, and size-5 convolutional filters. So we create a 1D convolutional layer of size three, then size four, then size five; for each one, using the functional API, we add max pooling, flatten it, and add it to a list of these different convolutions; and at the end, we merge them all together by simply concatenating them. So we now have a single vector containing the results of the size-3, size-4, and size-5 convolutions.
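Here's a sketch of that block in the Keras 1 functional API; it follows the description above rather than Ben's or my exact code, and the filter count and input shape are assumptions:

```python
from keras.layers import Input, Convolution1D, MaxPooling1D, Flatten, merge
from keras.models import Model

graph_in = Input(shape=(500, 50))    # 500-word sequences of 50-dim embeddings
convs = []
for fsz in (3, 4, 5):
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)

# concatenate the three scales into one long vector
out = merge(convs, mode='concat')
graph = Model(graph_in, out)         # a reusable sub-model
```

The sub-model can then be dropped into a Sequential model in place of the single Convolution1D/MaxPooling1D pair, which is exactly what happens next.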
We then return that whole thing as a little sub-model, which in Ben's code he called graph. The reason, I assume, he called it graph is that people tend to think of these things as computational graphs: a computational graph is basically a computation expressed as various inputs and outputs, which you can think of as a graph. So once you've got this little multi-size convolution module, you can stick it inside a standard Sequential model by simply replacing the Convolution1D and max-pooling piece with graph, where graph is the concatenated version of all these different scales of convolution. Trying this out, I got a slightly better answer again: 90.36%. I hadn't seen that paper before, so thank you for the great idea. Did you have anything to add about this multi-scale convolution idea? "Not really, other than that I think it's super cool. Actually, I'm still trying to figure out all the ins and outs of exactly how it works; in some ways, implementation is easier than understanding." That's exactly right. With a lot of these things, the math is ridiculously simple; then you throw it at SGD and let it do billions and billions of calculations, and what it comes up with is kind of hard to grasp. "You're using capital-M Merge in this example; do you want to talk about that?" Not really; Ben used capital-M Merge and I just did the same thing. I'll post a link about it. If it were me, I would have used lowercase-m merge, so we'll have to agree to disagree; okay, let's not go there. So I think that's super fun.

We have a few minutes to talk about something enormous, so we're going to do a brief introduction to RNNs, and next week we'll do a deep dive. Everything we've learned so far about convolutional neural networks does not necessarily do a great job of solving a problem like: how would you model this? Notice that this markup, whatever it is, I'm not quite sure, requires the model to recognize when you have a start tag and know to close that tag; and then, over a long stretch, to know that it's inside a weird XML-ish comment thing, and that it has to finish off that comment. That means it has to keep memory about what happened in the distant past if it's going to successfully model data that looks like this. Also think about these two sentences: they both mean effectively the same thing, but to realize that, you're going to have to keep some kind of state that knows, after this part has been read in, that you're now talking about something that happened in 2009, and you then have to remember it all the way to here, to know when it was that this thing happened that you did in Nepal. So we want some kind of stateful representation, something with memory, so that it can handle long-term dependencies. Furthermore, if we're going to deal with big, long pieces of language with a lot of structure, it would be nice to handle variable-length sequences, so that we can deal with some things that are really long and some that are really short. These are all things which convolutional neural networks don't necessarily do well, so we're going to look at something else that handles them well: the recurrent neural network.
Here is a great example of a good use of a recurrent neural network. At the top, you can see there's a convolutional neural network looking at images of house numbers, and these images come from really big Google Street View pictures, so it has to figure out which part of the image to look at next in order to read the house number. You can see a little square box scanning through, figuring out "I want to look at this piece next", and at the bottom you can see what it's actually seeing after each time step. The thing that figures out where to look next is a recurrent neural network: something that takes its previous state and figures out what its next state should be. This kind of model is called an attentional model, and it's a really interesting avenue of research for dealing with things like very large images, images which might be too big for a single convolutional neural network under current hardware constraints. On the left is another great example of a useful recurrent neural network: the very popular Android and iOS text-entry system SwiftKey. SwiftKey had a post up a few months ago announcing that they had replaced their language model with a neural network of this kind, which looks at your previous words, figures out which word you're likely to be typing next, and predicts it. A final example: Andrej Karpathy showed a really cool thing where he was able to generate random mathematical papers by generating random LaTeX. To generate random LaTeX, you actually have to learn things like \begin{proof} and \end{proof}, exactly these kinds of long-term dependencies, and he was able to do that successfully; so this is a randomly generated piece of LaTeX created with a recurrent neural network.

Today I'm not going to show you exactly how RNNs work; I'm going to try to give you an intuition, and I'm going to start by showing you how to think about neural networks as computational graphs. This is coming back to that word Ben used earlier, this idea of a graph. So I started by trying to draw one. This is my notation; you won't see it anywhere else, but it'll do for now. Here is a picture of a basic neural network with a single hidden layer. We can think of it as having an input, which is going to be of size batch size by number of inputs. Each of the boxes represents a matrix, and each of the arrows represents one or more things we do to that matrix: in this case, the orange arrow is a matrix product followed by a rectified linear unit. Then we get a circle, which is also a matrix, but it's now a hidden layer, of size batch size by number of activations; the number of activations is just the n we wrote when we created that dense layer, Dense(n). Then we put that through another operation, in this case a matrix product followed by a softmax, and the triangle represents an output matrix, of size batch size by, if it's ImageNet, a thousand, say. So this is my little way of representing the computational graph of a basic neural network with a single hidden layer.
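In code, that graph is just two matrix products. A tiny numpy sketch with made-up sizes, assuming ReLU for the hidden layer and softmax for the output as in the diagram (biases omitted for brevity):

```python
import numpy as np

batch_size, n_inputs, n_acts, n_classes = 64, 784, 100, 10
x = np.random.randn(batch_size, n_inputs)          # input box
W1 = np.random.randn(n_inputs, n_acts) * 0.01      # first arrow's weight matrix
W2 = np.random.randn(n_acts, n_classes) * 0.01     # second arrow's weight matrix

h = np.maximum(0, x @ W1)                          # matrix product + ReLU -> circle
z = h @ W2                                         # matrix product -> triangle
out = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # softmax over the classes
```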
I'm now going to draw some slightly more complex models, but I'm going to slightly reduce the amount of stuff on the screen. One thing to note is that batch size appears everywhere, so I'm going to get rid of it: here is the same thing with batch size removed. Also, the specific activation function, who gives a shit: it's probably ReLU everywhere except the last layer, where it's softmax, so I've removed that as well. So now let's look at what a convolutional neural network with a single dense hidden layer would look like. We'd have our input, which this time (remembering I've removed batch size) will be number of channels by height by width. The operation, ignoring the activation function, is going to be a convolution followed by a max pool. Remember, every shape represents a matrix, so that gives us a matrix of size number of filters by height over 2 by width over 2, since we did max pooling. Then we take that and flatten it. I've put "flatten" in parentheses because flattening does nothing at all mathematically: it doesn't calculate anything, it doesn't move anything, it just tells Keras to think of the matrix as being a different shape, a vector. That's why it's in parentheses. Then we take a matrix product (and remember, I'm not showing activation functions anymore), which is our dense layer; that gives us our first fully connected layer, of size number of activations; and then we put that through a final matrix product to get an output of size number of classes. So here is how we can represent a convolutional neural network with a single dense hidden layer. The number of activations, again, is whatever n was when we wrote Dense(n), just like the number of filters is what we specify when we write Convolution2D: the number of filters followed by their size.

Now I'm going to create a slightly more complex computational graph, and again I'm going to simplify what I put on the screen, this time by removing all of the layer-operation labels. Now that we've removed the activation functions, in every case we basically have some kind of linear thing, either a matrix product or a convolution, with optionally a max pool, so the labels aren't adding much information. So from now on, every arrow represents one or more layer operations, which will generally be a convolution or a matrix product, followed by an activation function, and maybe with a max pooling in there as well.

Let's say we wanted to predict the third word of a three-word string based on the previous two words. There are all kinds of ways we could do this, but here's one interesting way, which you'll recognize you could build with Keras's functional API. We could take the word-1 input, which could either be one-hot encoded, in which case its size would be vocab size, or an embedding of it; it doesn't really matter which. We stick that through a layer operation to get a matrix output, which is our first fully connected layer. We could then take that and put it through another layer operation, but this time also add in the word-2 input (again, either of vocab size or an embedding), put through a layer operation of its own. When we have two arrows coming in together, that represents a merge, and a merge could be done either as a sum or as a concat; I'm not going to say one is better than the other, they're just two ways to combine two input vectors. So at this point we have the input from word 2 after sticking it through one layer, and the input from word 1 after sticking it through two layers; we merge them together and stick the result through another layer to get our output, which we can then compare to word 3, and train the whole thing to predict word 3 from words 1 and 2. You could try this: try building this network with some corpus you find online and see how it goes.
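Here's a rough sketch of that graph in the Keras 1 functional API. All the sizes, the shared embedding, and the choice of sum for the merge are my assumptions, just to show the shape of the thing:

```python
from keras.layers import Input, Embedding, Flatten, Dense, merge
from keras.models import Model

vocab_size, n_fac, n_hidden = 5000, 50, 256

w1_in, w2_in = Input(shape=(1,)), Input(shape=(1,))
emb = Embedding(vocab_size, n_fac, input_length=1)      # shared embedding
e1, e2 = Flatten()(emb(w1_in)), Flatten()(emb(w2_in))

h1 = Dense(n_hidden, activation='relu')(e1)             # word 1 through one layer
h2 = merge([Dense(n_hidden, activation='relu')(h1),     # word 1, now two layers deep
            Dense(n_hidden, activation='relu')(e2)],    # word 2 through its own layer
           mode='sum')                                  # two arrows in = a merge
out = Dense(vocab_size, activation='softmax')(h2)       # compare this to word 3
model = Model([w1_in, w2_in], out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```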
Pretty obviously, you could then bring it up another level: let's try to predict the fourth word using words one, two, and three. The reason I'm drawing it this way, going through another layer operation each time, then bringing in word 2 and going through a layer operation, then bringing in word 3 and going through a layer operation, is that I'm collecting state. Each of these matrices has the ability to capture state about all of the words that have come so far and the order in which they arrived, so by the time I get to predicting word 4, this matrix has had the opportunity to learn whatever it needs to know about the previous words, their orderings, and how they're connected to each other, in order to predict that fourth word. So we're actually capturing state here. And it's important to note: we haven't previously built a model in Keras with input coming in anywhere other than the first layer, but there's no reason we can't. One of you asked a great question earlier: could we use this to bring in metadata, like the speed a car was going, alongside a convolutional network's image data? I said yes, we can, and here we're doing the same thing: bringing in an additional word's worth of data. Remember, every time you see two arrows coming into a shape, that represents a merge operation. So here's a perfectly reasonable way of trying to predict the fourth word from the previous three.

This leads to a really interesting question. What if, instead, we brought in word 1 and did a layer operation to create our hidden state, and that would be enough to predict word 2; and then, to predict word 3, could we just do a layer operation from the hidden state back to itself, and use that to predict word 3; then run it again to predict word 4, and again to predict word 5? This is called an RNN, and everything you see here is structurally exactly the same as everything I've shown before: the colored-in shapes represent matrices and the arrows represent layer operations. One of the really interesting things about an RNN is that each kind of arrow has only one weight matrix attached to it. In other words, every time you see an arrow from a circle to a circle, those weight matrices have to be exactly the same; every time you see an arrow from a rectangle to a circle, those matrices have to be exactly the same; and finally, the arrow from a circle to a triangle has its own separate weight matrix. The idea is: if you have a word coming in and being added to some state, why would you want to treat it differently depending on whether it's the first word in a string or the third word in a string, given that, generally speaking, we split up strings pretty much at random anyway, into a whole bunch of, say, 11-word strings? One of the nice things about this way of thinking about it, where the hidden state loops back to itself, is that you can very clearly see there's one layer operation, one weight matrix, for input to hidden (rectangle to circle), one for hidden to hidden (circle to circle), and one for hidden to output (circle to triangle).
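That weight tying is easy to see in code. A minimal numpy sketch, with made-up sizes and tanh as the activation, of one matrix per arrow type, reused at every step:

```python
import numpy as np

n_in, n_hidden, n_out = 50, 100, 5000
W_ih = np.random.randn(n_in, n_hidden) * 0.01      # rectangle -> circle
W_hh = np.random.randn(n_hidden, n_hidden) * 0.01  # circle -> circle
W_ho = np.random.randn(n_hidden, n_out) * 0.01     # circle -> triangle

h = np.zeros(n_hidden)
for x in np.random.randn(3, n_in):     # three word vectors arriving in turn
    # the SAME two matrices are applied at every time step
    h = np.tanh(x @ W_ih + h @ W_hh)

logits = h @ W_ho                      # predict the next word
```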
We're going to talk about that in a lot more detail next week. For now, I'm just going to quickly show you something in the last minute, which is that we can train a model on, for example, all of the text of Nietzsche. Here's a bit of his text; I've just read it in, and I've split it up into every sequence of length 40: I've gone through the whole text and grabbed every 40-character sequence. Then I've created an RNN whose goal is to take the sequence of character indexes from i to i+40 and predict the sequence from i+1 to i+41; that is, for every string of length maxlen, I'm trying to predict the string one character after it. So I can create a model which has an LSTM, which is a kind of recurrent neural network (we'll talk about it next week), starting of course with an embedding, and train it by passing in my sequences along with the same sequences one character later. Then I can generate 300 characters by repeatedly predicting what the next character would be. I have to seed it with something, so I seeded it with something that, I don't know, felt very Nietzsche: "ethics is a basic foundation of all that". And see what happens. After training for only a few seconds, you can get the sense that it's starting to learn a little bit. By the way, one thing to mention: this Nietzsche corpus is slightly annoying in that it has carriage returns after every line, so you'll see the model throw carriage returns in all over the place, and it's got some pretty hideous formatting. So that was after training for about 30 seconds. I train it for another 30 seconds, and I get to a point where it's kind of understanding the concept of punctuation and spacing. Then I've trained it for 640 seconds, and it's starting to create real words. After another 640 seconds: interestingly, each section of Nietzsche starts with a kind of numbered heading that looks exactly like this, and it's even starting to learn to close its quotation marks; it has also noticed that the start of a chapter always has three lines, so it's learned to start chapters. After another 640 seconds, and another, it's actually got to the point where it's saying things so obscure and difficult to understand that it could really be Nietzsche.
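A sketch of that character model in Keras-1-style code; the embedding size, LSTM width, and vocabulary size are assumptions, but the shape of the task, a 40-character sequence in and the same sequence shifted one character out, is as described above:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

vocab_size, maxlen = 60, 40          # distinct characters, sequence length
model = Sequential([
    Embedding(vocab_size, 24, input_length=maxlen),
    LSTM(512, return_sequences=True),                      # one output per character
    TimeDistributed(Dense(vocab_size, activation='softmax')),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# seqs[i] would hold characters i..i+40, shifted_seqs[i] characters i+1..i+41:
# model.fit(seqs, np.expand_dims(shifted_seqs, -1), nb_epoch=1)
```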
These char-RNN models are fun and all, but the reason this is interesting is what it demonstrates: we provided only that much seed text, and it was able to generate all of this text after it, because it has state, it has recurrence. What that means is that we could use this kind of model to build something like SwiftKey: as you're typing, it predicts the next thing you're going to type. I would love you to think about, during the week, whether this is likely to help our IMDB sentiment model or not; that'll be an interesting thing to talk about. And next week we'll look into the details of how RNNs work. Thanks, gang.