Welcome to the last lesson of part one of Practical Deep Learning for Coders. It's been a really fun time doing this course. Depending on when you're watching and listening to this, you may wanna check the forums or the fast.ai website to see whether we have a part two planned, which is gonna be sometime towards the end of 2022. Or if it's already past that, then maybe there's even a part two already on the website. Part two goes a lot deeper than part one technically, in terms of getting to the point that you should be able to read and implement research papers and deploy models in a very real-life situation. So, last lesson we started on the collaborative filtering notebook, and this is where we got to, which is creating your own embedding module. This is a very cool place to start the lesson, because you're gonna learn a lot about what's really going on. And it's really important, before you dig into this, to make sure that you're really comfortable with the 05 linear model and neural net from scratch notebook. If parts of this are not totally clear, put it aside and redo that notebook, because what we're looking at from here are the abstractions that PyTorch and fastai add on top of functionality that we've built ourselves from scratch. If you remember, in the neural network we built from scratch, we initialized a number of coefficients, a couple of different layers and a bias term. Then, as the model trained, we updated those coefficients by going through each layer of them and subtracting out the gradients multiplied by the learning rate. You've probably noticed that in PyTorch we don't have to go to all that trouble, and I wanted to show you how PyTorch does this. In PyTorch, we don't have to keep track of what our coefficients or parameters or weights are. PyTorch does that for us.
And the way it does that is it looks inside our module and tries to find anything that looks like a neural network parameter, or a tensor of neural network parameters, and it keeps track of them. So here is a class we've created called T, which is a subclass of Module. I've created one thing inside it, which is an attribute called a. So this is a in the T module, and it just contains three ones. The idea is that maybe we're creating a module and we're initializing some parameter that we want to train. Now, we can find out what trainable parameters, or just what parameters in general, PyTorch knows about in our model by instantiating our model and then asking for the parameters. You then have to turn that into a list, or in fastcore we have a thing called L, which is like a fancy list that prints out the number of items in the list and shows you those items. In this case, when we create our object of type T and ask for its parameters, we get told there are zero tensors of parameters, and a list with nothing in it. Now why is that? We actually said we wanted to create a tensor with three ones in it. How would we make those parameters? Well, the way you tell PyTorch what your parameters are is you just have to put them inside a special object called an nn.Parameter. This thing almost doesn't do anything. In fact, last time I checked, it quite literally had almost no code. Sometimes these things change, so let's take a look. Yeah, okay, so it's about a dozen or 20 lines of code, which does almost nothing. It's got a way of being copied, a way of printing itself, a way of saving itself, and a way of being initialized. So a parameter hardly does anything.
The key thing, though, is that when PyTorch checks to see which parameters it should update when it optimizes, it just looks for anything that's been wrapped in this Parameter class. So if we do exactly the same thing as before, which is to set an attribute containing a tensor with three ones in it, but in this case we wrap it in a Parameter, we now get told there's one parameter tensor in this model, and it contains a tensor with three ones. You can see it also, by default, assumes that we're gonna want requires_grad. It's assuming that anything that's a parameter is something that you wanna calculate gradients for. Now, most of the time we don't have to do this, because PyTorch provides lots of convenient things for us, such as what you've seen before, nn.Linear, which is something that also creates a tensor. So this would create a tensor of one by three, without a bias term in it. We have not wrapped this in an nn.Parameter ourselves, but that's okay. PyTorch knows that anything which is basically a layer in a neural net is gonna hold parameters, so it automatically picks those up. So here's exactly the same thing again. I'll construct my object of type T and check for its parameters, and I can see there's one tensor of parameters, and there's our three things. You'll notice that it's also automatically randomly initialized them, which again is generally what we want. So PyTorch does go to some effort to try to make things easy for you. This attribute a is a linear layer, and it's got a bunch of things in it. One of the things in it is the weights, and that's where you'll actually find the parameters, that is, something of type Parameter. So a linear layer is something that contains attributes of type Parameter. Okay, so what we wanna do is create something that works just like this did, which is something that creates a matrix which will be trained as we train the model.
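To make the three cases above concrete, here is a minimal sketch in plain PyTorch. The class names `PlainTensor`, `Wrapped` and `WithLinear` are made up for illustration (the lecture reuses a single class called `T`), and `list(...)` is used where the lecture uses fastcore's `L`.

```python
import torch
from torch import nn

class PlainTensor(nn.Module):
    """A bare tensor attribute: PyTorch does not register it as a parameter."""
    def __init__(self):
        super().__init__()
        self.a = torch.ones(3)

class Wrapped(nn.Module):
    """The same tensor wrapped in nn.Parameter: now it is registered."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.ones(3))

class WithLinear(nn.Module):
    """A layer attribute: the layer's own Parameter attributes are picked up."""
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(1, 3, bias=False)

plain_params = list(PlainTensor().parameters())    # nothing registered
wrapped_params = list(Wrapped().parameters())      # one tensor of three ones
linear_model = WithLinear()
linear_params = list(linear_model.parameters())    # the layer's weight matrix

print(len(plain_params), len(wrapped_params), len(linear_params))
print(type(linear_model.a.weight))
```

Note that `nn.Linear(1, 3, bias=False)` stores its weight with shape `(3, 1)`, that is, outputs by inputs, so it still holds the three randomly initialized values described above.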
Okay, so an embedding is something which is gonna create a matrix of this by this. It will be a parameter, and it's something that we need to be able to index into, as we did here. So what is happening behind the scenes in PyTorch? It's nice to be able to create these things ourselves from scratch, because it means we really understand it. So let's create that exact same module that we did last time, but this time we're going to use a function I've created called create_params. You pass in a size, such as in this case n_users by n_factors. It's gonna call torch.zeros to create a tensor of zeros of the size that you request, and then it's going to apply a normal random distribution, so a Gaussian distribution of mean zero and standard deviation 0.01, to randomly initialize those. And it'll put the whole thing into an nn.Parameter. So this here is gonna create an attribute called user_factors, which will be a parameter containing a tensor of normally distributed random numbers of this size. And because it's a parameter, it's gonna be stored inside the module and be available in its parameters. So user_bias will be a vector of parameters. user_factors will be a matrix of parameters. movie_factors will be a matrix of n_movies by n_factors. movie_bias will be a vector of n_movies. And this is the same as before. So now, in the forward, we can do exactly what we did before. The thing is, when you put a tensor inside a Parameter, it has all of the exact same features that a tensor has. So for example, we can index into it. So this whole thing is identical to what we had before. And that's actually, believe it or not, all that's required to replicate PyTorch's embedding layer from scratch. So let's run those and see if it works. And there it is, it's training. When this is done we'll be able to have a look at, for example, model.movie_bias.
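As a rough reconstruction of what is being described, here is a sketch of `create_params` and the from-scratch dot-product-with-bias module in plain PyTorch. The notebook itself is not reproduced in this transcript, so treat the details as assumptions: the class name `DotProductBias`, the `y_range` default of `(0, 5.5)`, and the hand-written `sigmoid_range` helper are all illustrative stand-ins.

```python
import torch
from torch import nn

def create_params(size):
    # torch.zeros gives all zeros; normal_ (trailing underscore) then overwrites
    # them IN PLACE with samples from a Gaussian with mean 0 and std 0.01.
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

def sigmoid_range(x, low, high):
    # Squash x into the interval (low, high).
    return torch.sigmoid(x) * (high - low) + low

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        # x has one row per rating: column 0 is the user id, column 1 the movie id.
        users = self.user_factors[x[:, 0]]     # plain indexing into the Parameter
        movies = self.movie_factors[x[:, 1]]
        res = (users * movies).sum(dim=1)      # dot product per row
        res = res + self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        return sigmoid_range(res, *self.y_range)

model = DotProductBias(n_users=944, n_movies=1665, n_factors=50)
batch = torch.tensor([[0, 1], [5, 10]])        # two made-up (user, movie) pairs
preds = model(batch)
print(preds.shape, model.movie_bias.shape)
```

Because every attribute is an `nn.Parameter`, `model.parameters()` yields all four tensors without any extra bookkeeping, which is what lets an optimizer train them.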
And here it is, right? It's a parameter containing a bunch of numbers that have been trained. And as we would expect, it's got 1,665 things in it, because that's how many movies we have. So a question from Jonah Raphael was, does torch.zeros not produce all zeros? Yes, torch.zeros does produce all zeros. But remember, a method that ends in an underscore changes, in place, the tensor it's being applied to. So if you look up PyTorch normal_, you'll see it fills the tensor with elements sampled from the normal distribution. So this is actually modifying this tensor in place, and that's why we end up with something which isn't just zeros. Now, this is the bit I find really fun: we've trained this model, but what did it do? How is it going about predicting who's gonna like what movie? Well, one of the things that's happened is we've created this movie bias parameter, which has been optimized. What we could do is find which movie IDs have the highest numbers here, and the lowest numbers. I think this is gonna start lowest. Then we can look inside our data loaders and grab the names of those movies for each of those five lowest numbers. And what's happened here? Well, we can see, broadly speaking, that it has printed out some pretty crappy movies. And why is that? Well, that's because when it does that matrix product that we saw in the Excel spreadsheet last week, it's trying to figure out who's gonna like what movie based on previous movies people have enjoyed or not. And then it adds movie bias, which can be positive or negative, and is a different number for each movie. So in order to do a good job of predicting whether you're gonna like a movie or not, it has to know which movies are crap. And so the crap movies are gonna end up with a very low movie bias parameter.
And so we can actually find out not only which movies people really don't like, but which movies people like less than one would expect given the kind of movie that it is. So Lawnmower Man 2, for example: not only is it apparently a crappy movie, but based on the kind of movie it is, a kind of high-tech pop sci-fi movie, people who like those kinds of movies still don't like Lawnmower Man 2. So that's what this means. It's nice that we can use a model not just to predict things, but to understand things about the data. If we sort it by descending, it'll give us the exact opposite. So here are movies that people enjoy even when they don't normally enjoy that kind of movie. For example, LA Confidential, a classic film noir detective movie with the Aussie Guy Pearce. Even if you don't really like film noir detective movies, you might like this one. Silence of the Lambs is a classic, I guess you'd say not horror, it's a suspense movie. Even people who don't normally like serial killer suspense movies tend to like this one. Now, the other thing we can do is not just look at what's happening in the bias. Oh, and by the way, we could do the same thing with users, and find out which user just loves movies, even the crappy ones, and vice versa. But what about the other thing? We didn't just have bias, we also had movie factors, which has got the number of movies as one axis and the number of factors as the other, and we passed in 50. What's in that huge matrix? Well, it's pretty hard to visualize such a huge matrix, and we're not gonna talk about the details, but you can do something called PCA, which stands for Principal Component Analysis, and that basically tries to compress those 50 columns down into three columns. And then we can draw a chart of the top two.
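The lecture does not go into how PCA works, but the compression step can be sketched in a few lines. This is one common way to compute it (center the columns, then take a singular value decomposition), not necessarily what the notebook calls. The helper name `pca_project` is hypothetical, and the matrix here is random stand-in data rather than trained movie factors, so a plot of it would show no structure.

```python
import torch

torch.manual_seed(0)
# Stand-in for the trained movie_factors matrix: 1665 movies by 50 factors.
movie_factors = torch.randn(1665, 50)

def pca_project(x, n_components):
    """Project the rows of x onto their top principal components."""
    centered = x - x.mean(dim=0, keepdim=True)
    # Rows of vh are the principal directions, ordered from the direction
    # with the most variance to the direction with the least.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:n_components].T

components = pca_project(movie_factors, 3)            # 50 columns down to 3
x_coords, y_coords = components[:, 0], components[:, 1]  # the two axes to chart
print(components.shape)
```

With real trained factors, `x_coords` and `y_coords` would be the two axes of the chart, with each point labeled by its movie title.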
So this is PCA component number one, and this is PCA component number two, and here's a bunch of movies. This is a compressed view of these latent factors that it created, and you can see that they obviously have some kind of meaning, right? Over here towards the right, we've got very pop, mainstream kinds of movies. Over here on the left, we've got more of the critically acclaimed, gritty kinds of movies. Towards the top, we've got very action-oriented and sci-fi movies, and down towards the bottom, we've got very dialogue-driven movies. So remember, we didn't program in any of these things, and we don't have any data at all about what movie is what kind of movie. But thanks to the magic of SGD, we just told it to please try and optimize these parameters. And the way it was able to predict who would like what movie was that it had to figure out what kinds of movies there are, or what kind of taste there is for each movie. So I think that's pretty interesting. So this is called visualizing embeddings, and this is visualizing the bias. We obviously would rather not do everything by hand like this, or even like this, and fastai provides an application for this, collab_learner. So we can create one, and this is gonna look much the same as what we just had. We're gonna say how many latent factors we want, and what the y_range is to do the sigmoid and the multiply, and then we can do fit, and away it goes. So let's see how it does. All right, so it's done a bit better than our manual one. Let's take a look at the model it created. The model looks very similar to what we created in terms of the parameters. You can see these are the two embeddings and these are the two biases. And we can do exactly the same thing. We can look in that model, and you'll see it's not called movies, it's i for items. It's users and items. This is the item bias. So we can look at the item bias, grab the weights.
Sort them, and we get a very similar result. In this case, it's even more confident that LA Confidential is a movie that you should probably try watching even if you don't like those kinds of movies. And Titanic's right up there as well. Even if you don't really like romance-y kinds of movies, you might like this one. Even if you don't like classic detective movies, you might like this one. We can have a look at the source code for collab_learner, and we can see that use_nn is false by default, so our model is gonna be of this type, EmbeddingDotBias. So we can take a look at that. Here it is, and look, this does look very similar. It's creating an embedding using the size we requested for each of users by factors and items by factors, and for the user and item biases. Then, in the forward, it's grabbing each thing from the embedding, doing the multiply, adding it up, and doing the sigmoid. So yeah, it looks exactly the same. Isn't that neat? So you can see that what's actually happening in real models is not that weird or magic. So Curian is asking, is PCA useful in any other areas? And the answer is absolutely. What I suggest you do, if you're interested, is check out our computational linear algebra course. It's five years old now, but this is stuff which hasn't changed for decades, really. It will teach you all about things like PCA. It's not nearly as directly practical as Practical Deep Learning for Coders, but it's definitely very interesting, and it's the kind of thing which, if you wanna go deeper, can become pretty useful later along your path. Okay, so here's something else interesting we can do. Let's grab the movie factors. So that's in our model, it's the item weights, and it's the weight attribute that PyTorch creates. And now we can convert the movie Silence of the Lambs into its class ID. We can do that with object to ID, o2i, for the titles.
And so that's the movie index of Silence of the Lambs. What we can do now is look through all of the movies in our latent factors and calculate how far apart each embedding vector is from this one. This cosine similarity is very similar to the Euclidean distance, the root sum squared of the differences, but it normalizes it, so it's basically the angle between the vectors. So this is gonna calculate how similar each movie is to Silence of the Lambs, based on these latent factors. And then we can find which ID is the closest. So based on this embedding distance, the closest is Dial M for Murder, which makes a lot of sense. I'm not gonna discuss it today, but in the book there's also some discussion about what's called the bootstrapping problem, which is the question of, if you've got a new company or a new product, how would you get started with making recommendations, given that you don't have any previous history with which to make recommendations? That's a very interesting problem that you can read about in the book. Now, that's one way to do collaborative filtering, which is where we do that matrix completion exercise using all those dot products. There's a different way, however, which is that we can use deep learning. To do it with deep learning, we could create our user and item embeddings as per usual, and then we could create a sequential model. The sequential model is just layers of a deep learning neural network, in order. In forward, we could just concatenate the user and item embeddings together, and then do a ReLU. So this is basically a single hidden layer neural network, and then a linear layer at the end to create a single output.
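Going back for a moment to the similarity search described above, here is a small sketch of the cosine-similarity lookup. The `factors` matrix is random stand-in data and `query_idx` is an arbitrary row, both assumptions for illustration; in the lecture the matrix is the trained item embedding weights and the index comes from the title lookup.

```python
import torch
from torch import nn

torch.manual_seed(0)
factors = torch.randn(6, 4)    # stand-in: 6 items with 4 latent factors each
query_idx = 2                  # stand-in for the index of the query movie

# Cosine similarity of every row against the query row (broadcast over rows).
sims = nn.CosineSimilarity(dim=1)(factors, factors[query_idx][None])

# The same quantity by hand: dot product divided by the product of the norms.
manual = (factors @ factors[query_idx]) / (
    factors.norm(dim=1) * factors[query_idx].norm()
)

# The query is always most similar to itself, so mask it out before picking.
masked = sims.clone()
masked[query_idx] = float("-inf")
nearest = masked.argmax().item()
print(nearest, sims)
```

Dividing by the norms is the normalization mentioned above: only the direction of each embedding vector matters, not its length.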
So this is the world's most simple neural net, exactly the same style as the one we created back here in our neural net from scratch, but we're using PyTorch's functionality to do it more easily. So in the forward here, in exactly the same way as we have before, we'll look up the user embeddings and we'll look up the item embeddings. And then this is new: this is where we concatenate those two things together, put the result through our neural network, and then finally do our sigmoid. Now, one thing that's different this time is that we're going to ask fastai to figure out how big our embeddings should be. fastai has something called get_emb_sz, and it just uses a rule of thumb that says for 944 users, we recommend 74-factor embeddings, and for 1665 movies (or is it the other way around? I can't remember) we recommend 102-factor embeddings. So that's what those sizes are. So now we can create that model, pop it into a learner, and fit in the usual way. And rather than doing all that from scratch, you can do exactly the same thing that we've done before, which is to call collab_learner, but you can pass in the parameter use_nn=True, and you can then say how big you want each layer. So this is gonna create a two hidden layer deep learning neural net. The first will have 100 activations and the second will have 50, and then you can say fit, and away it goes. Okay, so here we've got 0.87. So these are doing less well than our dot product version, which is not too surprising, because the dot product version is really trying to take advantage of our understanding of the problem domain. In practice nowadays, a lot of companies create a combined model that has a dot product component and also has a neural net component.
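A sketch of the kind of model described here, written in plain PyTorch. The class name `CollabNN`, the hidden size `n_act=100`, and the `y_range` default are assumptions for illustration; the embedding sizes `(944, 74)` and `(1665, 102)` are the ones read out above.

```python
import torch
from torch import nn

class CollabNN(nn.Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        super().__init__()
        # Each size is (number of categories, number of latent factors).
        self.user_factors = nn.Embedding(*user_sz)
        self.item_factors = nn.Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act),  # single hidden layer
            nn.ReLU(),
            nn.Linear(n_act, 1),                        # single output
        )
        self.y_range = y_range

    def forward(self, x):
        # Look up both embeddings, then concatenate them side by side.
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        out = self.layers(torch.cat(embs, dim=1))
        low, high = self.y_range
        return torch.sigmoid(out) * (high - low) + low

model = CollabNN(user_sz=(944, 74), item_sz=(1665, 102))
batch = torch.tensor([[0, 1], [5, 10]])   # two made-up (user, item) pairs
preds = model(batch)
print(preds.shape)
```

Because the two embeddings are concatenated rather than multiplied, they no longer need to be the same width, which is why the user and item sizes can differ here.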
The neural net component is particularly helpful if you've got metadata, for example information about your users, like when did they sign up, how old are they, what sex are they, where are they from. Those are all things that you could concatenate in with your embeddings, and ditto with metadata about the movie: how old is it, what genre is it, and so forth. All right, so we've got a question from Jonah, which I think is interesting. The question is, is there an issue where the bias components are overwhelmingly determined by the non-experts in a genre? Actually, there's a more general issue, which is that in collaborative filtering recommendation systems, very often a small number of users or a small number of movies overwhelm everybody else. The classic one is anime. A relatively small number of people watch anime, and those groups of people watch a lot of anime. So in movie recommendations there's a classic problem, which is that every time people try to make a list of well-loved movies, all the top ones seem to be anime. You can imagine what's happening in the matrix completion exercise is that there are some users that just really watch this one genre of movie, and they watch an awful lot of them. So in general, you actually do have to be pretty careful about these subtle kinds of issues. I won't go into details about how to deal with them, but they generally involve taking various kinds of ratios, or normalizing things, and so forth. All right, so that's collaborative filtering. And I wanted to show you something interesting about embeddings, which is that embeddings are not just for collaborative filtering. In fact, if you've heard about embeddings before, you've probably heard about them in the context of natural language processing. So you might have been wondering, back when we did the Hugging Face Transformers stuff, how did we go about using text as inputs to models?
And we talked about how you can turn words into integers. We make a list. So here's the poem, "I Am Sam": "I am Daniel, I am Sam, Sam I am, that Sam-I-am", et cetera, et cetera. We can find a list of all the unique words in that poem and make this list here. And then we can give each of those words a unique ID, just arbitrarily. Well, actually, in this case it's alphabetical order, but it doesn't have to be. So we talked about that, and that's what we do with categories in general. But how do we turn those into lists of random numbers? You might not be surprised to hear that what we do is create an embedding matrix. So here's an embedding matrix containing four latent factors for each word in the vocab. Here's each word in the vocab, and here's the embedding matrix. If we then want to present this poem to a neural net, what we do is list out our poem: "I do not like that Sam-I-am, do you like green eggs and ham", et cetera. Then for each word, we look it up. In Excel, for example, we use MATCH. So that will find this word over here and find it is word ID eight. And then we will find the eighth word and the first embedding. And so that gives us, that's not right. Eight. Oh no, that is right. Sorry, here it is. It's just a weird column. So that's gonna be 0.22, then 0.1, 0.01. And here it is: 0.22, 0.1, 0.01, et cetera. So this is the embedding matrix we end up with for this poem. If you wanted to train or use a neural network on this poem, you basically turn it into this matrix of numbers. So this is what an embedding matrix looks like in an NLP model, and it works exactly the same way, as you can see. And then you can do exactly the same things in terms of interpretation of an NLP model, by looking at both the bias factors and the latent factors in a word embedding matrix.
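The spreadsheet lookup described above can be sketched in PyTorch as follows. This is an illustration under stated assumptions: only one lowercased line of the poem is used, the embedding values are random rather than the spreadsheet's, and the word IDs therefore will not match the ones in the Excel sheet.

```python
import torch
from torch import nn

torch.manual_seed(0)
words = "i do not like that sam i am".split()

# Build the vocab: unique words in alphabetical order, each with an integer ID.
vocab = sorted(set(words))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([word_to_id[w] for w in words])

# One row of four latent factors per vocab word.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

# Looking words up is just indexing rows of the embedding matrix.
poem_matrix = embedding(ids)
print(vocab)
print(poem_matrix.shape)
```

The resulting `poem_matrix` has one row per word of the text, in order, which is the matrix of numbers that would be presented to the neural net. A repeated word (here "i") gets the same row each time it appears.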
So hopefully you're getting the idea here that our different models, and the inputs to them, are based on a relatively small number of basic principles. These principles are generally things like: look up something in an array. And then we know that inside the model we're basically multiplying things together, adding them up, and replacing the negatives with zeros. So hopefully you're getting the idea that what's going on inside a neural network is generally not that complicated, but it happens very quickly and at scale. Now, it's not just collaborative filtering and NLP, but also tabular analysis. In chapter nine of the book, we've talked about how random forests can be used for this, for the problem where we're predicting the auction sale price of industrial heavy equipment like bulldozers. Instead of using a random forest, we can use a neural net. Now, in this dataset, there are some continuous columns and there are some categorical columns. I'm not gonna go into the details too much, but in short, we can separate out the continuous columns and categorical columns using cont_cat_split, and that will automatically find which is which, based on their data types. In this case, it looks like the continuous columns include the elapsed sale date, which I think is the number of seconds or years or something since the start of the dataset. And then here are the categorical variables. For example, there are six different product sizes, two coupler systems, 5,059 model descriptions, six enclosures, 17 tire sizes, and so forth. So we can use fastai to basically say, okay, we'll take that data frame, pass in the categorical and continuous variables, create some random splits, and say what the dependent variable is. And we can create data loaders from that. And from that, we can create a tabular learner.
And basically what that's gonna do is create a pretty regular multi-layer neural network, not that different to this one that we created by hand. For each of the categorical variables, it's gonna create an embedding. And I can actually show you this, right? We're gonna use tabular_learner to create the learner, and tabular_learner is one, two, three, four, five, six, seven, eight, nine lines of code. The main thing it does is create a TabularModel. And then TabularModel, you're not gonna understand all of it, but you might be surprised at how much. So a TabularModel is a module. We're gonna be passing in how big each embedding is gonna be. And in tabular_learner, what's that passing in? It's gonna call get_emb_sz, just like we did manually before, automatically. So that's how it gets its embedding sizes. Then it's going to create an embedding for each of those embedding sizes, from number of inputs to number of factors. Dropout we're gonna come back to later, and batch norm we won't do till part two. Then it's gonna create a layer for each of the layers we want, which is gonna contain a linear layer followed by batch norm followed by dropout. It's gonna add the sigmoid range we've talked about at the very end. And so the forward, this is the entire thing: if there are some embeddings, it'll go through and get each of the embeddings using the same indexing approach we've used before. It'll concatenate them all together, and then it'll run it through the layers of the neural net, which are these. So we don't know all of those details yet, but we know quite a few of them, so that's encouraging, hopefully. And once we've got that, we can do the standard lr_find and fit. Now, this exact dataset was used in a Kaggle competition. And the third place getter published a paper about the technique, and it's almost the exact one I'm showing you here.
So it wasn't this dataset, it was a different one. It was about predicting the amount of sales in different stores. But they used this basic kind of technique. One of the interesting things is that they used a lot less manual feature engineering than the other high-placed entries. They had a much simpler approach. And they published a paper about their approach. So this is the team from this company, and they basically describe here exactly what I just showed you: these different embedding layers being concatenated together and then going through a couple of layers of a neural network. It points out in the paper exactly what we learned in the last lesson, which is that embedding layers are exactly equivalent to linear layers on top of a one-hot encoded input. And they found that their technique worked really well. One of the interesting things they also showed is that you can create your neural net, get your trained embeddings, and then put those embeddings into a random forest or gradient boosted tree, and your mean average percent error will dramatically improve. So you can actually combine random forests and embeddings, or gradient boosted trees and embeddings, which is really interesting. Now, what I really wanted to show you, though, is what they then did. As I said, this was about predicting the amount that different products would sell for in different shops around Germany. One of their embedding matrices was embeddings by region. And then they did, I think this is a PCA, principal component analysis, of the embeddings for their German regions. And when they create a chart of them, you can see that the locations that are close together in the embedding matrix are the same locations that are close together in Germany.
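The equivalence mentioned above, that an embedding lookup is the same as a linear layer (a matrix multiply with no bias) applied to a one-hot encoded input, can be checked directly. The sizes and indices below are arbitrary stand-ins chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_items, n_factors = 5, 3
weights = torch.randn(n_items, n_factors)   # stand-in embedding matrix
idx = torch.tensor([3, 0, 3])               # three category IDs to look up

# Route 1: embedding lookup, which is just indexing rows.
looked_up = weights[idx]

# Route 2: one-hot encode each ID, then do a matrix multiply.
one_hot = F.one_hot(idx, num_classes=n_items).float()
multiplied = one_hot @ weights

print(torch.allclose(looked_up, multiplied))
```

Each one-hot row is all zeros except for a single one, so the multiply just copies out one row of `weights`. Indexing gets the same answer without building the one-hot matrix or doing all the multiplications by zero, which is the compute trick.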
So you can see here's the blue ones, and here's the blue ones. And again, it's important to recognize that the data that they used had no information about the location of these places. The fact that they are close together geographically is something that was figured out as being something that actually helped it to predict sales. In fact, they then did a plot where each of these dots is a pair of stores, and it's showing, for each pair of stores, how far apart they are in real life, in metric space, and then how far apart they are in embedding space. And there's this very strong correlation, right? So it's somehow reconstructed the geography of Germany by figuring out how people shop. And it's similar for days of the week. There was no information, really, about days of the week, but when they plotted the embedding matrix, the days of the week Monday, Tuesday, Wednesday are close to each other, Thursday and Friday are close to each other, and as you can see, Saturday and Sunday are close to each other. And ditto for months of the year: January, February, March, April, May, June. So yeah, really interesting, cool stuff, I think, what's actually going on inside a neural network. All right, let's take a 10 minute break, and I will see you back here at 7:10. All right, folks, this is something I think is really fun. We've looked at what goes into the start of a model, the input. We've learned about how they can be categories or embeddings, and embeddings are basically one-hot encoded categories with a little compute trick, or they can just be continuous numbers. We've learned about what comes out the other side, which is a bunch of activations, so just a tensor of numbers, which we can use things like softmax on to constrain them to add up to one, and so forth. And we've looked at what can go in the middle, which is the matrix multiplies, sandwiched together with rectified linear units.
And I mentioned that there are other things that can go in the middle as well, but we haven't really talked about what those other things are. So I thought we might look at one of the most important and interesting versions of things that can go in the middle. What you'll see is that it turns out it's actually just another kind of matrix multiplication, which might not be obvious at first, but I'll explain. We're gonna look at something called a convolution, and convolutions are at the heart of a convolutional neural network. So the first thing to realize is that a convolutional neural network is very, very, very similar to the neural networks we've seen so far. It's got inputs, and it's got things that are a lot like, or actually are a form of, matrix multiplication, sandwiched with activation functions, which can be rectified linear. But there's a particular thing which makes them very useful for computer vision. I'm gonna show you using this Excel spreadsheet that's in our repo, called conv-example. We're gonna look at an image from MNIST. MNIST is the world's most famous computer vision dataset, I think, because it was the first one which really showed image recognition being cracked. It's pretty small by today's standards. It's a dataset of handwritten digits, and each one is 28 by 28 pixels. But back in the mid 90s, Yann LeCun showed really practically useful performance on this dataset, and as a result ended up with convnets being used in the American banking system for reading checks. So here's an example of one of those digits. This is a seven that somebody drew, one of those ones with a stroke through it, and this is what it looks like. This is the image. This is just one of the images from MNIST, which I put into Excel. And what you see in the next column is a version of the image where the horizontal lines are being recognized, and another one where the vertical lines are being recognized.
And if you think back to that Zeiler and Fergus paper that talked about what the layers of a neural net do, this is absolutely an example of something that we know the first layer of a neural network tends to learn how to do. Now, how did I do this? I did this using something called a convolution. And so what we're gonna do now is we're gonna zoom in to this Excel notebook. We're gonna keep zooming in. We're gonna keep zooming in. So take a look, keep an eye on this image, and you'll see that once we zoom in enough it's actually just made of numbers, which, as we discussed in the very first lesson, is what images are made of. So here they are, right? Here are the numbers, between zero and one. And what I just did is I just used a little trick. I used Microsoft Excel's conditional formatting to basically make the higher numbers more red. So that's how I turned this Excel sheet into an image. And I've just rounded the numbers off to the nearest decimal, but they actually have more decimal places than that. And so, yeah, so here is the image as numbers. And so let me show you how we went about creating this top edge detector. What we did was we created this formula. Don't worry about the max. Let's focus on this. What it's doing is, have a look at the colored-in areas. It's taking each of these cells and multiplying them by each of these cells and then adding them up. And then we do the rectified linear part, which is if that ends up less than zero, then make it zero. So this is like a rectified linear unit, but it's not doing the normal matrix product. It's doing the equivalent of a dot product, but just on these nine cells and with just these nine weights. So you might not be surprised to hear that if I now move one to the right, then it's using the next nine cells. So if I move to the right quite a bit and down quite a bit, to here, it's using these nine cells.
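As a rough sketch of the spreadsheet formula described above (multiply nine cells by nine weights, add them up, then clip anything below zero), here is a plain-Python version. The function name `window_activation`, the tiny 4x4 `image`, and the particular weight values are assumptions made for illustration; they are not copied from the spreadsheet.

```python
def window_activation(image, weights, row, col):
    """ReLU of the dot product between a 3x3 patch and 3x3 weights.

    (row, col) is the top-left corner of the patch inside `image`.
    """
    total = 0.0
    for i in range(3):
        for j in range(3):
            total += image[row + i][col + j] * weights[i][j]
    # The "max" in the spreadsheet formula: negative sums become zero.
    return max(total, 0.0)


# Hypothetical pixel values between zero and one (a thick horizontal stroke).
image = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Illustrative weights: ones on top, zeros in the middle, minus ones below.
weights = [
    [1, 1, 1],
    [0, 0, 0],
    [-1, -1, -1],
]

at_top = window_activation(image, weights, 0, 0)     # sum is -3, clipped to 0.0
one_down = window_activation(image, weights, 1, 0)   # sum is 3.0
one_right = window_activation(image, weights, 1, 1)  # same weights, next nine cells

print(at_top, one_down, one_right)
```

Moving the window only changes `row` and `col`; the nine weights stay the same at every position.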
So it's still doing a dot product, which as we know is a form of matrix multiplication, but it's doing it in this way where it's kind of taking advantage of the geometry of the situation: the things that are close to each other are being multiplied by this consistent group of the same nine weights each time. Because there's actually 28 by 28 numbers here, right? Which is 28 times 28, so 784. But we don't have 784 parameters, we only have nine parameters. And so this is called a convolution. So a convolution is where you basically slide this kind of little three by three matrix across a bigger matrix. And at each location, you do a dot product of the corresponding elements of that three by three with the corresponding elements of this three by three matrix of coefficients. Now, why does that create something that finds, as you see, top edges? Well, it's because of the particular way I constructed this three by three matrix. What I said was that all of the cells in the row just above, so these ones, are gonna get a one, all of the ones just below are gonna get a minus one, and all of the ones in the middle are gonna get a zero. So let's think about what happens somewhere like here. Right, that is, let's try to find the right one. Here it is. So here, we're gonna get one times one plus one times one plus one times one, minus one times one, minus one times one, minus one times one. We're gonna get zero. But what about up here? Here, we're gonna get one times one plus one times one plus one times one. These do nothing because they're times zero, and then minus one times zero. So we're gonna get three. So we're only gonna get three, the highest possible number, in the situation where these are all as black as possible, or in this case, as red as possible, and these are all white. And so that's only gonna happen at a horizontal edge. So the one underneath it does exactly the same thing, exactly the same formulas. Oopsie-daisy.
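To make the sliding idea concrete, here is a minimal plain-Python sketch of a convolution with the kernel just described (ones above, zeros in the middle, minus ones below), followed by the clip-at-zero step. The helper name `conv2d_relu` and the 6x5 test image are assumptions for illustration, not the MNIST digit from the spreadsheet.

```python
def conv2d_relu(image, kernel):
    """Slide `kernel` over `image`; at each spot take the dot product, then ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            out_row.append(max(total, 0))
        out.append(out_row)
    return out


# The kernel as described: ones in the top row, zeros, then minus ones.
top_edge_kernel = [
    [1, 1, 1],
    [0, 0, 0],
    [-1, -1, -1],
]

# Hypothetical image: a thick horizontal stroke (rows 1 to 3) on a blank background.
image = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

edges = conv2d_relu(image, top_edge_kernel)
for row in edges:
    print(row)
# → [0, 0, 0]   stroke only under the window: negative sum, clipped to 0
# → [0, 0, 0]   fully inside the stroke: 3 - 3 = 0
# → [3, 3, 3]   ones under the top row, blank under the bottom row
# → [3, 3, 3]
```

However large the image is, the layer still has only the nine numbers in `top_edge_kernel` as parameters.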
The one underneath has exactly the same formulas, a three by three sliding thing here, but this time we've got a different little mini matrix of coefficients, which is all ones going down one side and all minus ones going down the other. And so for exactly the same reason, this will only be three in situations where they're all one here and they're all zero here. So you can think of a convolution as being a sliding window of little mini dot products of these little three by three matrices. And they don't have to be three by three, right? We could just as easily have done five by five, and then we'd have a five by five matrix of coefficients, or whatever size you like. So the size of this is called its kernel size. This is a three by three kernel for this convolution. So then, because this is deep learning, we just repeat these steps again and again and again. So this layer I'm calling conv one, it's the first convolutional layer. So conv two, it's gonna be a little bit different, because on conv one we only had a single channel input. It's just black and white, or, you know, grayscale, one channel. But now we've got two channels. Let's make it a little smaller so we can see better. We've got the horizontal edges channel and the vertical edges channel. And we'd have a similar thing in the first layer if it's color: we'd have a red channel, a green channel and a blue channel. So now our filter, this little mini matrix is called the filter, our filter now contains a three by three by depth two. Or if you want to think of it another way, two three by three kernels, or one three by three by two kernel. And we basically do exactly the same thing, which is we're gonna multiply each of these by each of these and sum them up. But then we do it for the second bit as well. We multiply each of these by each of these and sum them up.
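The two-channel filter described above can be sketched like this: one 3x3 kernel per input channel, with all the products summed into a single output value per window position. The function name, the two tiny input channels, and the kernel values are made up for illustration, and applying a ReLU here is an assumption carried over from the first layer.

```python
def conv2d_multichannel(channels, filt):
    """Apply one filter (a 3x3 kernel per input channel) and sum across channels."""
    k = len(filt[0])
    out_h = len(channels[0]) - k + 1
    out_w = len(channels[0][0]) - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            total = 0.0
            for ch, kernel in enumerate(filt):
                for i in range(k):
                    for j in range(k):
                        total += channels[ch][r + i][c + j] * kernel[i][j]
            out[r][c] = max(total, 0.0)  # ReLU, assumed as in the first layer
    return out


# Hypothetical outputs of the first layer: one channel per edge detector.
horizontal_edges = [
    [0, 0, 0, 0],
    [3, 3, 3, 3],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
vertical_edges = [
    [0, 2, 0, 0],
    [0, 2, 0, 0],
    [0, 2, 0, 0],
    [0, 2, 0, 0],
]

# One filter of depth two: 2 * 3 * 3 = 18 weights (values chosen arbitrarily).
filt = [
    [[0, 0, 0], [1, 1, 1], [0, 0, 0]],  # applied to the horizontal-edge channel
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],  # applied to the vertical-edge channel
]

combined = conv2d_multichannel([horizontal_edges, vertical_edges], filt)
print(combined)  # → [[15.0, 9.0], [6.0, 0.0]]
```

The largest value appears where both a horizontal and a vertical edge fall under the window, which is the sense in which the second layer combines features of the two channels. A second output channel would just be another filter with its own 18 weights.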
And so that gives us, and I think I just picked some random numbers here, right? So this is now gonna be something which can combine the two. Oh, sorry, the second one, the second set. So it's each of the red ones by each of the blue ones, that's here, plus each of the green ones times each of the mauve ones, that's here. So this first filter is being applied to the horizontal edge detector, and the second filter is being applied to the vertical edge detector. And as a result, we can end up with something that combines features of the two things. And so then we can have a second channel over here, which is just a different bunch of convolutions for each of the two channels, this one times this one. Again, you can see the colors. So once we get to the end, as I'll show you in a moment, we'll end up with a single set of 10 activations, one per digit we're recognizing, zero to nine. Or in this case, I think we could just create one. You know, maybe we're just trying to recognize nothing but the number seven or not the number seven, so we could just have one activation. And then we would backpropagate through this using SGD in the usual way, and that is going to end up optimizing these numbers. So in this case, I manually put in the numbers I knew would create edge detectors. In real life, you start with random numbers and then you use SGD to optimize these parameters. Okay, so there's a few things we can do next. And I'm going to show you the way that was more common a few years ago, and then I'll explain some changes that have been made more recently. What happened a few years ago was we would then take these activations, which, as you can see, are now kind of in a grid pattern, and we would do something called max pooling. And max pooling is kind of like a convolution. It's a sliding window.
But this time as the sliding window goes across, so here we're up to here, we don't do a dot product over a filter. Instead, we just take a maximum. See here, this is just the maximum of these four numbers. And if we go across a little bit, this is the maximum of these four numbers. Go across a bit, go across a bit, and so forth. Oh, that goes off the edge. And this is called a two by two max pooling. So you can see what happens with a two by two max pooling: we end up losing half of our activations on each dimension. So we're gonna end up with only one quarter of the number of activations we used to have. And that's actually a good thing, because if we keep on doing convolution, max pool, convolution, max pool, we're gonna get fewer and fewer and fewer activations until eventually we'll just have one left, which is what we want. That's effectively what we used to do. But the other thing I should mention is we didn't normally keep going until there's only one left. What we used to then do is we'd basically say, okay, at some point, we're gonna take all of the activations that are left and we're gonna basically just do a dot product of those with a bunch of coefficients, not as a convolution, but just as a normal linear layer. And this is called the dense layer. And then we would add them all up. So we basically end up with our final big dot product of all of the max pooled activations by all of the weights. And we'd do that for each channel. And so that would give us our final activation. And as I say here, MNIST would actually have 10 activations. So you'd have a separate set of weights for each of the digits you're predicting, and then softmax after that. Okay, nowadays we do things very slightly differently. Nowadays we normally don't have max pool layers. Instead, what we normally do is change how we move our sliding window, like this one here. Let's go back to C.
So when I go one to the right, so currently we're starting in column G. If I go one to the right, the next one is column H. And if I go one to the right, the next one starts in column I. So you can see it's sliding the three by three window along one column at a time. Nowadays what we tend to do instead is we generally skip one. So we would normally only look at every second one. So after doing column I, we would skip column J and go straight to column K. And that's called a stride two convolution. We do that both across the rows and down the columns. And what that means is every time we do a convolution, we reduce our effective feature size, our grid size, by two on each axis. So it reduces the number of activations by four in total. So that's basically what we do instead of max pooling. And then the other thing that we do differently is nowadays we don't normally have a single dense layer at the end, a single matrix multiply at the end. But instead what we do, we generally keep doing stride two convolutions. So each one's gonna reduce the grid size by two by two. We keep going down until we've got about a seven by seven grid. And then we do a single pooling at the end. And we don't normally do max pool nowadays. Instead we do an average pool. So we average the activations of each one of the seven by seven features. This is actually quite important to know, because if you think about what that means, it means that something like an ImageNet-style image detector is gonna end up with a seven by seven grid. And let's say it's trying to say, is this a bear? In each of the parts of the seven by seven grid, it's basically saying, is there a bear in this part of the photo? Is there a bear in this part of the photo? Is there a bear in this part of the photo? And then it takes the average of those 49, seven by seven, predictions to decide whether there's a bear in the photo. That works very well if it's basically a photo of a bear, right?
Because, you know, if the bear is big and takes up most of the frame, then most of those seven by seven bits are bits of a bear. On the other hand, if it's a teeny tiny bear in the corner, then potentially only one of those 49 squares has a bear in it. And even worse, if it's like a picture of lots and lots of different things, only one of which is a bear, it could end up not being a great bear detector. And so this is where the details of how we construct our model turn out to be important. And so if you're trying to find just one part of a photo that has a small bear in it, you might decide to use maximum pooling instead of average pooling. Because max pooling will just say, I think this is a picture of a bear if any one of those 49 bits of my grid has something that looks like a bear in it. So these are potentially important details which often get hand-waved over. Although, again, the key thing here is that this is happening right at the very end, right? That max pool or that average pool. And actually fastai handles this for you. We do a special thing which we kind of independently invented. I think we did it first, which is we do both max pool and average pool and we concatenate them together. And we call that concat pooling. And that has since been reinvented in at least one paper. And so that means that you don't have to think too much about it, because we're gonna try both for you, basically. So I mentioned that this is actually really just matrix multiplication. And to show you that, I'm gonna show you some images created by a guy called Matthew Kleinsmith, who did this actually, I think, in our very first ever course, or it might've been the first part two course. And he basically pointed out that, in a certain way of thinking about it, it turns out that convolution is the same thing as a matrix multiply. So I wanna show you how he shows this.
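Before moving on to the matrix view, here is a small plain-Python sketch of the pooling trade-off described above, on a 7x7 grid of hypothetical "is there a bear here?" scores. The helper names, the scores, and the order in which `concat_pool` returns its two values are assumptions for illustration; fastai's own concat pooling layer operates on tensors per channel rather than on a nested list.

```python
def avg_pool(grid):
    """Average of every cell in the final grid."""
    cells = [v for row in grid for v in row]
    return sum(cells) / len(cells)


def max_pool(grid):
    """Largest single cell in the final grid."""
    return max(v for row in grid for v in row)


def concat_pool(grid):
    """Keep both summaries and let the following layer decide how to use them."""
    return [max_pool(grid), avg_pool(grid)]


# A bear filling the frame: every one of the 49 cells is fairly confident.
big_bear = [[0.9] * 7 for _ in range(7)]

# A teeny tiny bear in the corner: only one of the 49 cells sees it.
tiny_bear = [[0.0] * 7 for _ in range(7)]
tiny_bear[6][0] = 0.98

print(avg_pool(big_bear), max_pool(big_bear))    # both close to 0.9
print(avg_pool(tiny_bear), max_pool(tiny_bear))  # average about 0.02, max 0.98
print(concat_pool(tiny_bear))                    # both numbers are passed on
```

With average pooling the tiny bear is almost invisible, while max pooling still reports it; concatenating the two keeps both signals available.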
He basically says, okay, let's take this three by three image and a two by two kernel containing the coefficients alpha, beta, gamma, delta. And so as we slide the window over, each of the colors are multiplied together: red by red, plus green by green, plus, what is that, orange by orange, plus blue by blue gives you this. And so to put it another way, algebraically P equals alpha times A plus beta times B, et cetera. And so then as we slide to this part, we're multiplying again, red by red, green by green, and so forth. So we can say Q equals alpha times B plus beta times C, et cetera. And so this is how we calculate a convolution using the approach we just described, as a sliding window. But here's another way of thinking about it. We could say, okay, we've got all these different things, A, B, C, D, E, F, G, H, J. Let's put them all into a single vector. And then let's create a single matrix that has alpha, alpha, alpha, alpha, beta, beta, beta, beta, et cetera. And then if we do this matrix multiplied by this vector, we get this, with these gray zeros in the appropriate places, which gives us this, which is the same as this. And so this shows that a convolution is actually a special kind of matrix multiplication. It's a matrix multiplication where there are some zeros that are fixed and some numbers that are forced to be the same. Now in practice, it's gonna be faster to do it the sliding window way, but it's a useful kind of thing to think about, I think, just to realize like, oh, it's just another of these special types of matrix multiplications. Okay, I think, well, let's look at one more thing, because there was one other thing that we saw, and I mentioned we would look at it, in the tabular model, which is called dropout. And I actually have this in my Excel spreadsheet. If you go to the conv-example dropout page, you'll see we've actually got a little bit more stuff here.
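Stepping back to the matrix view of convolution for a moment, here is a hedged plain-Python check of that equivalence for a 3x3 image and a 2x2 kernel. The numeric values standing in for A to J and for alpha, beta, gamma, delta are arbitrary, and the helper names are assumptions for illustration.

```python
def conv2d_sliding(image, kernel):
    """Plain sliding-window convolution (no activation function)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        out.append([
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ])
    return out


def conv_as_matrix(kernel, img_h, img_w):
    """Build the matrix whose product with the flattened image is the convolution.

    Each row corresponds to one window position: the kernel weights sit in the
    columns of the pixels that window covers, and every other entry is a fixed zero.
    """
    kh, kw = len(kernel), len(kernel[0])
    rows = []
    for r in range(img_h - kh + 1):
        for c in range(img_w - kw + 1):
            row = [0] * (img_h * img_w)
            for i in range(kh):
                for j in range(kw):
                    row[(r + i) * img_w + (c + j)] = kernel[i][j]
            rows.append(row)
    return rows


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]   # stands in for A..J
kernel = [[1, 2],
          [3, 4]]     # stands in for alpha, beta, gamma, delta

flat_image = [v for row in image for v in row]
conv_matrix = conv_as_matrix(kernel, 3, 3)

via_sliding = [v for row in conv2d_sliding(image, kernel) for v in row]
via_matmul = matvec(conv_matrix, flat_image)

for row in conv_matrix:
    print(row)
print(via_sliding, via_matmul)  # → [37, 47, 67, 77] [37, 47, 67, 77]
```

The printed matrix has 36 entries but only four distinct weights, each repeated four times, with zeros everywhere else, which is the "fixed zeros and tied numbers" structure described above.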
We've got the same input as before and the same first convolution as before and the same second convolution as before. And then we've got a bunch of random numbers. They're showing rounded off, but they're actually random numbers that are floats between zero and one. Over here, we're then saying if, so way up here, I'll zoom in a bit, I've got a dropout factor. Let's change this to 0.5, there we go. So over here, this is something that says if the random number in the equivalent place is greater than 0.5, then one, otherwise zero. And so here's a whole bunch of ones and zeros. Now this thing here is called a dropout mask. Now what happens is, over here, we multiply the dropout mask by our filtered image. And what that means is we end up with the same image we started with, here's the image we started with, but it's corrupted. Random bits of it have been deleted. And it depends on the amount of dropout we use. So if we change it to say 0.2, not very much of it's deleted at all, so it's still very easy to recognize. Whereas if we use lots of dropout, say 0.8, it's almost impossible to see what the number was. And then we use this as the input to the next layer. So that seems weird. Why would we delete some data at random from our processed image, from our activations after a layer of the convolutions? Well, the reason is that a human is able to look at this corrupted image and still recognize it's a seven. And the idea is that a computer should be able to as well. And if we randomly delete different bits of the activations each time, then the computer is forced to learn the underlying real representation rather than overfitting. You can think of this as data augmentation, but it's data augmentation not for the inputs, but data augmentation for the activations. So this is called a dropout layer. And so dropout layers are really helpful for avoiding overfitting.
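The spreadsheet's dropout step, as described above, can be sketched in a few lines of plain Python: draw a random number per activation, keep the activation where that number is greater than the dropout factor, and zero it otherwise. The function names, the small `activations` grid, and the fixed seed are assumptions for illustration.

```python
import random


def dropout_mask(height, width, p, rng):
    """1 where the random number exceeds the dropout factor p, otherwise 0."""
    return [[1 if rng.random() > p else 0 for _ in range(width)]
            for _ in range(height)]


def apply_mask(activations, mask):
    """Elementwise product: masked-out activations become zero."""
    return [[a * m for a, m in zip(act_row, mask_row)]
            for act_row, mask_row in zip(activations, mask)]


def kept_count(p, seed=0):
    """How many of 12 activations survive for dropout factor p (same random draws)."""
    mask = dropout_mask(3, 4, p, random.Random(seed))
    return sum(sum(row) for row in mask)


# Hypothetical activations coming out of a convolutional layer.
activations = [
    [0.5, 1.0, 0.0, 2.0],
    [1.5, 0.0, 3.0, 0.5],
    [0.0, 2.5, 1.0, 1.0],
]

mask = dropout_mask(3, 4, 0.5, random.Random(0))
corrupted = apply_mask(activations, mask)

light = kept_count(0.2)  # little dropout: most activations survive
heavy = kept_count(0.8)  # lots of dropout: few activations survive

print(mask)
print(corrupted)
print(light, heavy)
```

One difference worth knowing: PyTorch's `nn.Dropout` also scales the surviving activations by 1/(1-p) during training and does nothing at inference time, whereas this sketch, like the spreadsheet as described, only zeroes values.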
And you can decide how much you want to compromise between good generalization, so avoiding overfitting, versus getting something that works really well on the training data. And so the more dropout you use, the less good it's going to be on the training data, but the better it ought to generalize. And so this comes from a paper by Geoffrey Hinton's group quite a few years ago now. Ruslan's now at Apple, I think. And then Krizhevsky and Hinton went on to found Google Brain. And you can see here, they've got this picture of a fully connected neural network, two layers just like the one we built. And here, look, they're kind of randomly deleting some of the activations. And all that's left is these connections. And so that's a different bunch that's going to be deleted each batch. I thought this was an interesting point. So dropout, which is super important, was actually developed in a master's thesis, and it was rejected from the main neural networks conference, then called NIPS, now called NeurIPS. So it ended up being disseminated through arXiv, which is a preprint server. And yes, it's just been pointed out on our chat that Ilya was one of the founders of OpenAI. I don't know what happened to Nitish. I think he went to Google Brain as well, maybe. Yeah, so peer review is a very fallible thing in both directions. And it's great that we have preprint servers so we can read stuff like this, even if reviewers decide it's not worthy. It's been one of the most important papers ever. Okay, now, I think that's given us a good tour. We've really seen quite a few ways of dealing with input to a neural network, and quite a few of the things that can happen in the middle of a neural network. We've only talked about rectified linear units, which is this one here: zero if x is less than zero, or x otherwise. These are some of the other activations you can use. Don't use this one, of course, because you end up with a linear model, but they're all just different functions.
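To see why an activation that just returns its input leaves you with a linear model, here is a small plain-Python sketch: two stacked layers with that identity activation give exactly the same answer as one layer whose weights are the product of the two, whereas putting a ReLU in between does not. The weight values, the input, and the helper names are arbitrary choices for illustration.

```python
def relu(x):
    """Zero if x is less than zero, x otherwise."""
    return max(x, 0.0)


def identity(x):
    """The activation not to use: it changes nothing."""
    return x


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def two_layer(x, w1, w2, activation):
    """Matrix multiply, activation, matrix multiply."""
    hidden = [activation(h) for h in matvec(w1, x)]
    return matvec(w2, hidden)


w1 = [[1.0, -1.0],
      [2.0, 0.0]]
w2 = [[1.0, 1.0]]
x = [1.0, 2.0]

with_identity = two_layer(x, w1, w2, identity)
single_layer = matvec(matmul(w2, w1), x)   # one combined weight matrix
with_relu = two_layer(x, w1, w2, relu)

print(with_identity, single_layer, with_relu)  # → [1.0] [1.0] [2.0]
```

Because the identity version always matches some single matrix multiply, stacking more such layers adds nothing; any of the non-linear functions breaks that collapse.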
I should mention, it turns out these don't matter very much. Basically, pretty much any non-linearity works fine. So we don't spend much time talking about activation functions, even in part two of the course, just a little bit. So yeah, so we understand there's our inputs. They can be one-hot encoded or embeddings, which is a computational shortcut. There are sandwiched layers of matrix multiplies and activation functions. The matrix multiplies can sometimes be special cases, such as the convolutions or the embeddings. The output can go through some tweaking, such as the softmax. And then, of course, you've got the loss function, such as cross entropy loss, or mean squared error, or mean absolute error. But there's nothing too crazy going on in there. So I feel like we've got a good sense now of what goes inside a wide range of neural nets. You're not gonna see anything too weird from here. And we've also seen a wide range of applications. So before you come back to do part two, you know, what now? And we're gonna have a little AMA session here. And in fact, one of the questions was, what now? So this is quite good. One thing I strongly suggest is if you've got this far, it's probably worth investing your time in reading Radek's book, which is Meta Learning. And so Meta Learning is very heavily based on the kind of teachings of fast.ai over the last few years, and is all about how to learn deep learning and learn pretty much anything. Yeah, because, you know, you've got to this point, you may as well know how to get to the next point as well as possible. And the main thing you'll see that Radek talks about, or one of the main things, is practicing and writing. So if you've kind of zipped through the videos on, you know, 2x and haven't done any exercises, you know, go back and watch the videos again.
You know, a lot of the best students end up watching them two or three times, probably more like three times, and actually go through and code as you watch, you know, and experiment. You know, write posts, blog posts about what you're doing, spend time on the forum, both helping others and seeing other people's answers to questions. Read the success stories on the forum and people's projects to get inspiration for things you could try. One of the most important things to do is to get together with other people. For example, you could do, you know, a Zoom study group. In fact, on our Discord, which you can find through our forum, there's always study groups going on, or you can create your own. You know, a study group to go through the book together. Yeah, and of course, you know, build stuff. And sometimes it's tricky to always be able to build stuff for work, because maybe you're not quite in the right area, or they're not quite ready to try out deep learning yet, but that's okay. You know, build some hobby projects, build some stuff just for fun, or build some stuff that you're passionate about. Yeah, so it's really important to not just put the videos away and go away and do something else, because you'll forget everything you've learned and you won't have practiced. So one of our community members went on to create an activation function, for example, which is Mish, which, as Tanishq has just reminded me on our forums, is now used in many of the state-of-the-art networks around the world, which is pretty cool. And he's now at Mila, I think, one of the top research labs in the world. I wonder how that's doing. Let's have a look. We've got Google Scholar. Nice, 486 citations. They're doing great. All right, let's have a look at how our AMA topic is going and pick out some of the highest ranked AMAs. Okay, so the first one is from Lucas, and actually maybe I should... Actually, let's switch our view here.
So our first AMA is from Lucas, and Lucas asks, how do you stay motivated? I often find myself overwhelmed in this field. There are so many new things coming up that I feel like I have to put so much energy just to keep my head above the waterline. Yeah, that's a very interesting question. I mean, I think, Lucas, the important thing is to realize you don't have to know everything. In fact, nobody knows everything, and that's okay. What people do is they take an interest in some area, and they follow that, and they try and do the best job they can of keeping up with some little sub-area. And if your little sub-area is too much to keep up on, pick a sub-sub-area. Yeah, there's no need for it to be demotivating that there's a lot of people doing a lot of interesting work in a lot of different sub-fields. That's cool. It used to be kind of dull, back when there were basically only five labs in the world working on neural nets. And yeah, from time to time, take a dip into other areas that maybe you're not following as closely. But when you're just starting out, you'll find that things are not changing that fast at all, really. They can kind of look that way because people are always putting out press releases about their new tweaks. But fundamentally, the stuff that is in the course now is not that different to what was in the course five years ago. The foundations haven't changed. And it's not that different, in fact, to the convolutional neural network that Yann LeCun used on MNIST back in 1996. The basic ideas I've described are forever: the way the inputs work, and the sandwiches of matrix multiplies and activation functions, and the stuff you do to the final layer. Everything else is tweaks. And the more you learn about those basic ideas, the more you'll recognize those tweaks as simple little tweaks that you'll be able to quickly get your head around.
So then Lucas goes on to ask, or to comment, another thing that constantly bothers me is I feel the field is getting more and more skewed towards bigger and more computationally expensive models and huge amounts of data. I keep wondering if in some years from now I would still be able to train reasonable models with a single GPU or if everything is going to require a compute cluster. Yeah, that's a great question. I get that a lot. But interestingly, I've been teaching people machine learning and data science stuff for nearly 30 years, and I've had a variation of this question throughout. And the reason is that engineers always want to push the envelope on the biggest computers they can find. That's just this fun thing engineers love to do. And by definition, they're going to get slightly better results than people doing exactly the same thing on smaller computers. So it always looks like, oh, you need big computers to be state of the art. But that's actually never true, right? Because there's always smarter ways to do things, not just bigger ways to do things. And so, you know, when you look at fast.ai's DAWNBench success, when we trained ImageNet faster than anybody had trained it before, on standard GPUs, me and a bunch of students, that was not meant to happen. Google was working very hard with their TPU introduction to try to show how good they were, and Intel was using like 256 PCs in parallel or something. But yeah, you know, we used common sense and smarts and showed what can be done. You know, it's also a case of picking the problems you solve. So I would probably not be going head to head up against Codex and trying to create code from English descriptions. You know, because that's a problem that does probably require very large neural nets and very large amounts of data. But if you pick areas in different domains, you know, there's still huge areas where much smaller models are still going to be state of the art.
So hopefully that helped answer your question. Let's see what else we've got here. So Daniel has obviously been following my journey with teaching my daughter math. So I homeschool my daughter, and Daniel asks, how do you homeschool young children, science in general and math in particular? Would you share your experiences by blogging or in lectures someday? Yeah, I could do that. So I actually spent quite a few months just working on my computer and quite a few months just reading research papers about education recently. So I do probably have a lot I need to talk about at some stage. But yeah, broadly speaking, I lean into using computers and tablets a lot more than most people, because actually there's an awful lot of really great apps that are super compelling. They're adaptive, so they go at the right speed for the student, and they're fun. And I really like my daughter to have fun. I really don't like to force her to do things. For example, there's a really cool app called DragonBox Algebra 5+, which teaches algebra to five year olds by using a really fun computer game involving helping dragon eggs to hatch. And it turns out that, yeah, the basic ideas of algebra are no more complex than the basic ideas that we do in other kindergarten math. And all the parents I know of who have given their kids DragonBox Algebra 5+, their kids have successfully learned algebra. So that would be an example. But yeah, we should talk about this more at some point. Let's see what else we've got here. So Farah says, the walkthroughs have been a game changer for me. The knowledge and tips you shared in those sessions are skills required to become an effective machine learning practitioner and utilize fastai more effectively. Have you considered making the walkthroughs a more formal part of the course, doing a separate software engineering course, or continuing live coding sessions between part one and two? So yes, I am going to keep doing live coding sessions.
At the moment we've switched those to focusing specifically on APL. And then in a couple of weeks, they're going to be going to fastai study groups. And then after that, they'll gradually turn back into more live coding sessions. But yeah, the thing I try to do in my live coding or study groups, whatever, is definitely try to show the foundational techniques that just make life easier as a coder or a data scientist. When I say foundational, I mean, yeah, the stuff which you can reuse again and again and again, like learning regular expressions really well, or knowing how to use a VM, or understanding how to use the terminal and command line, you know, all that kind of stuff. It never goes out of style, it never gets old. And yeah, I do plan to at some point hopefully actually do a course really all about that stuff specifically. But yeah, for now the best approach is to follow along with the live coding and stuff. Okay, WG pubs, which is Wade, asks, how do you turn a model into a business? Specifically, how does a coder with little or no startup experience turn an ML-based Gradio prototype into a legitimate business venture? Okay, I plan to do a course about this at some point as well. So, you know, obviously there isn't a two minute version of this, but the key thing with creating a legitimate business venture is to solve a legitimate problem, you know, a problem that people need solving and which they will pay you to solve. And so it's important not to start with your fun Gradio prototype as the basis of your business, but instead start with, here's a problem I want to solve. And generally speaking you should try to pick a problem that you understand better than most people. So it's either a problem that you face day to day in your work, or in some hobby or passion that you have, or that, you know, your club has, or your local school has, or your spouse deals with in their workplace.
You know, it's something where you understand that there's something that doesn't work as well as it ought to. Particularly something where you think to yourself, you know, if they just used deep learning here, or some algorithm here, or some better compute here, that problem would go away. And that's the start of a business. And so then my friend Eric Ries wrote a book called The Lean Startup where he describes what you do next, which is basically you fake it. You create, so he calls it the minimum viable product. You create something that solves that problem that takes you as little time as possible to create. It could be very manual. It can be loss making. It's fine. You know, even the bit in the middle where you're like, oh, there's going to be a neural net here. It's fine to launch without the neural net and do everything by hand. You're just trying to find out, are people going to pay for this and is this actually useful? And then once you have, you know, hopefully confirmed that the need is real and that people will pay for it and you can solve the need, you can gradually make it less and less of a fake, you know, and do more and more to get the product to where you want it to be. Okay. I don't know how to pronounce the name M-I-W-O-J-C. M-I-W-O-J-C says, Jeremy, can you share some of your productivity hacks? From the content you produce, it may seem you work 24 hours a day. Okay. I certainly don't do that. I think one of my main productivity hacks actually is not to work too hard. Or at least, no, not to work too hard. Not to work too much. I spend probably fewer hours a day working than most people, I would guess. But I think I do a couple of things differently when I'm working. One is I've spent at least half of every working day since I was about 18 learning or practicing something new. Could be a new language. Could be a new algorithm. Could be something I read about.
And nearly all of that time therefore I've been doing that thing more slowly than I would if I just used something I already knew. Which often drives my coworkers crazy because they're like, you know, why aren't you focusing on getting that thing done? But in that 50% of the time I'm constantly, you know, building up this kind of exponentially improving base of expertise in a wide range of areas. And so now I do find, you know, I can do things often orders of magnitude faster than people around me, or certainly many multiples faster than people around me, because I, you know, know tools and skills and ideas which other people don't necessarily know. So I think that's one thing that's been helpful. And then another is, yeah, trying to really not overdo things, like get good sleep and eat well and exercise well. And also I think it's a case of tenacity, you know, I've noticed a lot of people give up much earlier than I do. So, yeah, if you just keep going until something's actually finished then that's going to put you in a small minority, to be honest. Most people don't do that. And when I say finished, I mean finish something really nicely. And I try to make that easy. So I particularly like coding, and so I try to do a lot of coding related stuff. So I create things like nbdev, and nbdev makes it much, much easier for me to finish something nicely. So in my kind of chosen area I've spent quite a bit of time trying to make sure it's really easy for me to get out a blog post, get out a Python library, get out a notebook analysis, whatever. So yeah, trying to make these things I want to do easier, and so then I'll do them more.
So well, thank you everybody. That's been a lot of fun. Really appreciate you taking the time to go through this course with me. Yeah, if you enjoyed it, it would really help if you would give it a like on YouTube, because that goes into the YouTube recommendation system and really helps other people find the course.
And please do come and help other beginners on forums.fast.ai. A great way to learn yourself is to try to teach other people. And yeah, I hope you'll join us in part two. Thanks everybody very much. I've really enjoyed this process and I hope to get to meet more of you in person in the future. Bye.