I don't want to embarrass Rachel, but I'm very excited that she's here. This is Rachel, for those of you who don't know her. She's not quite back on her feet after her illness, but she's well enough to come to at least part of this lesson, so don't worry if she can't stay for the whole thing. I'm really glad she's here, because Rachel actually wrote the vast majority of the lesson we're going to see. I think it's really, really cool work, so I'm glad she's going to at least see it being taught, even if unfortunately she's not teaching it herself. Best Thanksgiving present.

As we discussed at the end of the last lesson, we're moving from decision tree ensembles to neural nets, broadly defined. Random forests and decision trees are limited by the fact that, in the end, they're basically doing nearest neighbors. All they can do is return the average of a bunch of other points, so they can't extrapolate. If you're asking what happens if I increase my prices by 20% and you've never priced at that level before, or what's going to happen to sales next year, and obviously we've never seen next year before, it's very hard to extrapolate. It's also hard because a tree can only make around log base 2 of N decisions. So if there's a time series it needs to fit, and it takes something like four steps to get to the right time area, then suddenly there aren't many decisions left for it to make. There's a limited amount of computation it can do, so there's a limited complexity of relationship that it can model.

Yes, Prince?

Can I ask about one more drawback of random forests? Say we have data with a categorical variable whose levels are not in sequential order. For a random forest we encode them and treat them as numbers.
Let's say the variable has a cardinality of 20, coded 1 to 20. The split the random forest gives is something like less than five, or less than six. But if the categories are not sequential, not in any order, what does that mean?

Yeah. Let's go back to bulldozers. Say we've got EROPS, EROPS with AC, OROPS, and, I don't know, whatever, and we arbitrarily label them like this. And say we actually know that all that really mattered was whether it had air conditioning. So what's going to happen? Well, it's basically going to say: if I group those together and those together, that's an interesting break, just because it so happens that the air conditioning ones all end up on the right-hand side. Having done that, it's then going to notice that within the group with the two and the three, it furthermore has to split it into two more groups. So eventually it's going to get there. It's going to pull out that category. It's just going to take more splits than we would ideally like. It's similar to the fact that for it to model a line, it can only do it with lots of splits, and only approximately.

So a random forest is fine with categories that are not sequential?

Yeah, it can do it. It's just suboptimal in some way, because we need more break points than we would have liked. But it gets there, and it does a pretty good job. So even though random forests do have some deficiencies, they're incredibly powerful, particularly because they have so few assumptions.
They're really hard to screw up. It's kind of hard to actually win a Kaggle competition with a random forest, but it's very easy to get into the top 10%. So in real life, where that third decimal place often doesn't matter, random forests are often what you end up using.

But for some things, like this Ecuadorian groceries competition, it's very, very hard to get a good result with a random forest. There's a huge time series component, and nearly everything comes down to these two massively high cardinality categorical variables, the store and the item, so there's very little there to even throw at a random forest. The difference between every pair of stores is different in different ways. So there are some things for which it's hard to get even relatively good results with a random forest.

Another example is recognizing numbers. You can get okay results with a random forest, but in the end the spatial structure turns out to be important. You want to be able to do computations like finding edges, which carry forward through the later computations. So just doing a kind of clever nearest neighbors, like a random forest, turns out not to be ideal.

For stuff like this, neural networks turn out to be ideal. Neural networks work particularly well both for things like the Ecuadorian groceries competition, so forecasting sales over time by store and by item, and for things like recognizing digits and turning speech into text. So it's kind of nice: between these two things, neural nets and random forests, we kind of cover the territory, right?
I haven't needed to use anything other than these two things for a very long time. And at some point, I don't know in what course exactly, we'll also learn how to combine the two, because you can combine them in really cool ways.

So here's a picture from Adam Geitgey of an image. An image is just a bunch of numbers, and each of those numbers is 0 to 255. The dark ones are close to 255 and the light ones are close to zero. Here is an example of a digit from the MNIST data set. MNIST is really old. It's like the hello world of neural networks. There are 28 by 28 pixels. If it was color, there would be three of these: one for red, one for green, one for blue. Our job is to look at the array of numbers and figure out that this is the number eight, which is tricky, right? How do we do that?

We're going to use a small number of fastai pieces, and we're gradually going to remove more and more of them, until by the end we'll have implemented our own neural network from scratch, our own training loop from scratch, and our own matrix multiplication from scratch. So we're gradually going to dig in further and further.

The data for MNIST, which is the name of this very famous data set, is available from here. We have a thing in fastai.io called get_data, which will grab it from a URL and store it on your computer, unless it's already there, in which case it'll just go ahead and use it. Then we've got a little function here called load_mnist, which simply loads it up. You'll see that it's zipped, so we can just use Python's gzip to open it up, and then it's also pickled. If you have any kind of Python object at all, you can use this built-in Python library called pickle to dump it out onto your disk, share it around, and load it up later, and you get back the same Python object you started with. You've already seen something like this with
the pandas feather format. Pickle is not just for pandas. It's not just for anything in particular. It works for nearly every Python object. Which might lead to the question: why didn't we use pickle for a pandas data frame? The answer is that pickle works for nearly every Python object, but it's probably not optimal for nearly any Python object. Because we were looking at pandas data frames with over a hundred million rows, we really wanted to save that quickly, and feather is a format that's specifically designed for that purpose, so it does that really fast. If we had tried to pickle it, it would have taken a lot longer. Also note that pickle files are only for Python, so you can't give them to somebody working in another language, whereas a feather file you can hand around. It's worth knowing that pickle exists, because if you've got some dictionary or some kind of object floating around that you want to save for later or send to somebody else, you can always just pickle it.

In this particular case, the folks at deeplearning.net were kind enough to provide a pickled version. Pickle has changed slightly over time, and this is a Python 2 pickle file, so you have to tell it that it was encoded using this particular Python 2 character set. But other than that, Python 2 and 3 can normally open each other's pickle files.

Once we've loaded that in, we load it like so. This thing we're doing here is called destructuring. Destructuring means that load_mnist is giving us back a tuple of tuples, and if we have a tuple of tuples on the left-hand side of the equals sign, we can fill all these things in. We're given back a tuple of training data, a tuple of validation data, and a tuple of test data. In this case, I don't care about the test data.
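A minimal sketch of the gzip plus pickle round trip and the destructuring described above. The toy data, the file name, and the helper name `load_pickled` are made up for illustration, and this is not the notebook's `load_mnist`. The `encoding="latin-1"` argument is one commonly used setting for Python 2 pickles that contain NumPy arrays; the lecture does not say which character set was used, so treat it as an assumption.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

rng = np.random.default_rng(0)


def make_split(n):
    """Build a fake (images, labels) pair with MNIST-like shapes."""
    images = rng.random((n, 784), dtype=np.float32)
    labels = rng.integers(0, 10, size=n)
    return images, labels


# Hypothetical stand-in for the MNIST file: (train, valid, test), each an (x, y) tuple.
data = (make_split(50), make_split(10), make_split(10))

path = os.path.join(tempfile.mkdtemp(), "toy_mnist.pkl.gz")
with gzip.open(path, "wb") as f:
    pickle.dump(data, f)


def load_pickled(filename):
    """Open a gzipped pickle and return the stored Python object."""
    with gzip.open(filename, "rb") as f:
        # The encoding argument only matters for pickles written by Python 2.
        return pickle.load(f, encoding="latin-1")


# Destructuring: the nested tuple on the left mirrors the nested tuple returned.
# The test split is assigned to "_" because it is not needed here.
((x, y), (x_valid, y_valid), _) = load_pickled(path)

print(type(x), x.shape, y.shape)
```

If the shape on the left does not match the shape of what comes back (say, two names for a tuple of three), Python raises a `ValueError`, which makes this a handy sanity check as well as a convenience.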
So I just put it into a variable called underscore. Python people tend to think of underscore as a special variable that we put things into when we're going to throw them away. It's actually not special, it's just really common. If you see something assigned to underscore, it probably means you're just throwing it away. By the way, in a Jupyter notebook it does have a special meaning, which is that the result of the last cell you calculated is always available in underscore, but that's a separate issue.

Then the first thing in that tuple is itself a tuple, so we're going to stick that into x and y for our training data, and the second one goes into x and y for our validation data. So that's called destructuring, and it's pretty common in lots of languages. Some languages don't support it, but in those that do, life becomes a lot easier.

As soon as I look at some new data set, I just check out what I've got. What's its type? Okay, it's a NumPy array. What's its shape? It's 50,000 by 784. And what about the dependent variable? That's an array, and its shape is 50,000.

This image is not of length 784. It's of size 28 by 28. So what happened here? Well, we could guess, and we can check on the website, and it turns out we would be right: all they did was take the second row and concatenate it to the first row, then the third row and concatenate it to that, then the fourth row and concatenate it to that. In other words, they took this whole 28 by 28 and flattened it out into a single 1D array. So it's going to be of size 28 squared. This is not normal by any means, so don't think everything you see is going to be like this. Most of the time when people share images, they share them as JPEGs or PNGs. You load them up.
You get back a nice 2D array. But in this particular case, for whatever reason, the thing that they pickled was flattened out to be 784.

This word flatten is very common when working with tensors. When you flatten a tensor, it just means that you're turning it into a lower rank tensor than you started with. In this case, we started with a rank 2 tensor, a matrix, for each image, and we turned each one into a rank 1 tensor, i.e. a vector. So overall the whole thing is a rank 2 tensor rather than a rank 3 tensor.

Just to remind us of the jargon here: in math we would call this a vector, and in computer science we would call it a 1D array. But because deep learning people have to come across as smarter than everybody else, we have to call this a rank 1 tensor. They all mean the same thing, more or less, unless you're a physicist, in which case this means something else and you get very angry at the deep learning people because you say it's not a tensor. So there you go. Don't blame me. This is just what people say.

This is either a matrix, or a 2D array, or a rank 2 tensor. Once we start to get into three dimensions, we start to run out of mathematical names, which is why it starts to be nice just to say rank 3 tensor. There's actually nothing special about vectors and matrices that makes them in any way more important than rank 3 tensors or rank 4 tensors or whatever.
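The flattening and rank vocabulary above can be checked directly in NumPy. This is a small sketch on a made-up 28 by 28 array rather than real MNIST data; `ndim` is NumPy's name for what the lecture calls rank.

```python
import numpy as np

# A made-up 28x28 "image": a rank 2 tensor (matrix, 2D array).
image = np.arange(28 * 28).reshape(28, 28)

# Flattening concatenates row 0, then row 1, and so on into a rank 1 tensor.
flat = image.reshape(-1)

print(image.ndim, image.shape)
print(flat.ndim, flat.shape)

# With row-by-row flattening, pixel (row r, column c) lands at index r * 28 + c.
r, c = 3, 5
assert flat[r * 28 + c] == image[r, c]

# A stack of images is a rank 3 tensor; flattening each image gives a rank 2 tensor.
batch = np.stack([image, image + 1, image + 2])
flat_batch = batch.reshape(len(batch), -1)
print(batch.ndim, batch.shape)
print(flat_batch.ndim, flat_batch.shape)
```

Reshaping `flat_batch` back with `reshape(-1, 28, 28)` recovers the original stack, which is the reverse of what was done to the pickled data set.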
So I try not to use the terms vector and matrix where possible, because I don't really think they're any more special than any other rank of tensor. It's good to get used to thinking of this as a rank 2 tensor.

Then the rows and columns. If we're computer science people, we would call these dimension zero and dimension one. But if we're deep learning people, we would call them axis zero and axis one. And then, just to be really confusing, if you're an image person, this is the first axis and this is the second axis. Think about TVs: 1920 by 1080 is columns by rows. Everybody else, including deep learning people and mathematicians, uses rows by columns. So this is pretty confusing. If you use the Python Imaging Library, you get columns by rows. Pretty much everything else is rows by columns. So be careful.

Because they hate us, because they're bad people, I guess. I mean, particularly in deep learning, a whole lot of different areas have come together, like information theory, computer vision, statistics, and signal processing, and you've ended up with this hodgepodge of nomenclature. In deep learning, often every version of things will be used.
So today we're going to hear about something that's called either negative log likelihood, or binomial or categorical cross entropy, depending on where you come from. We've already seen something that's called either one hot encoding or dummy variables, depending on where you come from. Really, the same concept gets somewhat independently invented in different fields, and eventually they find their way to machine learning, and then we don't know what to call them, so we call them all of the above. I think that's what's happened with computer vision rows and columns.

So there's this idea of normalizing data, which is subtracting out the mean and dividing by the standard deviation. A question for you: often it's important to normalize the data so that we can more easily train a model. Do you think it would be important to normalize the independent variables for a random forest, if we're training a random forest?

I'll be honest, I don't know why we don't need to normalize. I just know that we don't.

We don't, okay. Does anybody want to think about why? Kara?

It wouldn't matter, because each scaling and transformation we do will be applied to each row, and we will be computing means, local averages, as we were doing. At the end we will of course want to denormalize it back, so it wouldn't change the result.

I'm talking about the independent variables, not the dependent variable.

Oh, I thought you asked about the dependent variable.

Okay. Anyone want to have a go? Matthew?

It might be because we just care about the relationship between the independent variables and the dependent variable, so scale doesn't matter?
Okay, go on. Why? Why do we only care about that?

Because at each split point we can just divide to see, regardless of what scale you're on, what minimizes variance.

Right. So really the key is that when we're deciding where to split, all that matters is the order. All that matters is how they're sorted. If we subtract the mean and divide by the standard deviation, they're still sorted in the same order. Remember, when we implemented the random forest, we said sort them, and then we completely ignored the values. We just said, now add on one thing from the dependent variable at a time.

So random forests only care about the sort order of the independent variables. They don't care at all about their size, and that's why they're wonderfully immune to outliers: they totally ignore the fact that it's an outlier. They only care about which one is higher than which other one.

This is an important concept. It doesn't just appear in random forests. It occurs in some metrics as well, for example area under the ROC curve. You'll come across that a lot. Area under the ROC curve completely ignores scale and only cares about sort order. We saw something else like this when we did the dendrogram: Spearman's correlation is a rank correlation, so it only cares about order, not about scale. So one of the many wonderful things about random forests is that we can completely ignore a lot of these statistical distribution issues. But we can't for deep learning.
With deep learning, we're trying to train a parameterized model, so we do need to normalize our data. If we don't, it's going to be much harder to create a network that trains effectively. So we grab the mean and the standard deviation of our training data, subtract out the mean, and divide by the standard deviation, and that gives us a mean of zero and a standard deviation of one.

Now, for our validation data, we need to use the standard deviation and mean from the training data. We have to normalize it the same way. It's just like categorical variables, where we had to make sure they had the same indexes mapped to the same levels for a random forest, or missing values, where we had to make sure we used the same median when replacing them. You need to make sure that anything you do in the training set, you do exactly the same thing in the test and validation sets. So here I'm subtracting out the training set mean and dividing by the training set standard deviation. So this is not exactly zero, and this is not exactly one, but it's pretty close. In general, if you try something on a validation set or a test set and it's much, much worse than your training set, it's probably because you normalized in an inconsistent way, or encoded categories in an inconsistent way, or something like that.

All right, so let's take a look at some of this data. We've got 10,000 images in the validation set, and each one is a rank 1 tensor of length 784. In order to display it, I want to turn it into a rank 2 tensor of 28 by 28. NumPy has a reshape function that takes a tensor in and reshapes it to whatever size tensor you request. Now, if you think about it, if there are D axes, you only need to tell it about D minus one of the axes you want, because the last one it can figure out for itself. In total there are 10,000 by 784 numbers here all together, right?
So if you say, well, I want my last axes to be 28 by 28, then it can figure out that the first one must be 10,000, otherwise it's not going to fit. Does that make sense? If you put minus one, it says: make it as big or as small as you have to, to make it fit. And you can see here it figured out that it has to be 10,000.

You'll see this used in neural net software, pre-processing, and stuff like that all the time. I could have written 10,000 here, but I try to get into the habit that any time I'm referring to how many items are in my input, I use minus one, because it means that later on I could use a subsample and this code wouldn't break. I could do some kind of stratified sampling if it was unbalanced and this code wouldn't break. Using minus one for the size just makes the code more resilient to changes later. It's a good habit to get into.

This idea of being able to take tensors and reshape them, change axes around, and stuff like that is something you need to be able to do totally without thinking, because it's going to happen all the time. For example: I tried to read in some images, they were flattened, and I need to unflatten them into a bunch of matrices. Okay, reshape. I read some images in with OpenCV, and it turns out OpenCV orders the channels blue, green, red, while everything else expects them to be red, green, blue. I need to reverse the last axis. How do you do that? I read in some images with the Python Imaging Library. It reads them as rows by columns by channels. PyTorch expects channels by rows by columns. How do I
transform that? These are all things you need to be able to do without thinking, straight away, because it happens all the time and you never want to be sitting there thinking about it for ages. So make sure you spend a lot of time over the week just practicing with all the stuff we're going to see today: reshaping, slicing, reordering dimensions, stuff like that. The best way is to create some small tensors yourself and start experimenting.

Can we pass that over there?

Do you mind if I backtrack a little bit?

Of course, I love it.

So back in normalize, you might have gone over this, but I'm still wrestling with it a little bit. Many machine learning algorithms behave better when the data is normalized, but you also just said that scale doesn't really matter.

I said it doesn't matter for random forests. Random forests just split things based on order, and we love them for the way they're so immune to worrying about distributional assumptions. But we're not doing random forests. We're doing deep learning, and deep learning does care. Can you pass it over there?

So if we have a parametric model then we should scale, and if we have a non-parametric model then we should not have to scale?

No, not quite, because k nearest neighbors is non-parametric and scale matters a hell of a lot. I would say things involving trees are generally just going to split at a point, so you probably don't care about scale. But you probably just need to think: is this an algorithm that uses order, or does it use specific numbers?

Can you please give us an intuition of why it needs scale, just because that would clarify some of the issues?

Not until we get to doing SGD. We're going to get to that. So for now, we're just going to say take my word for it. Can you pass it to Daniel?
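To make the "order versus specific numbers" point concrete, here is a hedged sketch on made-up data. It checks that subtracting the mean and dividing by the standard deviation leaves the sort order of a feature unchanged, and that a simple variance-based split search therefore picks the same partition before and after normalizing. The function `best_split_mask` is a simplified, hypothetical stand-in for a tree's split search, not the implementation from the earlier lesson.

```python
import numpy as np

rng = np.random.default_rng(42)

# A made-up, heavily skewed feature and a noisy target that depends on it.
x = rng.exponential(scale=1000.0, size=200)
y = (x > 800).astype(float) + rng.normal(scale=0.1, size=200)

# Normalize the independent variable.
x_norm = (x - x.mean()) / x.std()


def best_split_mask(feature, target):
    """Try every split of the sorted feature; return which rows go left
    for the split that minimizes the size-weighted variance of the target."""
    order = np.argsort(feature)
    xs, ys = feature[order], target[order]
    best_score, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        score = left.var() * len(left) + right.var() * len(right)
        if score < best_score:
            best_score = score
            best_threshold = (xs[i - 1] + xs[i]) / 2
    return feature <= best_threshold


# Same ordering, so the split search sees the same sequence of targets.
same_order = np.array_equal(np.argsort(x), np.argsort(x_norm))
same_split = np.array_equal(best_split_mask(x, y), best_split_mask(x_norm, y))
print(same_order, same_split)
```

The threshold value itself differs between the two runs, since it is expressed in different units, but the rows that fall on each side are the same, which is all the tree uses.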
So this is probably a dumb question, but can you explain a little bit more what you mean by scale? Because I guess when I think of scale, I'm like, oh, all the numbers should be generally the same size.

That's exactly what we mean.

But is that the case with the cats and dogs that we went over in the deep learning course? You could have a small cat and a larger cat, but it would still know that those were both cats.

Oh, I guess this is one of those problems where language gets overloaded. In computer vision, when we scale an image, we're actually increasing the size of the cat. In this case, we're scaling the actual pixel values. In both cases, scaling means making something bigger or smaller. Here we're taking numbers from 0 to 255 and making them so that they have an average of 0 and a standard deviation of 1.

Jeremy, could you please explain: is it by column, by row?

By pixel.

I mean in general, when you're scaling. I'm not thinking about a picture, but about any kind of input to machine learning.

Yeah, sure. It's a little bit subtle, but in this case I've just got a single mean and a single standard deviation. So it's basically, on average, how much black is there? We have a mean and a standard deviation across all the pixels. In computer vision, we would normally do it by channel, so we would normally have one number for red, one number for green, and one number for blue. In general, you need a different set of normalization coefficients for each thing you would expect to behave differently. If we were doing a structured data set where we've got income, distance in kilometers, and number of children, you'd need three separate normalization coefficients for those, because they're very different kinds of things. So, yeah, it's a bit domain specific.
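A small sketch of the two normalization schemes just described, on made-up data: a single mean and standard deviation for grayscale pixels (computed on the training set and reused for the validation set), and one mean and standard deviation per column for a structured table. All array names and values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake grayscale pixel data in the range 0..255.
x_train = rng.integers(0, 256, size=(1000, 784)).astype(np.float64)
x_valid = rng.integers(0, 256, size=(200, 784)).astype(np.float64)

# One mean and one standard deviation across all training pixels.
mean, std = x_train.mean(), x_train.std()

x_train_n = (x_train - mean) / std
# The validation set is normalized with the TRAINING statistics.
x_valid_n = (x_valid - mean) / std

print(x_train_n.mean(), x_train_n.std())  # essentially 0 and 1
print(x_valid_n.mean(), x_valid_n.std())  # close to 0 and 1, but not exact

# Structured data: columns that behave differently get their own coefficients.
income = rng.normal(50_000, 15_000, size=500)
distance_km = rng.exponential(12.0, size=500)
children = rng.integers(0, 5, size=500).astype(np.float64)
table = np.column_stack([income, distance_km, children])

col_mean = table.mean(axis=0)  # shape (3,): one mean per column
col_std = table.std(axis=0)    # shape (3,): one std per column
table_n = (table - col_mean) / col_std  # broadcasting applies them column-wise
```

For color images, the same per-column idea would apply per channel, with one mean and one standard deviation each for red, green, and blue.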
In this case, all of the pixels are levels of gray, so we've just got a single scaling number, whereas you could imagine that if they were red versus green versus blue, you might need to scale those channels in different ways.

So I'm having a bit of trouble imagining what would happen if we don't normalize in this case.

We'll get there. This is kind of what Yannet was saying: why do we normalize? For now, we're normalizing because I say we have to. When we get to looking at stochastic gradient descent, we'll discover why. To skip ahead a little bit: we're going to be doing a matrix multiply by a bunch of weights, and we're going to pick those weights in such a way that when we do the matrix multiply, we try to keep the numbers at the same scale that they started out at. That basically requires us to know what the scale of the initial numbers is. So it's much easier to create a single neural network architecture that works for lots of different kinds of inputs if we know that they're consistently going to be mean zero, standard deviation one. That would be the short answer. But we'll learn a lot more about it, and if in a couple of lessons you're still not quite sure why, let's come back to it, because it's a really interesting thing to talk about.

Yes?

I'm just trying to visualize the axes we're working with here. With x_valid.shape we get 10,000 by 784. Does that mean that we brought in 10,000 pictures of that dimension?

Exactly.

Okay. And then in the next line, when you choose to reshape it, is there a reason you put 28, 28 as the Y and Z coordinates? Is there a reason why they're in that order?

Yeah, there is. Pretty much all neural network libraries assume that the first axis is kind of the equivalent of a row. It's a separate thing.
It's a sentence, or an image, or an example of sales, or whatever. So I want each image to be a separate item of the first axis, and that leaves two more axes for the rows and columns of the images.

And that's pretty standard?

That's totally standard. I don't think I've ever seen a library that doesn't work that way. Can you pass it on?

While normalizing the validation data, I saw you used the mean and standard deviation of the training data only.

Yes.

So shouldn't we use the mean and standard deviation of the validation data?

You mean like joining them together, or?

No, separately calculating the mean.

No, because then you would be normalizing the validation set using different numbers. So now a pixel with a value of three in the validation set has a different meaning to a value of three in the training set. It would be like having days of the week encoded such that Monday was a one in the training set and a zero in the validation set. We've now got two different sets where the same number has a different meaning.

Let me give you an example. Let's say we were doing full color images, and our training set contained green frogs, green snakes, and gray elephants. We're trying to figure out which is which, and we normalize using each channel's mean. Then we have a validation set and a test set which are just green frogs and green snakes. If we were to normalize by the validation set's statistics, we would end up saying things on average are green, so we would remove all the greenness, and we would now fail to recognize the green frogs and the green snakes effectively. So we actually want to use the same normalization coefficients that we were training on. And for those of you doing the deep learning class, we actually go further than that: when we use a pre-trained network, we have to use
the same normalization coefficients that the original authors trained on. The idea is that a number needs to have a consistent meaning across every data set where you use it. Can you pass it on?

So that means when you're looking at the test set, you normalize the test set based on this mean?

That's right.

Okay. So the validation y values are just a rank 1 tensor of 10,000. Remember, this is a kind of weird Python thing where a tuple with just one thing in it needs a trailing comma. So this is a rank 1 tensor of length 10,000, and here's an example of something from it. It's just the number three. So those are our labels.

Here's another thing you need to be able to do in your sleep: slicing into a tensor. In this case, we're slicing into the first axis with zero, which means we're grabbing the first slice. Because this is a single number, it's going to reduce the rank of the tensor by one. It's going to turn it from a three-dimensional tensor into a two-dimensional tensor. You can see here that this is now just a matrix. Then we're going to grab rows 10 through 14 inclusive and columns 10 through 14 inclusive, and here it is.

This is the kind of thing you need to be super comfortable with: grabbing pieces out, looking at the numbers, and looking at the picture. So here's an example of a little piece of that first image. You want to get used to the idea that if you're working with something like pictures or audio, this is something your brain is really good at interpreting, so keep showing pictures of what you're doing whenever you can. But also remember that behind the scenes they're numbers. If something's going weird, print out a few of the actual numbers. You might find that somehow some of them have become infinity, or they're all zero, or whatever, right?
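The reshape with minus one, the one-element tuple, and the slicing described above can be tried on a tiny made-up array. This sketch assumes five fake flattened images rather than the real validation set, and reuses the name `x_imgs` purely for illustration.

```python
import numpy as np

# Five fake flattened "images", standing in for the 10,000 validation images.
x_valid = np.arange(5 * 784, dtype=np.float32).reshape(5, 784)
y_valid = np.array([3, 8, 6, 9, 6])

# -1 means "work out this axis from the total size", here 5.
x_imgs = np.reshape(x_valid, (-1, 28, 28))
print(x_imgs.shape)   # (5, 28, 28)

# A rank 1 tensor's shape is a one-element tuple, printed with a trailing comma.
print(y_valid.shape)  # (5,)

# Indexing axis 0 with a single integer drops that axis: rank 3 becomes rank 2.
first = x_imgs[0]
print(first.ndim, first.shape)

# Rows 10..14 and columns 10..14 inclusive; the slice end (15) is exclusive.
patch = x_imgs[0, 10:15, 10:15]
print(patch.shape)    # (5, 5)

# A quick numeric sanity check of the kind mentioned above.
print(np.isfinite(x_imgs).all(), (x_imgs == 0).all())
```

Because the first axis is written as minus one, the same reshape line keeps working if `x_valid` is replaced by a subsample of a different length.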
So use this interactive environment to explore the data as you go. Did you have a question? Where's the box?

Just a quick, I guess, semantic question. Why, when it's a tensor of rank 3, is it stored as X, Y, Z? To me it would make more sense to store it as a list of 2D tensors.

It's stored as either, right? Let's look at this as a 3D tensor. Okay, so here's a 3D tensor. A 3D tensor is formatted as, basically, a list of 2D tensors.

But when you're extracting it, if you're extracting the first one, why isn't it x_imgs, square bracket, zero, close square bracket, and then a second set of square brackets?

Because that has a different meaning. It's the difference between tensors and jagged arrays. If you write something like that, it says: take the second list item, and from it grab the third list item. We tend to use that when we have something called a jagged array, which is where each sub-array may be of a different length. Whereas here we have a single object of three dimensions, and we're trying to say which little piece of it we want. So the idea is that this is a single slice object that goes in and grabs that piece out.

Okay. So here's an example of a few of those images along with their labels. This kind of stuff you want to be able to do pretty quickly with matplotlib. It's going to help you a lot in life and in your exam. You can have a look at what Rachel wrote here when she wrote plots. We can use add_subplot to create those little separate plots, and you need to know that imshow is how we take a NumPy array and draw it as a picture. Then we've also added the title on top. So there it is.
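The distinction drawn in the question above, between indexing one rank 3 tensor and indexing into nested lists, can be seen in a few lines. This is an illustrative sketch with made-up values; the variable names are hypothetical.

```python
import numpy as np

# One regular rank 3 tensor: every "row" has the same length.
t = np.arange(24).reshape(2, 3, 4)

# A single index expression picks a piece out of the one object.
a = t[1, 2, 3]
# Chained brackets index step by step: t[1] is a rank 2 tensor, t[1][2] a rank 1.
b = t[1][2][3]
print(a, b)  # same element either way for a regular tensor

# A jagged array: sub-lists of different lengths, so it cannot be one tensor.
jagged = [[1, 2, 3], [4, 5], [6]]
print(jagged[1][0])  # second list item, then its first item

# Plain Python lists do not accept the comma form used for tensors.
try:
    jagged[1, 0]
    comma_index_failed = False
except TypeError:
    comma_index_failed = True
print(comma_index_failed)
```

With a tensor, the comma form also allows slicing several axes at once, as in `t[0, 1:3, :2]`, which the chained form cannot express in one step.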
All right, so let's now take that data and try to build a neural network with it. And sorry, this is going to be a lot of review for those of you already doing deep learning. A neural network is just a particular mathematical function, or a class of mathematical functions, but it's a really important class because it's covered by what's called the universal approximation theorem, which means that a neural network can approximate any other function arbitrarily closely. So in other words, in theory it can do anything, as long as we make it big enough. So this is very different to a function like 3x plus 5, which can only do one thing; it's a specific function. Or the class of functions ax plus b, which can only represent lines of different slopes, moved up and down different amounts. Or even the function ax squared plus bx plus c plus sine d, which again can only represent a very specific subset of relationships. The neural network, however, is a function that can represent any other function to arbitrarily close accuracy. All right. So what we're going to do is learn how to take a function, let's take ax plus b, and we're going to learn how to find its parameters, in this case a and b, which allow it to fit as closely as possible to a set of data. And so this here is showing an example from a notebook that we'll be looking at in the deep learning course, which basically shows what happens when we use something called stochastic gradient descent to try and set a and b. And basically what happens is we're going to pick a random a to start with and a random b to start with, and then we're going to basically figure out: do I need to increase or decrease a to make the line closer to the dots?
Do I need to increase or decrease b to make the line closer to the dots? And then just keep increasing and decreasing a and b lots and lots of times. Okay, so that's what we're going to do, and to answer the question, do I need to increase or decrease a and b, we're going to take the derivative. So the derivative of the function with respect to a and b tells us how that function will change as we change a and b. All right, so that's basically what we're going to do, but we're not going to start with just a line. The idea is we're going to build up to actually having a neural net, and so it's going to be exactly the same idea, but because it's an infinitely flexible function, we're going to be able to use this exact same technique to fit arbitrarily complex relationships. That's basically the idea. So then what you need to know is that a neural net is actually a very simple thing. A neural net is something which takes as input, let's say, a vector, and does a matrix product by that vector. Let's draw this properly. So if this is size R, and this is R by C, a matrix product will spit out something of size C. All right, and then we do something called a nonlinearity, which is basically that we're going to throw away all the negative values, so it's basically max(0, x). And then we're going to put that through another matrix multiply, and then we're going to put that through another max(0, x), and we're going to put that through another matrix multiply, and so on, until eventually we end up with the single vector that we want. So in other words, at each stage of our neural network the key thing going on is a matrix multiply, so in other words a linear function. So basically in deep learning, most of the calculation is lots and lots of linear functions, but between each one we're going to replace the negative numbers with zeros.
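The line-fitting loop described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the notebook from the deep learning course: the data is synthetic (scattered around 3x + 8), the thing being differentiated is the mean squared error between the line and the dots, the gradients are written out by hand, and it uses the whole data set at each step rather than random mini-batches, so strictly it is plain gradient descent rather than the stochastic version.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(200)
y = 3.0 * x + 8.0 + 0.1 * rng.standard_normal(200)  # dots scattered around 3x + 8

a, b = rng.standard_normal(2)  # random starting guesses for the parameters
lr = 0.1                       # learning rate: how far to move on each step

def mse(a, b):
    """Mean squared distance between the line a*x + b and the dots."""
    return np.mean((a * x + b - y) ** 2)

start_loss = mse(a, b)
for _ in range(2000):
    err = a * x + b - y
    grad_a = 2 * np.mean(err * x)  # derivative of the MSE with respect to a
    grad_b = 2 * np.mean(err)      # derivative of the MSE with respect to b
    a -= lr * grad_a               # positive gradient: decrease; negative: increase
    b -= lr * grad_b
end_loss = mse(a, b)

print(a, b, start_loss, end_loss)
```

The sign of each derivative answers "increase or decrease?", and its size says by how much.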
Can you pass it? Yes. So why are we throwing away the negative numbers as we go through this? Well, we'll see, right? The short answer is, if you apply a linear function to a linear function to a linear function, it's still just a linear function, so it's totally useless. But if you throw away the negatives, that's actually a nonlinear transformation. And so it turns out that if you apply a linear function, throw away the negatives, and apply another linear function to that, that creates a neural network, and it turns out that's a thing that can approximate any other function arbitrarily closely. So this tiny little difference actually makes all the difference. And if you're interested in it, check out the deep learning video where we cover this, because I actually show a nice visual intuitive proof, not something that I created but something that Michael Nielsen created. Or if you want to skip straight to his website, you could go to Michael Nielsen universal... I think I spelled his name wrong, never mind... approximation theorem. There we go: Neural Networks and Deep Learning, chapter 4, and he's got a really nice walkthrough, basically with lots of animations, where you can see why this works.
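Here is a tiny worked example of that point, with made-up numbers chosen so the arithmetic is easy to check by hand. Two matrix multiplies in a row give exactly the same answer as one multiply by the combined matrix, while putting max(0, x) in between gives a different answer.

```python
import numpy as np

def relu(v):
    """Throw away the negatives: max(0, x) applied to every element."""
    return np.maximum(0, v)

x = np.array([1.0, -2.0, 0.5])   # input vector of size 3
W1 = np.array([[1.0, 2.0],
               [3.0, -1.0],
               [2.0, 4.0]])      # 3 x 2 matrix: size 3 in, size 2 out
W2 = np.array([[1.0, 0.0],
               [2.0, 1.0]])      # 2 x 2 matrix

hidden = x @ W1                  # [-4.  6.]
two_linear = hidden @ W2         # [8.  6.]
one_linear = x @ (W1 @ W2)       # [8.  6.]  same thing: still one linear function
with_relu = relu(hidden) @ W2    # [12.  6.]  the -4 became 0 before the second multiply

print(two_linear, one_linear, with_relu)
```

No single matrix reproduces the last line for every input, which is what makes the stacked version more expressive than a single linear layer.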
I feel like the hardest thing with getting started with technical writing on the internet is just posting your first thing. So if you do a search for Rachel Thomas Medium blog, you'll find this (we'll put it on the lesson wiki), where she actually says the top advice she would give to her younger self would be to start blogging sooner. And she has both reasons why you should do it, some examples of places where she's blogged and it's turned out to be great for her and her career, and then some tips about how to get started. I remember when I first suggested to Rachel she might think about blogging, because she had so much interesting to say, at first she was kind of surprised at the idea that she could blog. And now people come up to us at conferences and they're like, you're Rachel Thomas, I love your writing. So I've kind of seen that transition from, wow, could I blog, to being known as a strong technical author. So yeah, check out this article if you still need convincing or if you're wondering how to get started. And since the first one is the hardest, maybe your first one should be something really easy for you to write. So it could be like, here's a summary of the first 15 minutes of lesson three of our machine learning course, here's why it's interesting, here's what we learned. Or it could be like, here's a summary of how I used a random forest to solve a particular problem in my practicum. I often get questions like: on my practicum, my organization, we've got sensitive commercial data. That's fine.
Just find another data set and do it on that instead to show the example, or anonymize all of the values and change the names of the variables, or whatever. You can talk to your employer or your practicum partner to make sure that they're comfortable with whatever it is you're writing. In general, though, people love it when their interns and staff blog about what they're working on, because it makes them look super cool. It's like, hey, I'm an intern working at this company and I wrote this post about this cool analysis I did, and then other people would be like, wow, that looks like a great company to work for. So generally speaking, you should find people are pretty supportive. Besides which, there are lots and lots of data sets out there available, so even if you can't base it on the work you're doing, you can find something similar for sure. All right, so we're going to start building our neural network. We're going to build it using something called PyTorch. PyTorch is a library that basically looks a lot like NumPy, but when you create some code with PyTorch, you can run it on the GPU rather than the CPU. The GPU is something which is basically going to be at least an order of magnitude, possibly hundreds of times, faster than the code that you might write for the CPU, particularly for stuff involving lots of linear algebra, right? So with deep learning and neural nets, if you don't have a GPU you can do it on the CPU, but it's going to be frustratingly slow. Your Mac does not have a GPU that we can use for this, because we need an Nvidia GPU. I would actually much prefer that we could use your Macs, because competition's great, right? But Nvidia were really the first ones to create a GPU which did a good job of supporting general purpose GPU programming, GPGPU.
So in other words, that means using a GPU for things other than playing computer games. They created a framework called CUDA. It's a very good framework, and it's pretty much universally used in deep learning. If you don't have an Nvidia GPU, you can't use it. No current Macs have an Nvidia GPU. Most laptops of any kind don't have an Nvidia GPU. If you're interested in doing deep learning on your laptop, the good news is that you need to buy one which is really good for playing computer games on. There's a place called Exotic PC, gaming laptops, where you can go and buy yourself a great laptop for doing deep learning. You can tell your parents that you need the money to do deep learning. So could you please... Yeah, so you'll generally find a whole bunch of laptops with names like Predator and Viper, with pictures of robots and stuff. Stealth Pro, Raider, Leopard. Anyway, having said that, I don't know that many people that do much deep learning on their laptop. Most people will log into a cloud environment. By far the easiest I know of to use is called Crestle. With Crestle you can basically sign up, and straight away the first thing you get is being thrown straight into a Jupyter notebook backed by a GPU, which costs 60 cents an hour, with all of the fastai libraries and data already available. So that makes life really easy. It's less flexible and in some ways less fast than using AWS, which is the Amazon Web Services option. That costs a little bit more, 90 cents an hour rather than 60 cents an hour, but it's very likely that your employer is already using it, so it's good to get to know anyway. They've got more choices around GPUs, and it's a good choice. If you Google for the GitHub student pack, if you're a student, you can get a hundred and fifty dollars of credits straight away, pretty much, and so that's a really good way to get started. Daniel, did you have a question?
I just wanted to know your opinion on this: I know that Intel recently published an open source way of boosting regular packages that they claim is equivalent, like if you use their boost packages on your CPU you can get the same performance as the bottom tier GPU. Do you know anything about that? Yeah, I do. It's a good question. So Intel makes some great numerical programming libraries, particularly this one called MKL, the Math Kernel Library. They definitely make things faster than not using those libraries. But if you look at a graph of performance over time, GPUs have consistently, throughout the last ten years including now, delivered about ten times more floating point operations per second than the equivalent CPU, and they're generally about a fifth of the price for that performance. And then because of that, everybody doing anything with deep learning basically does it on Nvidia GPUs, and therefore using anything other than Nvidia GPUs is currently very annoying. So: slower, more expensive, more annoying. I really hope there will be more activity around AMD GPUs in particular in this area, but AMD's got literally years of catching up to do, so it might take a while. Yeah, so I just wanted to point out that you can also buy things such as a GPU extender for a laptop. That's also maybe a first step solution, if you really want to put something on your laptop. Yeah, I think for like 300 bucks or so you can buy something that plugs into your Thunderbolt port if you have a Mac, and then for another five or six hundred bucks you can buy a GPU to plug into that. Having said that, for about a thousand bucks
you can actually create a pretty good GPU-based desktop. And so if you're considering that, the fastai forums have lots of threads where people help each other spec out something at a particular price point. Anyway, so to start with I'd say use Crestle, and then when you're ready to invest a few extra minutes getting going, use AWS. To use AWS you basically... Yeah, I'm just talking to the folks online as well. Okay, so with AWS, when you get there, go to EC2. There's lots of stuff on AWS; EC2 is the bit where we get to rent computers by the hour, right? Now we're going to need a GPU-based instance. Unfortunately, when you first sign up for AWS, they don't give you access to them, so you have to request that access. So go to Limits, up on the top left, and the main GPU instance we'll be using is called the P2, so scroll down to P2, and here, p2.xlarge, you need to make sure that that number is not zero. If you've just got a new account, it probably is zero, which means you won't be allowed to create one. You have to go to Request limit increase, and the trick there is, when it asks you why you want the limit increase, type fast.ai, because AWS knows to look out for it, and they know that fast.ai people are good people, so they'll do it quite quickly. That takes a day or two, generally speaking, to go through. So once you get the email saying you've been approved for P2 instances, you can then go back here and say Launch Instance. And we've basically set up one that has everything you need. So if you click on Community AMIs (an AMI is an Amazon Machine Image, it's basically a completely set up computer) and type fastai, all one word, you'll find here fastai DL part 1 version 2 for the P2. So that's all set up, ready to go. So if you click on Select, then it'll say, okay,
what kind of computer do you want? And so we have to say, all right, I want a GPU compute type, and specifically I want a p2.xlarge. And then you can say Review and Launch. I'm assuming you already know how to deal with SSH keys and all that kind of stuff. If you don't, check out the introductory tutorials and workshop videos that we have online, or Google around for SSH keys. It's a very important skill to know anyway. All right, so hopefully you get through all that and you have something running on a GPU with the fastai repo. If you use Crestle, just cd fastai, the repo's already there, and git pull. On AWS, cd fastai, the repo is already there, and git pull. If it's your own computer, you just have to git clone, and then away you go. All right, so part of all of those is that PyTorch is pre-installed, and PyTorch basically means we can write code that looks a lot like NumPy, but it's going to run really quickly on the GPU. Secondly, since we need to know which direction and how much to move our parameters to improve our loss, we need to know the derivative of functions. PyTorch has this amazing thing where, for any code you write using the PyTorch library, it can automatically take the derivative of that for you. So we're not going to look at any calculus in this course, and I don't look at any calculus in any of my courses or in any of my work, basically ever, in terms of actually calculating derivatives myself, because I've never had to. It's done for me by the library. So as long as you write the Python code, the derivative is done. So the only calculus you really need to know to be an effective practitioner is what it means to be a derivative, and you also need to know the chain rule, which we'll come to. All right, so we're going to start out kind of top-down: create a neural net, and we're going to assume a whole bunch of stuff, and gradually we're going to dig into each piece. Right. So to create neural nets,
we need to import the PyTorch neural net library. PyTorch, funnily enough, is not called PyTorch; it's called torch. Okay, so torch.nn is the PyTorch subsection that's responsible for neural nets, and we'll call that nn. And then we're going to import a few bits out of fastai just to make life a bit easier for us. So here's how you create a neural network in PyTorch, the simplest possible neural network. You say Sequential, and Sequential means I am now going to give you a list of the layers that I want in my neural network. So in this case my list has two things in it. The first thing says I want a linear layer. A linear layer is something that's basically going to do y equals ax plus b, but as a matrix multiply, not univariate, obviously. So it's going to do a matrix product, basically. The input to the matrix product is going to be a vector of length 28 times 28, because that's how many pixels we have, and the output needs to be of size 10. We'll talk about why in a moment, but for now, this is how we define a linear layer. And then, again, we're going to dig into this in detail, but just about every linear layer in neural nets has to have a nonlinearity after it, and we're going to learn about this particular nonlinearity in a moment. It's called the softmax, and if you've done the DL course, you've already seen this. So that's how we define a neural net.
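A minimal sketch of what that definition might look like in plain PyTorch, without the fastai imports. It assumes the nonlinearity is nn.LogSoftmax, the log of the softmax (the layer named later when the network is rebuilt from scratch), and it picks the device with torch.cuda.is_available() so it also runs on a machine without an Nvidia GPU. The input batch is random stand-in data.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28 * 28, 10),  # matrix product plus bias: 784 pixels in, 10 numbers out
    nn.LogSoftmax(dim=1),    # log of softmax across the 10 outputs of each image
)

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = net.to(device)

# A stand-in batch of 64 flattened images.
batch = torch.randn(64, 28 * 28, device=device)
log_probs = net(batch)

print(log_probs.shape)                   # one row of 10 log-probabilities per image
print(log_probs.exp().sum(dim=1)[:3])    # each row of probabilities sums to 1
```

Because the last layer outputs log-probabilities, exponentiating a row gives ten probabilities that add up to one.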
This is a two-layer neural net. There's also kind of an implicit additional first layer, which is the input, but with PyTorch you don't have to explicitly mention the input, although normally we think conceptually of the input image as kind of also a layer. Because we're doing things pretty manually with PyTorch, and not taking advantage of any of the conveniences in fastai for building this stuff, we then have to write .cuda(), which tells PyTorch to copy this neural network across to the GPU. So from now on that network is going to be actually running on the GPU. If we didn't say that, it would run on the CPU. So that gives us back a neural net, a very simple neural net. We're then going to try and fit the neural net to some data, so we need some data. fastai has this concept of a model data object, which is basically something that wraps up training data, validation data, and optionally test data. And so to create a model data object, you can just say, I want to create some image classifier data, and I'm going to grab it from some arrays. And you just say, okay, this is the path where I'm going to save any temporary files, this is my training data arrays, and this is my validation data arrays. And so that just returns an object that's going to wrap that all up, and so we're going to be able to fit to that data. So now that we have a neural net and we have some data (we're going to come back to this in a moment), we basically say what loss function we want to use and what optimizer we want to use, and then we say fit. We say, fit this network to this data, going over every image once, using this loss function and this optimizer, and print out these metrics. Bang. Okay, and this says here
this is 91.8 percent accurate. Okay, so that's the simplest possible neural net. What that's doing is creating a matrix multiplication followed by a nonlinearity, and then it's trying to find the values for this matrix which fit the data as well as possible, that end up predicting: this is a one, this is a nine, this is a three. And so we need some definition for "as well as possible", and the general term for that thing is the loss function. So the loss function is the function that's going to be lower if this is better. Just like with random forests, where we had this concept of information gain, and we got to pick what function we wanted to use to define information gain, and we were mainly looking at root mean squared error, right? In most machine learning algorithms we call something very similar to that the loss. So the loss is how we score how good we are, and in the end we're going to calculate the derivative of the loss with respect to the weight matrix that we're multiplying by, to figure out how to update it. So we're going to use something called negative log likelihood loss. Negative log likelihood loss is also known as cross entropy. They're literally the same thing. There are two versions, one called binary cross entropy, or binary negative log likelihood, and another called categorical cross entropy. Same thing: one is for when you've only got a zero or one dependent, the other is for if you've got cat, dog, airplane, or horse, or zero, one, through nine, and so forth. So what we've got here is the binary version of cross entropy, and here is the definition. I think maybe the easiest way to understand this definition is to look at an example. So let's say we're trying to predict cat versus dog. One is cat, zero is dog. So here we've got cat, dog, dog, cat, and here are our predictions. We said 90 percent sure it's a cat, 90 percent sure
it's a dog, 80 percent sure it's a dog, 80 percent sure it's a cat. All right, so we can then calculate the binary cross entropy by calling our function. So it's going to say, okay, for the first one we've got y equals one, so it's going to be one times log of 0.9, plus one minus y, and one minus one is zero, so that's going to be skipped. Okay, and then for the second one y is going to be zero, so it's going to be zero times something, so that's going to be skipped, and then the second part will be one minus zero, so this is one times log of one minus p, and one minus 0.1 is 0.9. So in other words, the first piece and the second piece of this are going to give exactly the same number. Which makes sense, because for the first one we said we were 90 percent confident it was a cat and it was, and for the second we said we were 90 percent confident it was a dog and it was. So in each case the loss is coming from the fact that we could have been more confident. So if we'd said we were a hundred percent confident, the loss would have been zero. Okay, so let's look at that in Excel. So here are our predictions, 0.9, 0.1, 0.2, 0.8, and here are our actuals, 1, 0, 0, 1. So here's one minus the prediction, here is log of our prediction, here is log of one minus our prediction, and so then here is our sum. Okay. So if you think about it, and I want you to think about this during the week, you could replace this with an if statement rather than y, because y is always one or zero, right? So it's only ever going to use either this or this. So you could replace this with an if statement. So I'd like you during the week to try to rewrite this with an if statement, and then see if you can scale it out to be a categorical cross entropy. So categorical cross entropy works this way.
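Before moving on to the categorical case, the binary calculation just worked through can be sketched as below. This is one possible way to write it, not necessarily the function in the notebook: it averages over the four examples (an assumption; a plain sum would just be four times larger), and it includes one possible shape for the if-statement version set as the exercise.

```python
import numpy as np

def binary_loss(y, p):
    """Mean binary cross entropy. y holds 0/1 labels, p the predicted probability of class 1."""
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def binary_loss_if(y, p):
    """Same quantity, using an if instead of multiplying by y and (1 - y)."""
    total = 0.0
    for yi, pi in zip(y, p):
        if yi == 1:
            total -= np.log(pi)       # label is cat: score the probability of cat
        else:
            total -= np.log(1 - pi)   # label is dog: score the probability of dog
    return total / len(y)

y = np.array([1, 0, 0, 1])            # cat, dog, dog, cat
p = np.array([0.9, 0.1, 0.2, 0.8])    # predicted probability of "cat"

print(binary_loss(y, p))      # about 0.164
print(binary_loss_if(y, p))   # same value
```

The two 90 percent predictions each contribute -log(0.9) and the two 80 percent predictions each contribute -log(0.8), whichever side of the formula they come through.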
Let's say we were trying to predict three, and then six, and then seven, and then two. So if we were trying to predict three and the actual thing that was predicted was like 4.7... Well, actually, think of it this way: we're trying to predict three and we actually predicted five, or we're trying to predict three and we accidentally predicted nine. Being five instead of three is no better than being nine instead of three. So we're not actually going to say how far away the actual number is. We're going to express it differently. Or to put it another way, what if we're trying to predict cats, dogs, horses, and airplanes? You can't ask how far away cat is from horse. So we're going to express these a little bit differently. Rather than thinking of it as a three, let's think of it as a vector with a one in the third location, and rather than thinking of it as a six, let's think of it as a vector of zeros with a one in the sixth location. So in other words, one-hot encoding, right? So let's one-hot encode our dependent variable. And so that way now, rather than trying to predict a single number, let's predict ten numbers. Let's predict: what's the probability that it's a zero, what's the probability it's a one, what's the probability it's a two, and so forth. And so let's say we're trying to predict the two. Then here is our binary cross entropy, sorry, categorical cross entropy. So it's just saying, okay, did this one predict correctly or not, how far off was it, and so forth for each one, and so add them all up.
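A small sketch of one-hot encoding and of adding the loss up across categories, using the labels three, six, seven, two from above. The predicted probabilities are made up for illustration, the helper names are hypothetical, and the loss shown is the common form that only counts the log of the probability given to the correct class; whether the notebook's version has exactly this form is an assumption.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Turn each label into a vector of zeros with a single one at the label's index."""
    out = np.zeros((len(labels), n_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def categorical_loss(y_onehot, probs):
    """Mean over examples of -sum over classes of y * log(p)."""
    return np.mean(-np.sum(y_onehot * np.log(probs), axis=1))

labels = np.array([3, 6, 7, 2])
y = one_hot(labels, 10)   # digits 0..9, so label 3 puts its one at index 3

# Hypothetical predictions: 0.55 on the correct digit, 0.05 on each of the other nine.
probs = np.full((4, 10), 0.05)
probs[np.arange(4), labels] = 0.55

loss = categorical_loss(y, probs)   # equals -log(0.55) for these made-up numbers
print(y[0])
print(loss)
```

Multiplying by the one-hot vector plays the same role as the if statement in the binary case: it picks out the one term that matters for each example.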
So categorical cross entropy is identical to binary cross entropy; we just have to add it up across all of the categories. So try and turn the binary cross entropy function in Python into a categorical cross entropy function, and maybe create both the version with the if statement and the version with the sum and the product. All right, so that's why in our PyTorch we had ten as the output dimensionality for this matrix, because when we multiply something by a matrix with ten columns, we're going to end up with something of length ten, which is what we want. We want to have ten predictions. Okay, so that's the loss function that we're using. All right, so then we can fit the model, and what it does is it goes through every image this many times (in this case it's just looking at every image once) and slightly updates the values in that weight matrix based on those gradients. And so once we've trained it, we can then say predict, using this model, on the validation set, and now that spits out something of ten thousand by ten. Can somebody tell me why these predictions are of shape ten thousand by ten? Chris, it's right next to you. Well, it's because we have ten thousand images we're training on. Validating on, but same thing. So ten thousand we're validating on, so that's the first axis, and then the second axis is because we actually make ten predictions per image. Good, exactly. So each one of these rows is the probabilities that it's a nought, that it's a one, that it's a two, that it's a three, and so forth. Okay, very good. So in math, there's a really common operation we do called argmax, and when I say it's common, it's funny: at high school I never saw argmax, first-year undergrad I never saw argmax, but somehow after university everything's about argmax. So it's one of these things:
it's for some reason not really taught at school, but it actually turns out to be super critical. And so argmax is something that you'll see in math, where it's just written out in full, argmax, and it's in NumPy, it's in PyTorch, it's super important. And what it does is it says, let's take this array of preds and let's figure out, on this axis (remember, axis one is columns, right?), so across, as Chris said, the ten predictions for each row, let's find which prediction has the highest value, and return not the value itself (if we did max, it would return the value) but the index: argmax returns the index of the value. So by saying argmax, axis equals one, it's going to return the index, which is actually the number itself, right? So let's grab the first five. Okay, so for the first one it thinks it's a three, then it thinks the next one's an eight, the next one's a six, the next one's a nine, and the next one's a six again. Okay, so that's how we can convert our probabilities back into predictions. All right, so if we save that away and call it preds, we can then say, okay, when does preds equal the ground truth? So that's going to return an array of bools, which we can treat as ones and zeros, and the mean of a bunch of ones and zeros is just the average, so that gives us the accuracy. So there's our 91.8 percent, and you want to be able to replicate the numbers you see, and here it is: there's our 91.8 percent. All right, so when we train this, the last thing it tells us is whatever metric we asked for, and we asked for accuracy. Okay, so the last thing it tells us is our metric, which is accuracy, and then before that we get the training set loss (and the loss is again whatever it was we asked for, negative log likelihood), and the second thing is the validation set loss. PyTorch doesn't use the word loss; they use the word criterion. So you'll see here crit. Okay, so criterion equals loss. This is: what loss function
do we want to use? They call that the criterion. Same thing. Okay, so here is how we can recreate that accuracy. So now we can go ahead and plot eight of the images along with their predictions, and we've got three, eight, six, nine, wrong, five, wrong. Okay, and you can see why they're wrong. This is pretty close to a nine; it's just missing a little cross at the top. This is pretty close to a five; it's got a little bit of extra here, right? So we've made a start, and all we've done so far is, we haven't actually created a deep neural net. We've actually got only one layer. So what we've actually done is we've created a logistic regression. Okay, so a logistic regression is literally what we just built, and you could try and replicate this with scikit-learn's logistic regression package. When I did it, I got similar accuracy, but this version ran much faster because this is running on the GPU, whereas scikit-learn runs on the CPU. Okay, so even for something like logistic regression, we can implement it very quickly with PyTorch. Can you pass that to him? So when we're creating our net we have to do .cuda(). What would be the consequence of not doing that? Would it just not run? It wouldn't run quickly. Yeah, it'll run on the CPU. Can you pass it to Jade? So for the neural network, why is it that we have to do a linear layer followed by a nonlinearity? So the short answer is, because that's what the universal approximation theorem says is the structure which can give you arbitrarily accurate functions for any functional form. So the long answer is the details of why the universal approximation theorem works. Another version of the short answer is, that's the definition of a neural network. So the definition of a neural network is a linear layer followed by an activation function, followed by a linear layer, followed by an activation function, etc.
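To make the earlier accuracy calculation concrete, here is a small NumPy sketch of argmax versus max and of averaging booleans. The probabilities and labels are invented and much smaller than the real thing (5 images and 3 classes rather than 10,000 by 10), purely so the result can be checked by eye.

```python
import numpy as np

# Hypothetical predictions: one row per image, one column per class.
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
])
y_true = np.array([1, 0, 2, 2, 0])    # made-up ground truth labels

best = probs.max(axis=1)              # max returns the highest value in each row
preds = probs.argmax(axis=1)          # argmax returns the index of that value: [1 0 2 1 0]

correct = preds == y_true             # array of bools
accuracy = correct.mean()             # bools average as ones and zeros: 0.8

print(best, preds, accuracy)
```

Because the columns are ordered by class, the index returned by argmax is itself the predicted label.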
We go into a lot more detail on this in the deep learning course, but for this purpose it's enough to know that it works. So far, of course, we haven't actually built a deep neural net at all. We've just built a logistic regression. And so at this point, if you think about it, all we're doing is taking every input pixel and multiplying it by a weight for each possible outcome, right? So we're basically saying, on average the number one has these pixels turned on, the number two has these pixels turned on, and that's why it's not terribly accurate, right? That's not how digit recognition works in real life, but that's all we've built so far. Okay, can you pass that to Devin? So you keep saying this universal approximation theorem. Did you define that? Yeah, but let's cover it again because it's worth talking about. All right, so Michael Nielsen has this great website called Neural Networks and Deep Learning, and his chapter 4 is actually kind of famous now. In it, he does this walkthrough of basically showing that a neural network can approximate any other function to arbitrarily close accuracy, as long as it's big enough. We walk through this in a lot of detail in the deep learning course, but the basic trick is that he shows that with a few different numbers you can basically cause these things to create little boxes. You can move the boxes up and down, you can move them around, you can join them together to eventually create collections of towers, which you can use to approximate any kind of surface, right? So that's basically the trick, and so all we need to do, given that, is to find the parameters for each of the linear functions in that neural network, so to find the weights in each of the matrices. And so far we've got just one matrix, and so we've just built a simple logistic regression so far. Good.
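The "little boxes" trick can be sketched numerically. This is only an illustration of the idea, under several assumptions: it uses steep sigmoids to make the box edges rather than the max(0, x) nonlinearity used elsewhere in this lesson, it works in one dimension, and the target function, the number of boxes, and the helper names are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    # Clip so exp never overflows for very steep steps.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def box(x, left, right, height, steepness=1000.0):
    """A step up at `left` minus a step up at `right` leaves a box of the given height."""
    return height * (sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right)))

def approximate(f, x, n_boxes=50):
    """Sum of boxes on [0, 1], each as tall as f at the middle of its interval."""
    edges = np.linspace(0.0, 1.0, n_boxes + 1)
    total = np.zeros_like(x)
    for left, right in zip(edges[:-1], edges[1:]):
        total += box(x, left, right, f((left + right) / 2))
    return total

def target_fn(t):
    return np.sin(2 * np.pi * t)

x = np.linspace(0.05, 0.95, 500)
approx = approximate(target_fn, x)
max_error = np.max(np.abs(approx - target_fn(x)))
print(max_error)
```

Each box is two sigmoid units with a shared output weight, so the whole sum is one hidden layer followed by a linear layer, and using more, narrower boxes makes the error smaller.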
Did you have a question? Just a small doubt. I just want to confirm: when you showed the examples of the images which were misclassified, yeah, they look rectangular. So is it just that, while rendering, the pixels are being scaled differently? Are they still 28 by 28, square? I think they're square. I think they just look rectangular because they've got titles on the top. I'm not sure. Yeah, good question. I don't know. Anyway, they are square, and matplotlib does often fiddle around with what it considers black versus white, and having different size axes and stuff, so you do have to be very careful there sometimes. Okay, so hopefully this will now make more sense, because what we're going to do is dig in a layer deeper and define logistic regression without using nn.Sequential, without using nn.Linear, without using nn.LogSoftmax. So we're going to do nearly all of the layer definition from scratch. To do that, we're going to have to define a PyTorch module. A PyTorch module is basically either a neural net or a layer in a neural net, which is actually kind of a powerful concept in itself. Basically anything that can behave like a neural net can itself be part of another neural net, and this is how we can construct particularly powerful architectures by combining lots of other pieces. So to create a PyTorch module, just create a Python class, but it has to inherit from nn.Module.
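For anyone who has not met inheritance, here is a minimal plain-Python sketch of one class inheriting from another. `Layer` and `Scale` are hypothetical names invented for this example; `Layer` only imitates the general shape of a library base class and is not how nn.Module is implemented.

```python
class Layer:
    """Hypothetical stand-in for a library base class."""

    def __init__(self):
        self.params = {}              # bookkeeping that the base class sets up

    def register(self, name, value):
        self.params[name] = value

    def __call__(self, x):
        # Calling the object runs whatever forward() the subclass defined.
        return self.forward(x)


class Scale(Layer):                   # "(Layer)" means: inherit from Layer
    def __init__(self, factor):
        super().__init__()            # build the base class first, so self.params exists
        self.register("factor", factor)

    def forward(self, x):
        return x * self.params["factor"]


double = Scale(2)
result = double(21)
print(result)
```

`Scale` never defines `register` or `__call__` itself; it gets them from `Layer`. Skipping the `super().__init__()` line would leave `self.params` undefined, which is why the base class has to be initialized first.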
So we haven't done inheritance before. Other than that, this is all the same concepts we've seen in OO already. Basically, if you put something in parentheses here, what it means is that our class gets all of the functionality of this class for free. It's called subclassing it. So we're going to get all of the capabilities of a neural network module that the PyTorch authors have provided, and then we're going to add additional functionality to it. When you create a subclass, there is one key thing you need to remember to do, which is that when you initialize your class you have to first of all initialize the superclass. The superclass is the nn.Module, so the nn.Module has to be built before you can start adding your pieces to it. This is just something you can copy and paste into every one of your modules: you just say super().__init__(). This just means: construct the superclass first. Okay, so having done that, we can now go ahead and define our weights and our bias. Our weights is the weight matrix. It's the actual matrix that we're going to multiply our data by, and as we discussed it's going to have 28 times 28 rows and 10 columns. That's because if we take an image which we've flattened out into a vector of length 28 times 28, then we can multiply it by this weight matrix to get back out a length 10 vector, which we can then consider as a set of predictions. So that's our weight matrix. Now the problem is that we don't just want y equals ax, we want y equals ax plus b. The plus b in neural nets is called the bias, and so as well as defining weights we're also going to define a bias. Since this thing is going to spit out something of length 10 for every image, that means we need to create a vector of length 10 to be our biases. In other words, there is one bias for each of the digits 0, 1, 2, 3, up to 9.
We're going to have a different plus b that we'll be adding for each one, right? So we've got our data matrix here, which is of size 10,000 by 28 times 28. All right, and then we've got our weight matrix, which is 28 times 28 rows by 10. So if we multiply those together we get something of size 10,000 by 10, right? And then we want to add on our bias. Sorry, wrong way around. Add on our bias, okay, like so. And we're going to learn a lot more about this later, but when we add on a vector like this, it basically is going to get added to every row. Okay, so the bias is going to get added to every row. So we first of all define those, and to define them we've created a tiny little function called get_weights, which is over here, right? It basically just creates some normally distributed random numbers. So torch.randn returns a tensor filled with random numbers from a normal distribution. We have to be a bit careful, though. When we do deep learning, like when we add more linear layers later, imagine if we have a matrix which on average tends to increase the size of the inputs we give to it. If we then multiply by lots of matrices of that size, it's going to make the numbers bigger and bigger and bigger, like exponentially bigger. Or what if it made them a bit smaller?
It's going to make them smaller and smaller and smaller, exponentially smaller. So because a deep network applies lots of linear layers, if on average they result in things a bit bigger than they started with, or a bit smaller than they started with, it's going to exponentially multiply that difference. So we need to make sure that the weight matrix is of an appropriate size, so that the mean of the inputs to it is basically not going to change. It turns out that if you use normally distributed random numbers and divide by the number of rows in the weight matrix, that particular random initialization keeps your numbers at about the right scale, right? So this is the idea that, if you've done linear algebra, basically if the first eigenvalue is bigger than one or smaller than one, it's going to cause the gradients to get bigger and bigger or smaller and smaller. That's called gradient explosion, right? We'll talk more about this in the deep learning course, but if you're interested you can look up Kaiming He initialization and read all about this concept. But for now it's probably just enough to know that if you use this type of random number generation, you're going to get random numbers that are nicely behaved. You're going to start out with an input which is mean zero, standard deviation one. Once you put it through this set of random numbers, you'll still have something that's about mean zero, standard deviation one. That's basically the goal. Okay. One nice thing about PyTorch is that you can play with this stuff, right? So torch.randn, try it out. Every time you see a function being used, run it and take a look. And so you'll see it looks a lot like NumPy, right? But it doesn't return a NumPy array, it returns a tensor. And in fact, now I'm GPU programming. Okay, just put .cuda() and now it's doing it on the GPU.
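In the spirit of trying things out, here is a rough numerical check of the scaling argument in this passage, using torch.randn. The helper name `std_after_layers`, the sizes, and the layer count are assumptions for illustration. The square-root scaling is an addition for comparison and is not from the lecture; it is in the spirit of the Kaiming He scheme mentioned above, which also scales by a square root of the input size. The code runs on the GPU only if one is available.

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

def std_after_layers(scale_fn, n=256, n_layers=10, batch=512):
    """Push unit-variance inputs through n_layers random linear layers
    and report the standard deviation of what comes out."""
    x = torch.randn(batch, n, device=device)
    for _ in range(n_layers):
        w = scale_fn(torch.randn(n, n, device=device), n)
        x = x @ w
    return x.std().item()

unscaled = std_after_layers(lambda w, n: w)                 # plain randn
by_rows = std_after_layers(lambda w, n: w / n)              # divide by number of rows
by_sqrt_rows = std_after_layers(lambda w, n: w / n ** 0.5)  # divide by sqrt(rows)

print(f"unscaled: {unscaled:.3g}")
print(f"divided by rows: {by_rows:.3g}")
print(f"divided by sqrt(rows): {by_sqrt_rows:.3g}")
```

In this sketch the unscaled matrices make the values blow up layer after layer, dividing by the number of rows errs on the side of shrinking them, and dividing by the square root of the number of rows keeps the standard deviation near one. For the single layer used in this lesson the difference matters much less than it would in a deep stack.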
So I just multiplied that matrix by three, very quickly, on the GPU, right? So that's how we do GPU programming with PyTorch. So this is our weight matrix. As I said, we create one that is 28 times 28 by 10, and one that is just rank one, of length 10, for the biases. We have to make them a parameter. This is basically telling PyTorch which things to update when it does SGD. That's a very minor technical detail. So having created the weight matrices, we then define a special method with the name forward. The name forward has a special meaning in PyTorch: a method called forward is the method that will get called when your layer is calculated. Okay, so if you create a neural net or a layer you have to define forward, and it's going to get passed the data from the previous layer. Our definition is to do a matrix multiplication of our input data times our weights and add on the biases. So that's it. That's what happened earlier on when we said nn.Linear: it created this thing for us. Now unfortunately, though, we're not getting a 28 times 28 long vector, we're getting a 28 row by 28 column matrix, so we have to flatten it. Unfortunately, in PyTorch they tend to rename things: they spell reshape as view. Okay, so view means reshape. So you can see here we end up with something where the number of images we're going to leave the same, and then we're going to replace row by column with a single axis, again with negative one meaning as long as required. Okay, so this is how we flatten something using PyTorch. So we flatten it, do a matrix multiply, and then finally we do a softmax.
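Putting the pieces described so far together, a from-scratch module might look like the sketch below. It is reconstructed from the description rather than copied from the lesson notebook: `get_weights` is named in the lecture, but the class name `LogReg`, the attribute names, the `softmax` helper, and taking the log at the end (to mirror the nn.LogSoftmax being replaced) are assumptions. The softmax step itself is explained in the passage that follows.

```python
import torch
from torch import nn

def get_weights(*dims):
    # Normally distributed random numbers divided by the number of rows,
    # wrapped in nn.Parameter so PyTorch knows to update them during SGD.
    return nn.Parameter(torch.randn(*dims) / dims[0])

def softmax(x):
    # e to the power of each output, divided by the sum of those per row.
    # (Plain version for clarity; it is not hardened against huge inputs.)
    e = torch.exp(x)
    return e / e.sum(dim=1, keepdim=True)

class LogReg(nn.Module):
    def __init__(self):
        super().__init__()                     # construct nn.Module first
        self.weights = get_weights(28 * 28, 10)
        self.bias = get_weights(10)

    def forward(self, x):
        x = x.view(x.size(0), -1)              # (n, 28, 28) -> (n, 784); -1 means "as long as required"
        x = x @ self.weights + self.bias       # the bias is added to every row
        return torch.log(softmax(x))           # log-probabilities, one per digit

net = LogReg()
if torch.cuda.is_available():
    net = net.cuda()                           # optional: without it the net runs on the CPU

device = next(net.parameters()).device
batch = torch.randn(4, 28, 28, device=device)  # random stand-in for four images
log_probs = net(batch)                         # calling the module runs forward()
print(log_probs.shape)
```

Because the weights and bias are nn.Parameter objects assigned as attributes, `net.parameters()` finds both of them, which is how an optimizer would know what to update.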
So softmax is the activation function we use. If you look in the deep learning repo, you'll find something called entropy example, where you'll see an example of softmax. A softmax simply takes the outputs from our final layer, so we get our outputs from our linear layer, and what we do is take e to the power of each output, and then we take that number and divide by the sum of all the e to the powers. That's called softmax. Why do we do that? Well, because we're dividing by the sum, that means that those values must themselves add to one, right? And that's what we want: we want the probabilities of all the possible outcomes to add to one. Furthermore, because we're using e to the power of, we know that every one of these is positive, so after dividing by the sum each one is between zero and one, and probabilities, we know, should be between zero and one. And then finally, because we're using e to the power of, it tends to mean that slightly bigger values in the input turn into much bigger values in the output. So you'll see that, generally speaking, in my softmax there is going to be one big number and lots of small numbers. And that's what we want, right? Because we know that the output is one-hot encoded. So in other words, a softmax activation function, the softmax nonlinearity, is something that returns things that behave like probabilities, where one of those probabilities is more likely to be kind of high and the other ones are more likely to be low, and we know that's what we want to map to our one-hot encoding. So a softmax is a great activation function to use to make it easier for the neural net to map to the output that you wanted. This is what we generally want when we're designing neural networks: we try to come up with little architectural tweaks that make it as easy as possible for it to match the output that we know we want. So that's basically it, right? Rather than doing Sequential and using nn.Linear and nn.Softmax, we
have defined it from scratch. We can now say, just like before, our net is equal to that class, .cuda(), and we can say .fit, and we get, to within a slight random deviation, exactly the same output. Okay, so what I'd like you to do during the week is to play around with torch.randn to generate some random tensors, torch.matmul to start multiplying them together, adding them up. Try to make sure that you can rewrite softmax yourself from scratch. Try to fiddle around a bit with reshaping, view, all that kind of stuff, so that by the time you come back next week you feel pretty comfortable with PyTorch. And if you Google for PyTorch tutorial, you'll see there's a lot of great material actually on the PyTorch website to help you along, basically showing you how to create tensors and modify them and do operations on them. All right, great. Yes, you had a question. Can you pass it over? So I see that the forward is the layer that gets applied after each of the linear layers. Well, not quite. The forward is just the definition of the module, so this is how we're implementing linear. So does that mean that after each linear layer we have to apply the same function? Let's say, can't we do a log softmax after layer one and then apply some other function after layer two, if we have a multi-layer neural network? So normally we define neural networks like so: we just say, here is a list of the layers we want, right? You don't have to write your own forward. All we did just now was to say, okay, instead of doing this, let's not use any of this at all, but write it all by hand ourselves. So you can write as many layers as you like, in any order you like, here. The point was that here we're not using any of that.
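For contrast, the "list of the layers we want" style might look like the sketch below, with a different activation after each linear layer. The hidden size of 100 and the choice of ReLU after the first layer are assumptions made for illustration, not settings taken from the lecture.

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Linear(28 * 28, 100),   # linear layer 1 (hidden size 100 is an arbitrary choice)
    nn.ReLU(),                 # activation applied after layer 1
    nn.Linear(100, 10),        # linear layer 2, one output per digit
    nn.LogSoftmax(dim=1),      # a different activation applied after layer 2
)

images = torch.randn(4, 28, 28)              # random stand-in for four images
flat = images.view(images.size(0), -1)       # flatten each image to length 784
log_probs = net(flat)                        # runs every layer in the listed order
print(log_probs.shape)
```

No forward method is written here: nn.Sequential supplies one that simply passes the data through each listed layer in turn, so the layers and activations can be mixed in whatever order is needed.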
We've written our own matmul plus bias, our own softmax, so this is just Python code. You can write whatever Python code you like inside forward to define your own neural net. You won't normally do this yourself. Normally you'll just use the layers that PyTorch provides and use nn.Sequential to put them together, or, even more likely, you'll download a predefined architecture and use that. We're just doing this to learn how it works behind the scenes. All right, great. Thanks, everybody.