All right, welcome back to Machine Learning 1. I'm really excited to share some amazing stuff that University of San Francisco students have built, or written about, during the week. Quite a few of the things I'm going to show you have already spread around the internet quite a bit: lots of tweets and posts and all kinds of stuff happening.

One of the first to be widely shared was this one by Tyler, who did something really interesting. He started out by saying: what if I create a synthetic data set where the independent variables are the x and the y, and the dependent variable is color? Interestingly, he showed me an earlier version of this where he wasn't using color; he was just putting the actual numbers in, and it wasn't really working at all. As soon as he started using color, it started working really well. So I wanted to mention that one of the things that, unfortunately, we don't teach you at USF is the theory of human perception. Perhaps we should, because when it comes to visualization, the most important thing to know is what the human eye, or the human brain, is good at perceiving. There's a whole area of academic study on this, and one of the things we're best at perceiving is differences in color. That's why, as soon as you look at this picture of the synthetic data he created, you can immediately see there are four areas of lighter red color.

So what he did was say: okay, what if we tried to create a machine learning model of this synthetic data set? Specifically, he created a tree, and the cool thing is that you can actually draw the tree. After he created the tree, he did all of this in matplotlib; matplotlib is very flexible.
He actually drew the tree boundaries. That's already a pretty neat trick: being able to draw the tree. But then he did something even cleverer. He said: okay, what predictions does the tree make? Well, it's the average of each of these areas, and so to show that, we can actually draw the average color. It's actually kind of pretty. Here are the predictions that the tree makes.

Now here's where it gets really interesting. As you know, you can randomly generate trees through resampling, and so here are four trees generated through resampling. They're all pretty similar, but a little bit different. And so now we can actually visualize bagging: to visualize bagging, we literally take the average of the four pictures. That's what bagging is. And there it is: the fuzzy decision boundaries of a random forest. I think this is kind of amazing, because I wish I'd had this when I started teaching you all random forests; I could have skipped a couple of classes. It's just like: okay.
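That averaging step can be sketched numerically. The per-region predictions from four "trees" below are made-up numbers, but the bagging operation itself, an element-wise mean across trees, is exactly the one being visualized:

```python
# Hypothetical predictions over a 2x2 grid of regions, from four trees
# trained on different resamples of the data (values are invented).
tree_preds = [
    [[0.9, 0.2], [0.1, 0.8]],
    [[0.8, 0.3], [0.2, 0.7]],
    [[0.7, 0.1], [0.3, 0.9]],
    [[0.6, 0.2], [0.2, 0.6]],
]

# Bagging: the forest's prediction for each region is simply the mean
# of the individual trees' predictions for that region.
n = len(tree_preds)
bagged = [
    [sum(t[r][c] for t in tree_preds) / n for c in range(2)]
    for r in range(2)
]
print(bagged)  # roughly [[0.75, 0.2], [0.2, 0.75]]
```

Averaging the four pictures pixel by pixel is just this operation applied to every point of the plane.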
That's what we do: we create the decision boundaries, we average each area, and then we do it a few times and average all of them. That's what a random forest does, and I think this is just such a great example of making the complex easy through pictures. So congrats to Tyler for that.

It actually turns out that he has reinvented something that somebody else has already done: a guy called Criminisi, who went on to be one of the world's foremost machine learning researchers, included almost exactly this technique in a book he wrote about decision forests. So it's actually kind of cool that Tyler ended up reinventing something that one of the world's foremost authorities on decision forests created. And that's nice, because when we posted this on Twitter it got a lot of attention, and finally somebody out there was able to say: oh, you know what, this actually already exists. So Tyler's gone away and started reading that book.

Something else which is super cool: Jason Carpenter created a whole new library called parfit, which is parallelized fitting of multiple models for the purpose of selecting hyperparameters. There's a lot I really like about this. He's shown a clear example of how to use it, and the API looks very similar to other grid-search-based approaches, but it uses the validation techniques that Rachel wrote about and that we learned about a couple of weeks ago: using a good validation set. In his blog post introducing it, he's gone right back and asked: what are hyperparameters? Why do we have to tune them? He's explained every step, and the module itself is very polished; he's added documentation to it.
He's added a nice readme to it. And it's kind of interesting: when you actually look at the code, you realize it's very simple, which is definitely not a bad thing. Making things simple is a good thing. By writing this little bit of code and then packaging it up so nicely, he's made it really easy for other people to use this technique, which is great. And one of the things I've been really thrilled to see is that Vinay then went along and combined two things from our class: one was to take parfit, and the other was to take the kind of accelerated SGD approach to classification that we learned about in the last lesson, and combine the two to say: okay, let's now use parfit to help us find the parameters of an SGD logistic regression. I think that's a really great idea.

Something else which I thought was terrific: Prince basically went through and summarized pretty much all the stuff we learned in the random forest interpretation lessons, and he went even further than that. As he described each of the different approaches to random forest interpretation, he described how it's done. So here, for example, is feature importance through variable permutation, with a little picture of each one, and then, super cool, here is the code to implement it from scratch. I think this is a really nice post, describing something that not many people understand and showing exactly how it works, both with pictures and with code that implements it from scratch. One of the things I really like here is that for the tree interpreter part, he actually showed how you can take the tree interpreter output and feed it into the new waterfall chart package that Chris, our USF student, built, to show how you can visualize the contributions of the tree interpreter in a waterfall chart.
So again, a nice combination of multiple pieces of technology that we both learned about and built as a group.

I also really liked this kernel. There have been a few interesting kernels shared, and I'll share some more next week. Devesh wrote this really nice kernel on a quite challenging Kaggle competition about detecting icebergs versus ships. It's a kind of weird two-channel satellite data, which is very hard to visualize, and he went through and described the formulas for how these radar scattering things actually work, and then managed to come up with code that allowed him to recreate the actual 3D icebergs or ships. I have not seen that done before; it's quite challenging to know how to visualize this data. And then he went on to show how to build a neural net to try to interpret it. So that was pretty fantastic as well.

So, yeah, congratulations to all of you. I know for a lot of you, you're posting stuff out there to the rest of the world for the first time, and it's kind of intimidating. You're used to writing stuff that you hand in to a teacher, and they're the only one who sees it, and it's kind of scary the first time you do it. But then the first time somebody upvotes your Kaggle kernel or adds a clap to your Medium post, you suddenly realize: oh, I've actually written something that people like. That's pretty great. So if you haven't tried it yourself yet, I again invite you to try writing something. If you're not sure what, you could write a summary of a lesson, or a summary of something you found hard. Maybe you found it hard to fire up a GPU-based AWS instance, and you eventually figured it out.
You could write down how you solved that problem. Or if one of your classmates didn't understand something and you explained it to them, you could write something saying: there's this concept that some people have trouble understanding; here's a good way I think of explaining it. There's all kinds of stuff you could do.

Okay, so let's go back to SGD. We're going back through this notebook which Rachel put together, basically taking us through SGD from scratch for the purpose of digit recognition. Quite a lot of the stuff we look at today is going to closely follow part of the computational linear algebra course, which you can find both as a MOOC on fast.ai and at USF, where it will be an elective next year. So if you find this stuff interesting, and I hope you do, then please consider signing up for the elective or checking out the videos online.

So we're building neural networks, and we're starting with the assumption that we've downloaded the MNIST data and normalized it by subtracting the mean and dividing by the standard deviation. The data is slightly unusual in that, although the rows represent images, each image was downloaded as a 784-long rank-1 tensor; it's been flattened out. For the purpose of drawing pictures of it, we had to resize it to 28 by 28, but the actual data we've got is not 28 by 28.
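That preprocessing can be sketched like this, with random numbers standing in for the real pixel values (the actual notebook loads the MNIST files instead):

```python
import numpy as np

# Stand-in for the flattened MNIST training images: rows of 784 pixel values.
x = np.random.rand(1000, 784) * 255

# Normalize: subtract the mean, divide by the standard deviation.
mean, std = x.mean(), x.std()
x_norm = (x - mean) / std
print(x_norm.mean(), x_norm.std())  # roughly 0 and 1

# Each row is a rank-1 tensor of length 784; to draw it, reshape to 28x28.
img = x_norm[0].reshape(28, 28)
print(img.shape)  # (28, 28)
```

The validation set would be normalized with the training set's mean and standard deviation, so both share one scale.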
It's 784 long, flattened out.

The basic steps we're going to take here: we'll start out by training the world's simplest neural network, basically a logistic regression, so no hidden layers. We're going to train it using the library fastai, and we're going to build the network using the library PyTorch. Then we're going to gradually get rid of all the libraries. First we'll get rid of the nn neural net library in PyTorch and write that ourselves; then we'll get rid of the fastai fit function and write that ourselves; and then we'll get rid of the PyTorch optimizer and write that ourselves. By the end of this notebook, we'll have written all the pieces ourselves. The only things we'll end up relying on are the two key things that PyTorch gives us: (a) the ability to write Python code and have it run on the GPU, and (b) the ability to write Python code and have it automatically differentiated for us. Those are the two things we're not going to attempt to write ourselves, because it's boring and pointless. Everything else we'll try to write ourselves on top of those two things.

So our starting point is not doing anything ourselves; it's having it all done for us. PyTorch has an nn library, which is where the neural net stuff lives. You can create a multi-layer neural network by using the Sequential function and passing in a list of the layers that you want. We asked for a linear layer followed by a softmax layer, and that defines our logistic regression. The input to our linear layer is 28 by 28, as we just discussed, and the output is 10, because we want a probability for each of the digits nought through nine, for each of our images. `.cuda()` sticks it on the GPU, and then `fit` fits the model.
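A minimal sketch of this starting point in plain PyTorch; fastai's fit, the `.cuda()` call, and the real data are omitted, and a random tensor stands in for a mini-batch:

```python
import torch
from torch import nn

# Logistic regression: one linear layer, then log-softmax over the 10 digits.
net = nn.Sequential(
    nn.Linear(28 * 28, 10),   # 784 inputs -> 10 outputs
    nn.LogSoftmax(dim=-1),    # log of softmax, pairs with NLL loss
)

x = torch.randn(64, 28 * 28)  # fake mini-batch of 64 flattened "images"
log_probs = net(x)
print(log_probs.shape)        # torch.Size([64, 10])
```

In the notebook, `nn.NLLLoss` plus `optim.Adam` then do the fitting that `fit` wraps up for us.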
Okay, so we start out with a random set of weights, and then fit uses gradient descent to make them better. We had to tell the fit function what criterion to use, in other words what counts as better, and we told it to use negative log likelihood; we'll learn what that is exactly in the next lesson. We had to tell it what optimizer to use, and we said: please use optim.Adam. The details of that we won't cover in this course; we're going to build something simpler called SGD. If you're interested in Adam, we just covered it in the deep learning course. And we had to say what metrics we want to print out; we decided to print out accuracy. So that was that, and after we fit it, we get an accuracy of generally somewhere around 91 to 92 percent.

What we're going to do from here is repeat this exact same thing. We're going to rebuild this model four or five times, building it and fitting it with fewer and fewer libraries. So the second thing we did last time was to start to define the module ourselves. Instead of saying the network is a sequential bunch of these layers, let's not use that library at all and try to define it ourselves from scratch. To do that, we have to use OO, object orientation, because that's how we build everything in PyTorch, and we have to create a class which inherits from nn.Module. nn.Module is a PyTorch class that takes our class and turns it into a neural network module, which basically means that anything you inherit from nn.Module you can pretty much insert into a neural network as a layer, or treat as a neural network. It's going to get all the stuff that it needs automatically to work as part of, or as a full, neural network; we'll talk about exactly what that means today and in the next lesson. So we need to construct the object, which means we need to define the constructor, dunder init (`__init__`), and then,
importantly, and this is a Python thing: if you inherit from some other object, you have to construct the thing you inherit from first. So when you say `super().__init__()`, that says: construct the nn.Module piece of this first. If you don't do that, the nn.Module stuff never gets a chance to actually get constructed. This is just a standard Python OO subclass constructor, and if any of that's unclear to you, this is where you definitely want to grab a Python intro to OO; this is the standard approach.

So inside our constructor, we want to do the equivalent of nn.Linear. What nn.Linear is doing is taking our 28 by 28 image, so a 784-long vector, and making that the input to a matrix multiplication. So we now need to create something with 784 rows and 10 columns. The input is going to be a mini-batch of size 64 by 784, so we're going to do this matrix product: (64 x 784) @ (784 x 10). When we say nn.Linear in PyTorch, it's going to construct this matrix for us; since we're not using that, and we're doing things from scratch, we need to make it ourselves. To make it ourselves, we can say: generate normally distributed random numbers with this dimensionality, which we passed in here, 784 by 10. That gives us our randomly initialized matrix. Then we don't just want y = ax; we want y = ax + b. So we need to add on what we call, in neural nets, a bias vector. We create here a bias vector of length 10, again randomly initialized. And so now here are our two randomly initialized weight tensors. That's our constructor.

Now we need to define forward. Why do we need to define forward?
This is a PyTorch-specific thing. When you create a module in PyTorch, the object that you get back behaves as if it's a function: you can call it with parentheses, which we'll do in a moment. So you need to somehow define what happens when you call it as if it's a function, and the answer is: PyTorch calls a method called forward. That's just the approach PyTorch picked. So when it calls forward, we need to do the actual calculation of the output of this module, or layer.

Here is the thing that actually gets calculated in our logistic regression. We take our input x, which gets passed to forward (that's how forward works: it gets passed the mini-batch), and we matrix-multiply it by the layer-one weights which we defined up here, and then we add on the layer-one bias which we defined up here. And actually, nowadays we can define this a little more elegantly using the Python 3 matrix multiplication operator, which is the at sign (@). When you use that, I think you end up with something that looks closer to the mathematical notation, and so I find it nicer. So that's our linear layer, in our logistic regression, in our zero-hidden-layer neural net. Then the next thing we do to it is a softmax.

So we get the output of this matrix multiply. Who wants to tell me the dimensionality of the output of this matrix multiply? 64 by 10. Thank you, Karen. I should mention, for those of you who weren't at the deep learning class yesterday, we actually looked at a really cool post from Karen, who described how to do structured data analysis with neural nets, which has been super popular, and a whole bunch of people have said that they've read it and found it super interesting.
So that was really exciting.

We get this matrix of outputs, and we put it through a softmax. Why do we put it through a softmax? Because in the end, for every image, we want a probability that it's a zero, or a one, or a two, and so on. So we want a bunch of probabilities that add up to one, where each of those probabilities is between zero and one, and a softmax does exactly that for us.

For example, if we weren't picking out the digits 0 through 9, but instead were picking out cat, dog, plane, fish, building, the output of that matrix multiply for one particular image might look like this; these are just some random numbers. To turn that into a softmax, I first take e to the power of each of those numbers. I sum up those exponentials, and then I take each exponential and divide it by the sum. That's softmax; that's the definition of softmax. Because of the e to the power of, it's always positive; because of the division by the sum, each one is always between zero and one; and it also means, because they're divided by the sum, that they always add up to one.

So we apply this softmax activation function. Any time we have a layer of outputs, which we call activations, and we apply some nonlinear function to them that maps one scalar to one scalar, like softmax does, we call that an activation function. The softmax activation function takes the outputs and turns them into something that behaves like a probability. We don't strictly speaking need it; we could still try to train something where the output is directly the probabilities. But by using this function, which automatically makes them always behave like probabilities, there's less for the network to learn, so it's going to learn better.
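Here's that calculation as a minimal sketch; the five raw outputs are made-up numbers standing in for one row of the matrix multiply:

```python
import math

def softmax(xs):
    """exp each value, then divide each exp by the sum of all the exps."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for one image over cat/dog/plane/fish/building:
logits = [2.0, -1.0, 0.5, 1.0, -0.5]
probs = softmax(logits)
print(probs)       # each between 0 and 1, largest where the input was largest
print(sum(probs))  # 1.0, up to floating point
```

Note how the exponential exaggerates differences: the biggest input grabs most of the probability mass, which is exactly the single-category behavior described next.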
So generally speaking, whenever we design an architecture, we try to design it in a way where it's as easy as possible for it to create something of the form that we want. That's why we use softmax.

So those are the basic steps. We have our input, which is a bunch of images; it gets multiplied by a weight matrix, and we also add on a bias, to get the output of the linear function; we put that through a nonlinear activation function, in this case softmax, and that gives us our probabilities.

PyTorch also tends to use the log of softmax, for reasons that don't particularly need to bother us now; it's basically a numerical stability convenience. So to make this the same as our version up here that used log-softmax, I'm going to use log here as well. We can now instantiate this class, that is, create an object of this class.

"I have a question, back on the probabilities from before. If we were to have a photo with a cat and a dog together, would that change the way this works, or does it work in the same basic way?" Yeah, that's a great question. If you had a photo with a cat and a dog together, and you wanted it to spit out both cat and dog, this would be a very poor choice. Softmax is specifically the activation function we use for categorical predictions where we only ever want to predict one of those things. Part of the reason why is that, as you can see, because we're using e to the power of, the slightly bigger numbers create much bigger numbers, as a result of which we generally have just one or two things that are large and everything else pretty small. If I recalculate these random numbers a few times, you'll see it tends to be a bunch of near-zeros and one or two high numbers.
So it's really designed to make it easy to predict: this one thing is the thing I want. If you're doing multi-label prediction, where you want to find all the things in the image, then rather than softmax we would instead use sigmoid. A sigmoid would cause each of these to be between zero and one, but they would no longer add up to one. It's a good question, and a lot of these details about best practices are things we cover in the deep learning course; we won't cover heaps of them here in the machine learning course, where we're more interested in the mechanics, but we'll try to touch on them quickly.

All right, so now that we've got that, we can instantiate an object of that class, and of course we want to copy it over to the GPU so we can do computations over there. Again, we need an optimizer; we'll be talking about what that is shortly. But you'll see here we've called a function on our class called parameters, yet we never defined a method called parameters. The reason that works is that it was actually defined for us inside nn.Module. nn.Module automatically goes through the attributes we've created and finds anything that we said is a parameter, and the way you say something is a parameter is to wrap it in nn.Parameter. This is just the way you tell PyTorch: this is something that I want to optimize. So when we created the weight matrix, we just wrapped it with nn.Parameter. It's exactly the same as a regular PyTorch variable, which we'll learn about shortly; it's just a little flag that says: hey, you should optimize this. And so when you call net2.parameters() on the net2 object we created, it goes through everything we created in the constructor, checks whether any of them are of type Parameter, and if so, sets all of those as things we want to train with the optimizer. We'll be implementing the optimizer from scratch later.
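Putting those pieces together, the constructor with nn.Parameter weights, forward with the @ operator, and the inherited parameters() lookup, here's a minimal sketch; the class and attribute names here are my own, not necessarily the notebook's:

```python
import torch
from torch import nn

class LogReg(nn.Module):
    """Logistic regression from scratch: our own weights, our own forward."""
    def __init__(self):
        super().__init__()                       # construct the nn.Module piece first
        # nn.Parameter flags these tensors as things the optimizer should train.
        self.l1_w = nn.Parameter(torch.randn(784, 10))
        self.l1_b = nn.Parameter(torch.randn(10))

    def forward(self, x):
        # (64 x 784) @ (784 x 10) + (10,) -> (64 x 10), then log-softmax
        return torch.log_softmax(x @ self.l1_w + self.l1_b, dim=-1)

net2 = LogReg()
out = net2(torch.randn(64, 784))      # called like a function; PyTorch runs forward
print(out.shape)                      # torch.Size([64, 10])
print(len(list(net2.parameters())))   # 2: the weight matrix and the bias
```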
Okay, so having done that, we can fit, and we should get basically the same answer as before: 91-ish. So that looks good.

Now, what have we actually built here? What we've actually built, as I said, is something that can behave like a regular function, so I want to show you how we can call it as a function. To call it as a function, we need to be able to pass data to it, and to do that, I'm going to need to grab a mini-batch of MNIST images. We used, for convenience, the `ImageClassifierData.from_arrays` method from fastai, and what that does is create a PyTorch data loader for us. A PyTorch data loader is something that grabs a few images and sticks them into a mini-batch, and makes them available; you can basically say: give me another mini-batch, give me another mini-batch, give me another mini-batch. In Python, we call these things generators: things where you can basically say, I want another, I want another, I want another. There's a very close connection between iterators and generators; we're not going to worry about the difference between them right now. But to actually get hold of something we can ask for another of, in order to grab something we can use to generate mini-batches, we take our data loader. You can ask for the training data loader from our model data object; you'll see there's a bunch of different data loaders you can ask for: the test data loader, the train data loader, the validation data loader, the augmented-images data loader, and so forth. We're going to grab the training data loader that was created for us. This is a standard PyTorch data loader.
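The data loader behaviour just described, grab another mini-batch on demand, can be sketched with a plain Python generator; the list of numbers here is a made-up stand-in for the images:

```python
def batches(data, bs=64):
    """A generator that yields one mini-batch at a time, like a data loader."""
    for i in range(0, len(data), bs):
        yield data[i:i + bs]

data = list(range(200))       # stand-in for 200 samples
dl = iter(batches(data))      # something we can ask "give me another" of

xb = next(dl)                 # grab one mini-batch
print(len(xb))                # 64

# The for-loop form is syntactic sugar for calling next repeatedly:
sizes = [len(b) for b in batches(data)]
print(sizes)                  # [64, 64, 64, 8]
```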
Well, slightly optimized by us, but it's the same idea. And then you can say, and this is standard Python, turn that into an iterator: something we can grab one mini-batch at a time from. Once you've done that, we've got something we can iterate through, and you can use the standard Python next function to grab one more thing from that generator. So that's returning the x's from our mini-batch and the y's from our mini-batch.

The other way you can use generators and iterators in Python is with a for loop. I could also have said: for x_mini_batch, y_mini_batch in data_loader, and then do something. When you do that, behind the scenes it's basically syntactic sugar for calling next lots of times. So this is all standard Python stuff.

That returns a tensor of size 64 by 784, as we would expect; the fastai library we used defaults to a mini-batch size of 64, which is why it's that long. These are all of the background zero pixels, but they're not actually zero in this case. Why aren't they zero? They're normalized, exactly right: we subtracted the mean and divided by the standard deviation.

So there it is. Now what we want to do is pass that into our logistic regression. So what we might do is: x_mb = Variable(x_mini_batch.cuda()). I take my x mini-batch, and I move it onto the GPU, because remember, my net2 object is on the GPU, so the data for it also has to be on the GPU. The second thing I do is wrap it in Variable. So what does Variable do? This is how we get, for free, automatic differentiation. PyTorch can automatically differentiate pretty much anything, any tensor, but doing so takes memory and time, so it's not going to always keep track. To do automatic differentiation, it has to keep track of exactly how something was calculated:
we added these things together, we multiplied by that, we then took the sine, and so forth. It has to know all of the steps, because to do the automatic differentiation, it has to take the derivative of each step, using the chain rule, and multiply them all together. That's slow and memory-intensive, so we have to opt in, saying: okay, this particular thing, we're going to be taking the derivative of later, so please keep track of all of those operations for us. And the way we opt in is by wrapping a tensor in a Variable. That's how we do it. You'll see that it looks almost exactly like a tensor, but it now says "Variable containing" this tensor. In PyTorch, a Variable has exactly the same API as a tensor, or more specifically a superset of the API of a tensor: anything we can do to a tensor, we can do to a Variable, but it's going to keep track of exactly what we did, so we can later take the derivative.

So we can now pass that into our net2 object, and remember, I said you can treat it as if it's a function. Notice we're not calling .forward(); we're just treating it as a function. And then remember, we took the log, so to undo that, I'm taking the exp, and that gives me my probabilities. So there are my probabilities, and it returns something of size 64 by 10: for each image in the mini-batch, we've got 10 probabilities. You'll see most probabilities are pretty close to zero, and a few of them are quite a bit bigger, which is exactly what we would hope: it's saying, it's not a zero, it's not a one, it's not a two, it is a three, it's not a four, it's not a five, and so forth.

Maybe this would be a bit easier to read if we just grab the first three of them. So it's like 10 to the minus 3, 10 to the minus 8, and so on, and then suddenly here's one which is 10 to the minus 1.
So you can kind of see what it's trying to do here. I mean, we could call net2.forward() and it would do exactly the same thing, but that's not how the PyTorch mechanics actually work; they call it as if it's a function. And this is actually a really important idea, because it means that when we define our own architectures, or whatever, anywhere that you would put in a function, you can put in a layer; anywhere you put in a layer, you can put in a neural net; anywhere you put in a neural net, you can put in a function; because as far as PyTorch is concerned, they're all just things that it's going to call as if they're functions. So they're all interchangeable, and this is really important, because that's how we create really good neural nets: by mixing and matching lots of pieces and putting them all together.

Let me give you an example. Here is my logistic regression, which got 91-and-a-bit percent accuracy. I'm now going to turn it into a neural network with one hidden layer, and the way I'm going to do that is to create one more layer. I'm going to change this so it spits out a hundred rather than ten, which means this one's input is going to be a hundred rather than ten. Now this, as it is, can't possibly make things any better at all yet. Why is this definitely not going to be better than what I had before? Can somebody pass the box? "Because you've got a combination of two linear layers, which is the same as one linear layer with different parameters." Exactly right. We've got two linear layers, which is just a linear layer.
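Two stacked linear layers collapse into one, as just noted; the fix, explained next, is to put a nonlinearity between them. A sketch of the resulting one-hidden-layer net in plain PyTorch, with the sizes from the lecture and a random tensor standing in for a mini-batch:

```python
import torch
from torch import nn

# One hidden layer: linear -> ReLU (zero out negatives) -> linear -> log-softmax.
net = nn.Sequential(
    nn.Linear(28 * 28, 100),  # first layer now spits out 100 activations
    nn.ReLU(),                # the nonlinearity that makes the extra layer count
    nn.Linear(100, 10),       # 100 hidden activations -> 10 outputs
    nn.LogSoftmax(dim=-1),
)

out = net(torch.randn(64, 28 * 28))
print(out.shape)              # torch.Size([64, 10])
```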
So to make things interesting, I'm going to replace all of the negatives from the first layer with zeros, because that's a nonlinear transformation, and that nonlinear transformation is called a rectified linear unit, ReLU. nn.Sequential simply calls each of these layers in turn for each mini-batch: do a linear layer, replace all of the negatives with zero, do another linear layer, and do a softmax. This is now a neural network with one hidden layer, so let's try training that instead. Accuracy is now going up to 96%. So the idea is that the basic techniques we're learning in this lesson become powerful at the point where you start stacking them together.

Can somebody pass the green box? There, and then there. Yes, Daniel. "Why did you pick a hundred?" No reason; it was easier to type an extra zero. This question of how many activations you should have in a neural network layer is part of the skill of a deep learning practitioner; we cover it in the deep learning course, not in this course.

"When adding that additional, I guess, transformation..." Additional layer. This one here is called a nonlinear layer, or an activation function. "Does it matter if you had done, for example, two softmaxes, or is that something you cannot do?" No, you can absolutely use a softmax there, but it's probably not going to give you what you want, and the reason why is that a softmax tends to push most of its activations to zero. And an activation, just to be clear, since I've had a lot of questions in the deep learning course about what an activation is: an activation is a value that is calculated in a layer.
So this is an activation. It's not a weight — a weight is not an activation; an activation is the value you calculate from a layer. So softmax will tend to make most of its activations pretty close to zero, and that's the opposite of what you want: you generally want your activations to be as rich and diverse and as fully used as possible. Nothing to stop you doing it, but it probably won't work very well. Pretty much all of your layers will be followed by nonlinear activation functions, and those will nearly always be ReLU — except for the last layer. "When doing multiple layers — say, going two or three layers deep — do you want to switch up these activation layers?" No. That's a great question. If I wanted to go deeper, I would just do that — okay, that's now a two-hidden-layer network. "I think I heard you say there are a couple of different activation functions, like the rectified linear unit. What are some examples, and why would you use each?" Yeah, great question. Basically, as you add more linear layers, your input comes in and you put it through a linear layer, then a nonlinear layer, linear layer, nonlinear layer, maybe more linear layers, and then the final nonlinear layer. The final nonlinear layer, as we've discussed: if it's a multi-category classification where you only ever pick one of them —
You would use softmax. If it's a binary classification, or a multi-label classification where you're predicting multiple things, you would use sigmoid. If it's a regression, you would often have nothing at all — although we learned in last night's deep learning course that sometimes you can use sigmoid there as well. So those are basically the main options for the final layer. For the hidden layers you pretty much always use ReLU, but there is another one you can pick which is kind of interesting, called leaky ReLU, and it looks like this: if it's above zero, it's y = x, and if it's below zero, it's y = 0.1x. So it's very similar to ReLU, but rather than being equal to zero below zero, it's something close to that. So they're the main two: ReLU and leaky ReLU. There are various others, but they're things that just look very close to those. For example, there's something called ELU, which is quite popular — but the details don't matter too much, honestly. ELU looks like this but is slightly more curvy in the middle. And it's not generally something you pick based on the dataset; it's more that over time we just find better activation functions. So two or three years ago everybody used ReLU; a year ago pretty much everybody used leaky ReLU; today, I guess, most people are starting to move towards ELU. But honestly, the choice of activation function doesn't matter terribly much, and people have actually shown that you can use pretty arbitrary nonlinear activation functions — even a sine wave — and it still works. So, given what we've done today, showing how to create this network with no hidden layers, turning it into that network — which is 96%-ish accurate — will be trivial, and in fact is something you should probably try to do during the week —
that is, to create that version yourself. Okay. So now that we've got something where we can take our network, pass in our variable, and get back some predictions, that's basically all that happened when we called fit — so we're going to see how that approach can be used to create stochastic gradient descent. One thing to note: to turn the predicted probabilities into a prediction of which digit it is, we would need to use argmax. Unfortunately, PyTorch doesn't call it argmax; instead, PyTorch just calls it max, and max returns two things: the actual max across this axis — so, across the columns — and, as the second thing, the index of that maximum. So the equivalent of argmax is to call max and then grab the element at index one. So there are our predictions. If this were NumPy, we would instead use np.argmax. So here are the predictions from our hand-created logistic regression, and in this case it looks like we got all but one correct. The next thing we're going to try to get rid of, in terms of using libraries, is the matrix multiplication operator — we're going to write that by hand. In this next part we're going to learn about something which is going to seem like a minor little programming idea, but it's actually going to turn out to be, at least in my opinion, the most important programming concept we'll teach in this course, and possibly the most important programming concept in all the things you need to build machine learning algorithms: the idea of broadcasting. I'll show the idea by example. If we create an array of 10, 6, −4 and an array of 2, 8, 7, and then add the two together, it adds each of the components of those two arrays in turn — we call that element-wise. In other words, we didn't have to write a loop. Back in the old days we would have had to loop
through each one, adding them up and then concatenating the results together. We don't have to do that today; it happens for us automatically. So in NumPy we automatically get element-wise operations, and we can do the same thing with PyTorch: in fastai we just add a little capital T to turn something into a PyTorch tensor, and if we add those together — exactly the same thing. So element-wise operations are pretty standard in these kinds of libraries. It's interesting not just because we don't have to write the for loop, but because of the performance implications. The first is that a for loop would happen in Python. Even when you use PyTorch, it still runs a Python for loop as a Python for loop — it has no way of optimizing it — and a for loop in Python is something like 10,000 times slower than in C. I can't remember if it's 1,000 or 10,000. The second problem is that you don't just want it optimized in C; you want C to take advantage of something all of your CPUs do called SIMD — single instruction, multiple data. Your CPU is capable of taking, say, eight things at a time in a vector and adding them to another vector of eight things in a single CPU instruction. So if you can take advantage of SIMD, you're immediately eight times faster — depending on how big the data type is, it might be four, might be eight. The other thing you've got in your computer is multiple cores — you've probably got about four of those. So if you're using SIMD you're eight times faster, and if you can use multiple cores as well, you're 32 times faster — and doing that in C, you might be something like 32 × 1,000 times faster than naive Python.
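The element-wise addition just described, using the same numbers as the example, alongside the old loopy way it replaces:

```python
import numpy as np

a = np.array([10, 6, -4])
b = np.array([2, 8, 7])

# element-wise addition: no Python for loop; the loop runs in optimized C
# (and can use SIMD and multiple cores), which is why it's so much faster
c = a + b      # -> array([12, 14,  3])

# the old, loopy way — same answer, but the loop happens in slow Python
slow = np.array([a[i] + b[i] for i in range(len(a))])
assert (slow == c).all()
```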
And the nice thing is that when we do it this way, it takes advantage of all of those things. Better still, if you do it in PyTorch and your data was created with .cuda() to stick it on the GPU, then your GPU can do about ten thousand things at a time — so that'll be another hundred or so times faster than C. So this is critical to getting good performance: you have to learn how to write loopless code by taking advantage of these element-wise operations. And it's a lot more than just plus. I could also use less-than, and that's going to return 0, 1, 1 — or, if we go back to NumPy, False, True, True. And you can use this to do all kinds of things without looping. For example, I could now multiply that by a, and here are all of the values of a wherever they're less than b. Or we could take the mean: this is the percentage of values in a that are less than b. So there's a lot you can do with this simple idea. But to take it further than just element-wise operations, we're going to have to go the next step, to something called broadcasting. So let's take a five-minute break, come back at 2:17, and we'll talk about broadcasting.

Broadcasting. This is the definition of broadcasting from the NumPy documentation, and I'm going to come back to it in a moment rather than reading it now. Let's start by looking at an example of broadcasting. So a is an array with one dimension — also known as a rank 1 tensor, also known as a vector. We can say a > 0. So here we have a rank 1 tensor and a rank 0 tensor. A rank 0 tensor is also called a scalar; a rank 1 tensor is also called a vector. And we've got an operation between the two. Now, you've probably done this a thousand times without even noticing it's kind of weird: you've got these things of different ranks and different sizes. So what is it actually doing?
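Those comparison tricks, spelled out — assuming a and b are the example arrays from a moment ago:

```python
import numpy as np

a = np.array([10, 6, -4])
b = np.array([2, 8, 7])

gt0  = a > 0           # rank 1 tensor vs. a scalar: the 0 is broadcast
mask = a < b           # element-wise comparison -> [False, True, True]
kept = a * (a < b)     # values of a wherever a < b, zero elsewhere
frac = (a < b).mean()  # fraction of values in a that are less than b
```

`kept` comes out as `[0, 6, -4]` and `frac` as 2/3 — the booleans act as 0s and 1s when you multiply or average them.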
What it's actually doing is taking that scalar and copying it here, here, here, and then going element-wise: 10 > 0, 6 > 0, −4 > 0, giving us back the three answers. That's called broadcasting. Broadcasting means copying one or more axes of my tensor to allow it to be the same shape as the other tensor. It doesn't really copy it, though. What it actually does is store an internal indicator that says "pretend this is a vector of three zeros", but rather than moving on to the next element, it goes back to where it came from. If you're interested in the specifics: it sets the stride on that axis to zero — a minor advanced concept for those who are curious. So we could do a + 1: it broadcasts the scalar 1 to be 1, 1, 1 and then does element-wise addition. We could do the same with a matrix: 2 times the matrix broadcasts the 2 to be 2, 2, 2, 2, 2, 2, 2, 2, 2 and then does element-wise multiplication. So that's our most simple version of broadcasting.

Here's a slightly more complex version. Here's an array called c — a rank 1 tensor — and here's our matrix m from before, our rank 2 tensor. We can add m + c. So what's going on here? 1, 2, 3; 4, 5, 6; 7, 8, 9 — that's m — and then c is 10, 20, 30. You can see what it's done is add c to each row: 11, 22, 33; 14, 25, 36; and so on. So it seems to have done the same kind of idea as broadcasting a scalar: it's made copies of c, treated those as a rank 2 matrix, and then done element-wise addition. That makes sense, right? Yes — can you pass that to Devin over there?
"Thank you. So looking at this example, it copies it down, making new rows. How would we do it if we wanted new columns instead?" I'm so glad you asked. Instead, we would do this: 10, 20, 30 — then copy it: 10, 20, 30; 10, 20, 30 — and treat that as our matrix. To get NumPy to do that, we need to pass in not a vector but a matrix with one column — a rank 2 tensor. Basically, it turns out that NumPy thinks of a rank 1 tensor, for these purposes, as if it were a rank 2 tensor representing a row — in other words, 1 × 3. So we want to create a tensor which is 3 × 1. There are a couple of ways to do that. One is np.expand_dims: you pass in an argument that says "please insert a length-1 axis here". In our case we want 3 × 1, so if we say expand_dims(c, 1), it changes the shape to (3, 1) — and if we look at it, it looks like a column. So if we now do that plus m, you can see it's doing exactly what we hoped: adding the column 10, 20, 30 to each column of m. Now, because the location of a unit axis turns out to be so important, it's really helpful to experiment with creating these extra unit axes and to know how to do it easily — and np.expand_dims isn't, in my opinion, the easiest way. The easiest way is to index into the tensor with a special index: None. What None does is create a new axis of length one in that location. So this adds a new axis of length one at the start; this adds one at the end; or why not do both? So if you think about it, a tensor which has, like, three
things in it could be of any rank you like — you can just add unit axes all over the place — and that way we can decide how we want our broadcasting to work. There's also a pretty convenient thing in NumPy called broadcast_to: it takes our vector and broadcasts it to a given shape, showing us what that would look like. So if you're ever unsure what's going on in some broadcasting operation, you can use broadcast_to. For example, rather than (3, 3) we could pass m.shape and see exactly what's going to happen before we add it to m; and if we turn it into a column first, that's what that looks like. Makes sense? So that's the intuitive definition of broadcasting, and now hopefully we can go back to the NumPy documentation and understand what it means: "Broadcasting describes how NumPy will treat arrays of different shapes" when we do some operation. "The smaller array is broadcast across the larger array" — by smaller array they mean the lower-rank tensor, broadcast across the higher-rank tensor — "so that they have compatible shapes. It vectorizes array operations" — vectorizing generally means using SIMD and the like, so that multiple things happen at the same time — "so that all the looping occurs in C" — but it doesn't actually make needless copies of data;
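The unit-axis and broadcast_to ideas above, collected into one runnable sketch:

```python
import numpy as np

c = np.array([10, 20, 30])
m = np.arange(1, 10).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]]

col = c[:, None]                     # same result as np.expand_dims(c, 1)
assert col.shape == (3, 1)

by_rows = m + c                      # c acts as a row: added to each row
by_cols = m + col                    # c as a column: added to each column

# broadcast_to shows what a vector will look like in a given operation
as_rows = np.broadcast_to(c, m.shape)
as_cols = np.broadcast_to(col, m.shape)
```

`by_rows` starts `[11, 22, 33]`, while `by_cols` has first column `[11, 24, 37]` — same vector, completely different result, purely because of where the unit axis sits.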
it just acts as if it had. So there's our definition. Now, in deep learning you very often deal with tensors of rank 4 or more, and you very often combine them with tensors of rank 1 or 2, and trying to rely on intuition alone to do that correctly is nearly impossible — so you really need to know the rules. Here are the rules. Here's m.shape; here's c.shape. The rule is: compare the shapes of the two tensors element-wise, one dimension at a time, starting at the end — the trailing dimensions — and working towards the front. Two dimensions are compatible when one of two things is true: they are equal, or one of them is 1. So let's check whether m and c are compatible. m is (3, 3); c is (3,). We start with the trailing dimensions: are they compatible? They're equal, so yes. Let's go to the next one. Oh — c is missing something. What happens if something is missing? We insert a 1. That's the rule. Are these compatible? One of them is 1 — yes. And now you can see why NumPy treats a one-dimensional array as if it were a rank 2 tensor representing a row: because we insert the 1 at the front. So here's something you very commonly have to do: you start with an image — say 256 pixels by 256 pixels by 3 channels — and you want to subtract the mean of each channel. So you've got 256 × 256 × 3 and you want to subtract something of length 3. Can you do that? Absolutely: 3 and 3 are compatible because they're equal; 256 and nothing is compatible — it's going to insert a 1; and 256 and nothing is compatible.
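That channel-mean example can be written directly — here with a hypothetical random image standing in for a real one:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.uniform(size=(256, 256, 3))    # a made-up RGB image
channel_means = img.mean(axis=(0, 1))    # shape (3,): one mean per channel

# (256, 256, 3) minus (3,): the rules align 3 with 3, insert 1s in front,
# and the means broadcast over both spatial axes -- no loop over channels
centered = img - channel_means
```

After this, each channel of `centered` has mean zero, and no Python loop ever ran.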
It's going to insert a 1 there too. So the means are going to be broadcast over all of this axis, and then that whole thing broadcast over this axis, and we end up with an effective 256 × 256 × 3 tensor here. Interestingly, very few people in the data science or machine learning communities understand broadcasting, and the vast majority of the time — for example, when I see people doing preprocessing for computer vision, like subtracting the mean — they write loops over the channels. And I think: it's so handy not to have to do that, and it's often so much faster not to have to do that. So if you get good at broadcasting, you'll have a super useful skill that very, very few people have. And it's an ancient skill — it goes all the way back to the days of APL. APL dates from the late 1950s and stands for "A Programming Language". Kenneth Iverson wrote a paper called "Notation as a Tool of Thought", in which he proposed a new math notation, arguing that if we used it, it would give us new tools for thought and allow us to think things we couldn't before. One of his ideas was broadcasting — not as a computer programming tool, but as a piece of math notation. He ended up implementing this notation as a tool of thought as the programming language APL, and his son went on to develop it further into a piece of software called J — which is basically what you get when you put 60 years of very smart people to work on this idea. With these programming languages you can express very complex mathematical ideas, often in just a line or two of code. So it's great that we have J, but it's even greater that these ideas have found their way into the languages we all use — in Python, the NumPy and PyTorch libraries. These are not just little niche ideas.
They're fundamental ways to think about math and to do programming. Let me give an example of this kind of notation as a tool of thought. Look here: we've got c, and here we've got c[None] — notice there are now two square brackets, so this is a one-row, rank 2 tensor — and here it is as a little column. So what does multiplying the column by the row give us? Have a think about it. Anybody want to have a go and talk through their thinking? Okay, can we pass the box just over there? Thank you. "A kind of outer product?" Yes, absolutely — take us through your thinking. "The diagonal elements are the squares: 10 times 10, 20 times 20, 30 times 30. And if you multiply the first row by the column, you get the first row of the matrix — so finally we get a 3-by-3 matrix." Yeah. And to think of this in terms of the broadcasting rules: we're taking this column, of shape (3, 1), and this row, of shape (1, 3). To make them compatible under our broadcasting rules, this one has to be duplicated three times to match this, and this one has to be duplicated three times to match that. So now I've got two matrices to take an element-wise product of, and as you say, there's our outer product. Now, the interesting thing is that suddenly the outer product is not a special mathematical case but just a specific instance of the general idea of broadcasting — so we can do an "outer plus", or an "outer greater-than", or whatever. Suddenly we've got a concept we can use to build new ideas, and then we can start to experiment with those new ideas. And interestingly, NumPy actually uses this itself sometimes — for example, if you want
to create a grid, this is how NumPy does it. Actually — sorry, let me show you this way. If you want to create a grid, this is how NumPy does it: it actually returns 0, 1, 2, 3, 4 and 0, 1, 2, 3, 4 — one is a column, one is a row. So we could say: okay, that's x-grid and y-grid, and now you could do something like — well, we could obviously go like that — and suddenly we've expanded it out into a grid. So it's kind of interesting how some of these simple little concepts get built on and built on and built on. If you use something like APL or J, there's a whole environment of layer upon layer of this; we don't have such a deep environment in NumPy, but you can certainly see these ideas of broadcasting coming through in simple things like how NumPy creates a grid.

So that's broadcasting, and what we can do with it now is implement matrix multiplication ourselves. Now, why would we want to do that? Well, obviously we don't — matrix multiplication has already been handled perfectly nicely for us by our libraries. But very often you'll find, in all kinds of areas of machine learning and particularly deep learning, that there are particular types of linear function you want that aren't quite done for you. For example, there are whole areas called tensor regression and tensor decomposition, which are being developed a lot at the moment; they're about taking higher-rank tensors and turning them into combinations of rows, columns, and faces, and it turns out that once you can do this, you can deal with really high-dimensional data structures with not much memory and not much computation time. For example, there's a really terrific library called TensorLy which does a whole lot of this kind of stuff
for you. So it's a really, really important area; it covers all of deep learning and lots of modern machine learning in general. Even though you're not going to define matrix multiplication yourself, you're very likely to want to define some other, slightly different tensor product — so it's really useful to understand how to do that.

So let's go back and look at our 2-D array and 1-D array — rank 2 tensor and rank 1 tensor. Remember, we can do a matrix multiplication using the @ sign, or the old way, np.matmul. What that's actually doing is saying: 1 × 10 + 2 × 20 + 3 × 30 = 140, and then doing the same thing for each row — the next one, and the next one — to get our result. You could do that in Torch as well, and we could make it a little shorter — same thing. But this other thing — the element-wise product — is not matrix multiplication. What is it? Element-wise — specifically, we've got a matrix and a vector, so broadcasting. Good: it's element-wise with broadcasting. But notice that the numbers it created — 10, 40, 90 — are exactly the three numbers I needed to calculate the first piece of my matrix multiplication. In other words, if we sum this over the columns — which is axis=1 — we get our matrix-vector product. So we can do this stuff without special help from our library.

So now let's expand this out to a matrix-matrix product. A matrix-matrix product looks like this — there's a great site called matrixmultiplication.xyz which shows what happens when we multiply two matrices; that's what matrix multiplication is, operationally speaking. In other words, what we just did there: we first took the first column with the first row to get this one, and
then we took the second column with the first row to get that one. So we're basically doing the thing we just did — the matrix-vector product — twice: once with this column and once with that column, and then concatenating the two together. So we can now go ahead and do that: m times the first column, then sum; m times the second column, then sum; and there are the two columns of our matrix multiplication. Now, I didn't want to make our code too messy, so I'm not actually going to use that — but we have it there now. If we wanted to, we wouldn't need Torch or NumPy matrix multiplication any more: we've got our own, using nothing but element-wise operations, broadcasting, and sum.

So this is our logistic-regression-from-scratch class again — I've just copied it here. Here's where we instantiate the object and copy it to the GPU; we create an optimizer, which we'll learn about in a moment, and we call fit. The goal is now to repeat this without needing to call fit. To do that, we're going to need a loop which grabs a mini-batch of data at a time, and with each mini-batch we need to pass it to the optimizer and say: please try to come up with a slightly better set of predictions for this mini-batch. As we learned, in order to grab a mini-batch of the training set at a time, we have to ask the model data object for the training data loader and wrap it in iter() to create an iterator — a generator — and that gives us our data loader.
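The from-scratch products described a little earlier — outer product, matrix-vector, and matrix-matrix via nothing but broadcasting and sum — can be collected into one short NumPy sketch (the second matrix n is just an illustrative example):

```python
import numpy as np

c = np.array([10, 20, 30])
m = np.arange(1, 10).reshape(3, 3)      # [[1,2,3],[4,5,6],[7,8,9]]

# outer product: column (3,1) times row (1,3), both broadcast to (3,3)
outer = c[:, None] * c[None, :]

# matrix-vector product: broadcast-multiply, then sum over the columns
mv = (m * c).sum(axis=1)
assert (mv == m @ c).all()

# matrix-matrix product: the matrix-vector trick once per column of n
n = np.array([[1, 2], [3, 4], [5, 6]])
mm = np.stack([(m * n[:, j]).sum(axis=1) for j in range(n.shape[1])], axis=1)
assert (mm == m @ n).all()
```

The asserts check our hand-rolled versions against @ — element-wise ops, broadcasting, and sum really are enough to rebuild matmul.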
Okay, so PyTorch calls this a data loader. We actually wrote our own fastai data loader, but it's basically the same idea. So the next thing we do is grab the x and y tensors — the next batch from our data loader — and wrap them in a Variable. Wrapping in a Variable says: I need to be able to take the derivative of the calculations that use this — because if I can't take the derivative, then I can't get the gradients and I can't update the weights. And I need to put it on the GPU, because my module is on the GPU. So we can now take that variable and pass it to the object we instantiated — our logistic regression. Remember, we can use our module as if it were a function, because that's how PyTorch works, and that gives us a set of predictions, as we've seen before. So now we can check the loss. We defined the loss as a negative log likelihood loss object; we're going to learn how that's calculated in the next lesson — for now, think of it like root mean squared error, but for classification problems. We can call that just like a function too. So you can see this very general idea in PyTorch: treat everything, ideally, as if it were a function. In this case we have a negative log likelihood loss object; we treat it like a function, passing in our predictions and our actuals — and again, the actuals need to be turned into a variable and put on the GPU, because the loss is specifically the thing we want to take the derivative of. So that gives us our loss, and there it is: 2.43. It's a variable, and because it's a variable, it knows how it was calculated — it knows it was calculated with this loss function.
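As an aside, roughly what a negative log likelihood loss computes can be sketched in NumPy — hypothetical random scores here, not the lesson's actual tensors:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(3, 10))      # made-up scores: 3 examples, 10 digits

# log-softmax: turn scores into log-probabilities per row
log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
y = np.array([3, 0, 7])           # made-up true digit labels

# negative log likelihood: mean of -log p(true class) over the batch
nll = -log_p[np.arange(3), y].mean()
```

The loss is small when the model assigns high probability to the correct class and large when it doesn't — which is the sense in which it plays the "RMSE for classification" role mentioned above.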
It knows that the predictions were calculated with this network, and that this network consists of these operations — so we can get the gradient automatically. To get the gradient, we call l.backward() — remember, l is the thing that contains our loss. .backward is something that's added to anything that's a variable: calling it says "please calculate the gradients". So that calculates the gradients and stores them — for each of the parameters that was used to calculate the loss, it stores a .grad; we'll see it later. We can then call optimizer.step() — and we're going to do this step manually shortly — which is the bit that says "please make the weights a little bit better". So what is optimizer.step doing? Say you had a really simple function like this. What the optimizer does is say: okay, let's pick a random starting point and calculate the value of the loss — here's our parameter, here's our loss. Then let's take the derivative. The derivative tells us which way is down, so it tells us we need to go in that direction, and we take a small step. Then we take the derivative again and take a small step, derivative again, small step — until eventually we're taking such small steps that we stop. That's what gradient descent does. How big is a small step? Well, we basically take the derivative here.
So let's say the derivative there is, say, 8. We multiply it by a small number — say 0.01 — and that tells us what step size to take. That small number is called the learning rate, and it's the most important hyperparameter to set. If you pick too small a learning rate, your steps downhill will be tiny and it will take you forever. Too big a learning rate, and you jump too far — then you jump too far again, and you diverge rather than converge. We're not going to talk about how to pick a learning rate in this class, but in the deep learning class we show a specific technique that very reliably picks a very good one. So that's basically what's happening: we calculate the derivatives, and we call the optimizer to take a step — in other words, to update the weights based on the gradients and the learning rate. We should hopefully find that after doing that, we have a better loss than before. I just re-ran this and got a loss of 4.16; after one step it's 4.03. So it worked the way we hoped: based on this mini-batch, it updated all of the weights in our network to be a little better than they were, and as a result our loss went down.

So let's turn that into a training loop. We're going to go through a hundred steps: grab one more mini-batch from the data loader, calculate our predictions from the network, calculate our loss from the predictions and the actuals, and every ten iterations print the accuracy — just the mean of whether the predictions equal the actuals. One PyTorch-specific thing: you have to zero the gradients. Basically, you can have networks with lots of different loss functions where you might want to add all of the gradients together, so you have to tell PyTorch when to set the gradients back to zero.
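The whole gradient-descent-with-a-learning-rate story compresses into a few lines on a toy loss — a hand-written sketch, not PyTorch's optimizer:

```python
def loss(w):
    return (w - 3.0) ** 2        # a toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)       # its derivative

w = 0.0        # an arbitrary starting point
lr = 0.1       # the learning rate
for _ in range(100):
    w -= lr * grad(w)            # step downhill: weight -= lr * gradient
```

After 100 steps, w has converged to very nearly 3. Try lr = 1.1 instead and w shoots off to infinity — that's the "too big a learning rate, you diverge" failure mode described above.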
So this just says: set all the gradients to zero, calculate the gradients — that's the call to backward — and then take one step of the optimizer, updating the weights using the gradients and the learning rate. And once we run it, you can see the loss goes down and the accuracy goes up. So that's the basic approach, and next lesson we'll see what that does in detail. We're not going to look inside the derivative calculation — as I say, we're going to take it as a given — but here's basically what's happening. In any kind of deep network you have a function — say a linear function — and you pass its output into another function, which might be a ReLU, and you pass that output into another function, which might be another linear layer, and that into another ReLU, and so forth. So these deep networks are just functions of functions of functions, and you could write them mathematically like that. All backprop does — simplifying to the two-function version — is say: let u = f(x); then the derivative of g(f(x)) can be calculated with the chain rule as g′(u) · f′(x). And you can do the same thing for functions of functions of functions: when you apply a function to a function of a function, you can take the derivative just by taking the product of the derivatives of each of those
layers. In neural networks we call this back propagation. So when you hear "back propagation", it just means: use the chain rule to calculate the derivatives. And when you see a neural network defined sequentially, like here, literally all it means is: apply this function to the input, apply this function to that, apply this function to that, and so on — it's just defining a composition of function after function after function. So although we're not going to bother calculating the gradients ourselves, you can now see why PyTorch can do it: as long as it knows internally the derivative of "to the power of", the derivative of sine, the derivative of plus, and so forth, then our Python code here is just combining those things together — it just needs to know how to compose them with the chain rule, and away it goes.

Okay, I think we can leave it there for now. In the next class we'll see how to write our own optimizer, and then we'll have solved MNIST from scratch ourselves. See you then.
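As a postscript, the chain-rule claim from the backprop discussion can be checked numerically — an analytic g′(f(x)) · f′(x) against a finite-difference derivative of the composed function:

```python
import numpy as np

f  = np.sin                   # inner function
fp = np.cos                   # its derivative
g  = lambda u: u ** 2         # outer function
gp = lambda u: 2 * u          # its derivative

x = 0.7
# chain rule: d/dx g(f(x)) = g'(f(x)) * f'(x)
analytic = gp(f(x)) * fp(x)

# central-difference derivative of the composed function as a check
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)

assert abs(analytic - numeric) < 1e-6
```

Autograd frameworks do essentially this bookkeeping for every primitive operation in the graph, multiplying the local derivatives together instead of approximating them numerically.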