So welcome back to part two of what was previously called Practical Deep Learning for Coders. Part two is not called that, as you will see; it's called Deep Learning from the Foundations. It's lesson eight because it's lesson eight of the full journey, lesson one of part two, or lesson eight mod seven, as we sometimes call it. I know a lot of you do every year's course and keep coming back; for those of you doing that, this will not look at all familiar. It's a very different kind of part two. We're really excited about it and hope you like it as well. The basic idea of Deep Learning from the Foundations is that we are going to implement much of the fastai library from the foundations. And I'll talk about exactly what I mean by foundations in a moment, but it basically means from scratch. So we'll be looking at basic matrix calculus, and creating a training loop from scratch, and creating an optimizer from scratch, and lots of different layers and architectures and so forth. And not just to create some kind of dumbed-down library that's not useful for anything, but to actually build from scratch something you can train cutting-edge, world-class models with. So that's the goal. We've never done it before. I don't think anybody's ever done this before. So I don't exactly know how far we'll get, but this is the journey that we're on. We'll see how we go. So in the process, we will be having to read and implement papers, because the fastai library is full of implemented papers. So you're not gonna be able to do this if you're not reading and implementing papers. Along the way, we'll be implementing much of PyTorch as well, as you'll see. We'll also be going deeper into solving some applications that are not fully baked into the fastai library yet, so it's gonna require a lot of custom work.
So things like object detection, sequence-to-sequence with attention, the Transformer and the Transformer-XL, CycleGAN, audio, stuff like that. We'll also be doing a deeper dive into some performance considerations, like doing distributed multi-GPU training, using the new just-in-time compiler (which we'll just call JIT from now on), CUDA and C++, stuff like that. So that's the first five lessons. And then the last two lessons are implementing some subset of that in Swift. So this is otherwise known as Impractical Deep Learning for Coders, because really none of this is stuff that you're gonna go and use right away. It's kind of the opposite of part one. Part one was like, oh, we've just spent 20 minutes on this, and you can now create a world-class vision classification model. This is not that, right? 'Cause you already know how to do that. And so back in the earlier years, part two used to be more of the same thing, but with more advanced types of model, more advanced architectures. But there's a couple of reasons we've changed this year. The first is that so many papers come out now, because this whole area has increased in scale so quickly, that I can't pick out for you the 12 papers to do in the next seven weeks that you really need to know, 'cause there's too many. And it's also kind of pointless, right? Because once you get into it, you realize that all the papers pretty much say minor variations on the same thing. So instead, what I wanna be able to do is show you the foundations that let you read the 12 papers you care about and realize, oh, that's just that thing with this minor tweak. And I now have all the tools I need to implement that and test it and experiment with it. So that's a really key issue in why we wanna go in this direction.
Also, we used to call part two Cutting Edge Deep Learning for Coders, but it's increasingly clear that the cutting edge of deep learning is really about engineering, not about papers. The difference between really effective people in deep learning and the rest is really about who can make things in code that work properly, and there's very few of those people. So really the goal of this part two is to deepen your practice, so you can understand the things that you care about, and build the things you care about, and have them work and perform at a reasonable speed. So that's where we're trying to head to. And so it's impractical in the sense that none of these are things that you're gonna go, probably, straight away and say, here's this thing I built. Particularly Swift, because with Swift we're actually gonna be learning a language and a library that, as you'll see, are far from ready for use. And I'll describe why we're doing that in a moment. So part one of this course was top-down, right? So that you got the context you needed to understand, you got the motivation you needed to keep going, and you got the results you needed to make it useful. But bottom-up is useful too, and we started doing some bottom-up at the end of part one. When you've built everything from the bottom yourself, then you can see the connections between all the different things. You can see they're all variations of the same thing. And then you can customize: rather than picking algorithm A or algorithm B, you create your own algorithm to solve your own problem, doing just the things you need it to do. And then you can make sure that it performs well, that you can debug it, profile it, maintain it, because you understand all of the pieces. So normally when people say bottom-up in this field, they mean bottom-up with math. I don't mean that. I mean bottom-up with code, right?
So today, step one will be to implement matrix multiplication from scratch in Python. It's because bottom-up with code means that you can experiment really deeply on every part of every bit of the system. You can see exactly what's going in, exactly what's coming out, and you can figure out why your model's not training well, or why it's slow, or why it's giving the wrong answer, or whatever. So why Swift? What are these two lessons about? And to be clear, we are only talking about the last two lessons, right? Our focus, as I'll describe, is still very much Python and PyTorch. But there's something very exciting going on. The first exciting thing is this guy's face you see here, Chris Lattner. Chris is unique, as far as I know, in being somebody who has built what is, I think, the world's most widely used compiler framework, LLVM. He's built the default C and C++ compiler for the Mac, being Clang, and he's built what's probably the world's fastest growing fairly new computer language, being Swift. And he's now dedicating his life to deep learning, right? So we haven't had somebody from that world come into our world before. And when you actually look at the internals of something like TensorFlow, it looks like something that was built by a bunch of deep learning people, not by a bunch of compiler people, right? And so I've been wanting for over 20 years for there to be a good numerical programming language that was built by somebody that really gets programming languages, and it's never happened, you know? So in the early days it was LISP-STAT in Lisp, and then it was R, and then it was Python. None of these languages were built to be good at data analysis. They weren't built by people that really deeply understood compilers. They certainly weren't built for today's modern, highly parallel processor situation we're in. But Swift was. Swift is, right?
And so we've got this unique situation where, for the first time, a really widely used language, a really well designed language, built from the ground up, is actually being targeted towards numeric programming and deep learning. So there's no way I'm missing out on that boat, and I don't want you to miss out on it either, right? I should mention there's another language which you could possibly put in there, which is a language called Julia, which has maybe as much potential, but it's about ten times less used than Swift. It doesn't have the same level of community, but I would still say it's super exciting. So I'd say maybe there's two languages which you might want to seriously consider picking one of and spending some time with. Julia's actually further along; Swift is very early days in this world, but that's one of the things I'm excited about for it. So I actually spent some time over the Christmas break digging into numeric programming in Swift, and I was delighted to find that I could create code from scratch that was competitive with the fastest hand-tuned vendor linear algebra libraries. Even though I was, and remain, pretty incompetent at Swift, I found it was a language that was really delightful. It was expressive, it was concise, but it was also very performant, and I could write everything in Swift, rather than having to get down to some layer where it's like, oh, that's CUDA now, or that's MKL now, or whatever. So that got me pretty enthusiastic. And so the really exciting news, as I'm sure you've heard, is that Chris Lattner himself is gonna come and join us for the last two lessons, and we're gonna teach Swift for deep learning together. So Swift for deep learning means Swift for TensorFlow; that's specifically the library that Chris and his team at Google are working on. We will call that S4TF when I write it down, because I can't be bothered typing Swift for TensorFlow every time.
Swift for TensorFlow has some pros and cons, PyTorch has some pros and cons, and interestingly, they're the opposite of each other. PyTorch's and Python's pros: you can get stuff done right now, with this amazing ecosystem, fantastic documentation and tutorials. It's just a really great practical system for solving problems. And to be clear, Swift for TensorFlow is not. It's not any of those things right now. It's really early, almost nothing works. You have to learn a whole new language if you don't know Swift already. There's very little ecosystem. And I'm not really talking about Swift in particular here, but about Swift for TensorFlow, and Swift for deep learning, and even Swift for numeric programming. I was kind of surprised when I got into it to find there was hardly any documentation about Swift for numeric programming, even though I was pretty delighted by the experience. People have had this view that Swift is for iPhone programming. I guess that's kind of how it was marketed, right? But actually it's an incredibly well-designed, incredibly powerful language. And then TensorFlow, I mean, to be honest, I'm not a huge fan of TensorFlow in general. I mean, if I was, we wouldn't have switched away from it. But it's getting a lot better; TensorFlow 2 is certainly improving. And the bits of it I particularly don't like are largely the bits that Swift for TensorFlow will avoid. But I think long-term, there are things happening like this fantastic new compiler project called MLIR, which Chris is also co-leading, which I think actually has the potential long-term to allow Swift to replace most of the yucky bits, or maybe even all of the yucky bits, of TensorFlow with stuff where Swift is actually talking directly to LLVM.
You'll be hearing a lot more about LLVM in the last two lessons, but basically it's the compiler infrastructure that kind of everybody uses: Julia uses it, Clang uses it. And Swift is almost this thin layer on top of it, where when you write stuff in Swift, it's really easy for LLVM to compile it down to super fast optimized code, which is like the opposite of Python. With Python, as you'll see today, we almost never actually write Python code. We write code in Python that gets turned into some other language or library, and that's what gets run. And this impedance mismatch between what I'm trying to write and what actually gets run makes it very hard to do the kind of deep dives that we're gonna do in this course, as you'll see. It's kind of a frustrating experience. So I'm excited about getting involved in these very early days for impractical deep learning in Swift for TensorFlow, because it means that me, and those of you that wanna follow along, can be the pioneers in something that I think is gonna take over this field. We'll be the first in there. We'll be the ones that understand it really well. And in your portfolio, you can actually point at things and say, that library that everybody uses, I wrote that. That piece of documentation on the Swift for TensorFlow website, I wrote that. That's the opportunity that you have. So let's put that aside for the next five weeks, and let's try to create a really high bar for the Swift for TensorFlow team to have to try to reimplement in six weeks' time, right? We're gonna try to implement as much of fastai and as many parts of PyTorch as we can, and then see if the Swift for TensorFlow team can help us build that in Swift in five weeks' time. So the goal is to recreate fastai from the foundations, and much of PyTorch, like matrix multiplication, a lot of torch.nn, torch.optim, Dataset, DataLoader, from the foundations.
And this is the game we're gonna play. The game is that we're only allowed to use these bits. We're allowed to use pure Python, anything in the Python standard library, and any non-data-science modules, so, like, a requests library for HTTP or whatever. We can use PyTorch, but only for creating arrays, random number generation, and indexing into arrays. We can use the fastai.datasets library, because that's the thing that has access to MNIST and stuff, so we don't have to worry about writing our own HTTP stuff. And we can use matplotlib, so we don't have to write our own plotting library. That's it. That's the game. So we're gonna try and recreate all of this from that. And then the rules are that each time we have replicated some piece of fastai or PyTorch from the foundations, we can then use the real version if we want to, okay? So that's the game we're gonna play. What I discovered as I started doing that is that I started actually making things a lot better than fastai. So I'm now realizing that fastai version one is kind of a disappointment, because there were a whole lot of things I could have done better. And you'll find the same thing. As you go along this journey, you'll find decisions that I made, or the PyTorch team made, or whatever, where you'll think, what if they'd made a different decision there? And you can maybe come up with more examples of things that we could do differently, right? So why would you do this? Well, the main reason is so that you can really experiment, right? So you can really understand what's going on in your models, what's really going on in your training. And you'll actually find that in the experiments that we're gonna do in the next couple of classes, we're gonna come up with some new insights. If you can create something from scratch yourself, you know that you understand it.
And then once you've created something from scratch and you really understand it, then you can tweak everything, right? You suddenly realize that there's not this object detection system and this architecture and that optimizer; they're all a kind of semi-arbitrary bunch of particular knobs and choices, and it's pretty likely that your particular problem would want a different set of knobs and choices. So you can change all of these things. For those of you looking to contribute to open source, to fastai or to PyTorch, you'll be able to, right? Because you'll understand how it's all built up. You'll understand which bits are working well, which bits need help. You'll know how to contribute tests or documentation or new features, or create your own libraries. And for those of you interested in going deeper into research, you'll be implementing papers, which means you'll be able to correlate the code that you're writing with the paper that you're reading. And if you're a poor mathematician like I am, then you'll find that you'll be getting a much better understanding of papers that you might otherwise have thought were beyond you. And you'll realize that all those Greek symbols actually just map to pieces of code that you're already very familiar with. So there were a lot of opportunities in part one to blog and to do interesting things, but the opportunities are much greater now. In part two, you can be doing homework that's actually at the cutting edge, actually doing experiments people haven't done before, making observations people haven't made before, because you're getting to the point where you're a more competent deep learning practitioner than the vast majority out there, and we're looking at stuff that other people haven't looked at before. So please try doing lots of experiments, particularly in your domain area, and consider writing things down, right? Even if, especially if, it's not perfect, right?
So write stuff down for the you of six months ago. That's your audience. Okay, so I am gonna be assuming that you remember the contents of part one. Here are the contents of part one. In practice, it's very unlikely you remember all of these things, because nobody's perfect. So what I'm actually expecting you to do is, as I'm going on about something and you're thinking, I don't know what he's talking about, that you'll go back and watch the video about that thing. Don't just keep blasting forwards, because I'm assuming that you already know the content of part one. Particularly if you're less confident about the second half of part one, where we went a little bit deeper into what's an activation really, and what's a parameter really, and exactly how does SGD work. Particularly in today's lesson, I'm gonna assume that you really get that stuff. So if you don't, then go back and re-watch those videos. Go back to that SGD-from-scratch material and take your time. I've kind of designed this course to keep most people busy up until the next course, so feel free to take your time and dig deeply. The most important thing, though, is we're gonna try and make sure that you can train really good models. And there are three steps to training a really good model. Step one is to create something with way more capacity than you need, and basically no regularization, and overfit. So overfit means what? It means that your training loss is lower than your validation loss? No, no, it doesn't mean that. Remember, it doesn't mean that. A well-fit model will almost always have training loss lower than validation loss. Remember that overfit means you have actually personally seen your validation error getting worse. Until you see that happening, you're not overfitting. So step one is overfit, and then step two is reduce overfitting, and then step three... okay, there is no step three.
Well, I guess step three is to visualize the inputs and outputs and stuff like that, right? Just to experiment and see what's going on. So one is pretty easy, normally. Two is the hard bit. It's not really that hard, but basically these are the five things that you can do, in order of priority. If you can get more data, you should. If you can do more data augmentation, you should. If you can use a more generalizable architecture, you should. And then, if all those things are done, then you can start adding regularization like dropout or weight decay, but remember, at that point, you're reducing the effective capacity of your model, so it's less good than the first three things. And then, last of all, reduce the architecture complexity. Most people, most beginners especially, start with reducing the complexity of the architecture, but that should be the last thing that you try, unless your architecture is so complex that it's too slow for your problem, okay? So that's a summary of what we want to be able to do, that we learned about in part one. Okay, so we're gonna be reading papers, which we didn't really do in part one. And papers look something like this, which, if you're anything like me, is terrifying. And I'm not gonna lie, it's still the case that when I start looking at a new paper, every single time I think, I'm not smart enough to understand this. I just can't get past that immediate reaction, because I just look at this stuff and go, that's not something that I understand. But then I remember: this is the Adam paper, and you've all seen Adam implemented in one cell of Microsoft Excel, right? When it actually comes down to it, every time I do get to the point where I understand a paper, where I've implemented it, I go, oh my God, that's all it is, right? So a big part of reading papers, especially if you're less mathematically inclined than I am, is just getting past the fear of the Greek letters.
I'll say something else about Greek letters. There are lots of them, and it's very hard to read something that you can't actually pronounce, because you're just saying to yourself, oh, squiggle bracket one plus squiggle, one G squiggle, one minus squiggle. With all the squiggles, you just get lost, right? So, believe it or not, it actually really helps to go and learn the Greek alphabet, so you can pronounce alpha times one plus beta one. Suddenly you can start talking to other people about it. You can actually read it out loud. It makes a big difference. So learn to pronounce the Greek letters. Note that the people that write these papers are generally not selected for their outstanding clarity of communication, right? So you will often find that there'll be a blog post or a tutorial that does a better job of explaining the concept than the paper does. So don't be afraid to go and look for those as well. But do go back to the paper, right? In the end, the paper's the one that's hopefully got it mainly right. One of the tricky things about reading papers is that the equations have symbols, and you don't know what they mean, and you can't Google for them. So a couple of good resources: if you see symbols you don't recognize, Wikipedia has an excellent list of mathematical symbols page that you can scroll through. And even better, Detexify is a website where you can draw a symbol you don't recognize, and it uses the power of machine learning to find similar symbols. There are lots of symbols that look a bit the same, so you will have to use some level of judgment, right? But the thing that it shows here is the LaTeX name, and you can then Google for the LaTeX name to find out what that thing means. Okay, so let's start. Here's what we're gonna do over the next couple of lessons. We're going to try to create a pretty competent, modern CNN model. And we actually already have this bit, because we did that in the last course, right?
We already have our layers for creating a ResNet, and we actually got a pretty good result. So we just have to do all these things, okay, to get us from here to here. That's just the next couple of lessons. After that we're gonna go a lot further, right? So today we're gonna try to get to at least the point where we've got the backward pass going. So remember, we're gonna build a model that takes an input array, and we're gonna try and create a simple fully connected network, right? So it's gonna have one hidden layer. So we're gonna start with some input, do a matrix multiply, do a ReLU, do a matrix multiply, do a loss function, okay? And so that's a forward pass, and that will tell us our loss. And then we will calculate the gradients of the loss with respect to the weights and biases, in order to basically multiply them by some learning rate, which we will then subtract off the parameters to get our new set of parameters, and we'll repeat that lots of times. So to get to our fully connected backward pass, we will need to first of all have the fully connected forward pass, and the fully connected forward pass means we will need to have some initialized parameters, and we'll need a ReLU, and we will also need to be able to do matrix multiplication. So let's start there. So let's start with the 00_exports notebook. And what I'm showing you here is how I'm gonna go about building up our library in Jupyter notebooks. A lot of very smart people have assured me that it is impossible to do effective library development in Jupyter notebooks, which is a shame, because I've built a library in Jupyter notebooks. So anyway, people will often tell you things are impossible, but I will tell you my point of view, which is that I've been programming for over 30 years, and in the time I've been using Jupyter notebooks to do my development, I would guess I'm about two to three times more productive.
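As an aside, that step, matrix multiply, ReLU, matrix multiply, loss, then gradients and a learning-rate update, can be sketched in a few lines of PyTorch. Note this sketch leans on autograd, which is exactly the machinery the course goes on to rebuild by hand, and all the sizes, scalings, and names here are my own, not the notebook's:

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 784)   # a tiny made-up batch of inputs
y = torch.randn(5, 10)    # made-up targets, just so the shapes work

# one hidden layer: 784 -> 50 -> 10, with scaled-down random weights
w1 = (torch.randn(784, 50) * 0.03).requires_grad_()
b1 = torch.zeros(50, requires_grad=True)
w2 = (torch.randn(50, 10) * 0.1).requires_grad_()
b2 = torch.zeros(10, requires_grad=True)

def forward(x):
    h = (x @ w1 + b1).clamp(min=0.)  # matrix multiply, then ReLU
    return h @ w2 + b2               # second matrix multiply

loss = ((forward(x) - y) ** 2).mean()  # MSE stands in for the loss function
loss.backward()                        # gradients of loss w.r.t. the parameters

with torch.no_grad():
    for p in (w1, b1, w2, b2):
        p -= 1e-2 * p.grad  # multiply gradient by learning rate, subtract
        p.grad.zero_()
```

Repeating that last block many times over real batches is the whole training loop we're building up to.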
I've built a lot more useful stuff in the last two or three years than I did beforehand. I'm not saying you have to do things this way either, but this is how I develop, and hopefully you'll find some of this useful as well. So I'll show you how. We need to do a couple of things. We can't just create one giant notebook with our whole library. Somehow we have to be able to pull out those little gems, those bits of code where we think, oh, this is good, let's keep this. We have to be able to pull that out into a package that we reuse. So in order to tell our system that here is a cell that I want you to keep and reuse, I use this special comment, #export, at the top of the cell. And then I have a program called notebook2script.py which goes through the notebook, finds those cells, and puts them into a Python module. So let me show you. So if I run this cell, okay. And then I head over, and notice I don't have to type all of 00_exports because I have tab completion, even for file names, in Jupyter notebook, so 00-tab is enough. And I could either run this here, or I could go back to my console and run it. So let's run it here. Okay, so that says it converted 00_exports.ipynb to nb_00.py. And what I've done is I've made it so that these things go into a directory called exp, for exported modules, and here is that nb_00.py, and there it is, right? So you can see, other than a standard header, it's got the contents of that one cell. So now I can import that at the top of my next notebook: from exp.nb_00 import *. And I can create a test that that variable equals that value. So let's see: it does, okay. And notice there are a lot of test frameworks around, but it's not always helpful to use them. Here we've created a test framework, or the start of one. I've created a function called test, which checks whether a and b pass this comparison function, by using assert.
And then I've created something called test_eq, which calls test, passing in a and b and operator.eq. So if they're not equal, we get an assertion error. Okay, so we've been able to write a test which, so far, has basically tested that our little module exporter thing works correctly. We probably wanna be able to run these tests somewhere other than just inside a notebook. So we have a little program called run_notebook.py: you pass it the name of a notebook, and it runs it. So I should save this one with our failing test, so you can see it fail. So the first time it passed, and then I make the failing test, and you can see, here it is: assertion error, and it tells you exactly where it happened. Okay, so we now have an automatable unit testing framework in our Jupyter notebook. I'll point out the contents of these two Python scripts; let's look at them. So the first one was run_notebook.py, which is our test runner. There is the entirety of it, right? So there's a thing called nbformat, so if you conda install nbformat, then it basically lets you execute a notebook, and it prints out any errors. So that's the entirety of that. You'll notice that I'm using a library called fire. Fire is a really neat library that lets you take any function, like this one, and automatically converts it into a command line interface. So here I've got a function called run_notebook, and then it says fire.Fire(run_notebook). So if I now go python run_notebook.py, then it says, oh, this function received no value for path, and shows the usage. So you can see that what it did was it converted my function into a command line interface. It's really great, and it handles things like optional arguments and classes, and it's super useful, particularly for this kind of Jupyter-first development, because you can grab stuff that's in Jupyter and turn it into a script, often by just copying and pasting the function, or exporting it, and then just adding this one line of code.
The other one, notebook2script.py, is not much more complicated. It's one screen of code, and again, the main thing here is to call fire, which calls this one function. And you'll see it basically uses json.load, because notebooks are JSON. The reason I mention this to you is that Jupyter notebook comes with this whole ecosystem of libraries and APIs and stuff like that, and on the whole, I hate them. It's just JSON; I find that just doing json.load is the easiest way. And specifically, I build my Jupyter notebook infrastructure inside Jupyter notebooks. So here's how it looks, right? Import json, load this file, and it gives you an array of cells, and there's the contents of 'source' for my first cell, right? So if you do wanna play around with doing stuff with Jupyter notebooks, it's a really great environment for automating stuff and running scripts on them and stuff like that. So there's that. All right, so that's the entire contents of our development infrastructure. We now have a test; let's make it pass again. One of the great things about having unit tests in notebooks is that when one does fail, you open up a notebook, which can have prose saying this is what this test does, it's implementing this part of this paper. You can see all the stuff above it that's setting up all the context for it, you can check each input and output, and it's a really great way to fix those failing tests, because you've got the whole, truly literate programming experience all around it. So I think that works great. Okay, so before we start doing matrix multiply, we need some matrices to multiply. So these are some of the things that are allowed by our rules. We've got some stuff that's part of the standard library; the fastai datasets library, to let us grab the data sets we need; some more standard library stuff; torch, which we're only allowed to use for indexing and array creation; and matplotlib. There you go. So let's grab MNIST.
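To make the json.load approach concrete, here's a sketch of how an exporter along those lines can pick out the #export cells. I build the notebook JSON in memory rather than reading a real .ipynb file, and the cell contents are invented for the example:

```python
import json

# Notebooks are just JSON: a dict with a "cells" list, where each cell
# stores its source as a list of lines.
nb_json = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["#export\n", "TEST = 'test'\n"]},
        {"cell_type": "markdown", "source": ["Some prose about the code\n"]},
        {"cell_type": "code", "source": ["print('scratch work, not exported')\n"]},
    ]
})

nb = json.loads(nb_json)  # with a real file you'd use json.load(open(path))
# Keep only code cells whose first line is the special #export marker.
exported = ["".join(c["source"]) for c in nb["cells"]
            if c["cell_type"] == "code"
            and c["source"] and c["source"][0].strip() == "#export"]
print(len(exported))  # 1
```

Writing those strings out to a module file in exp/ is then just ordinary file I/O; no notebook-specific library is needed anywhere.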
So to grab MNIST, we can use fastai datasets to download it, and then we can use the standard library's gzip to open it, and then we can pickle.load it. So in Python, the standard serialization format is called pickle, and this MNIST version on deeplearning.net is stored in that format. And so it basically gives us a tuple of tuples of data sets, like so: x_train, y_train, x_valid, y_valid. It actually contains NumPy arrays, but NumPy arrays are not allowed in our foundations, so we have to convert them into tensors. So we can just use the Python map to map the tensor function over each of these four arrays, to get back four tensors, okay? A lot of you will be more familiar with NumPy arrays than PyTorch tensors, but everything you can do with NumPy arrays you can also do with PyTorch tensors; but you can also do it on the GPU, and have all this nice deep learning infrastructure. So it's a good idea to get used to using PyTorch tensors, in my opinion. So we can now grab the number of rows and number of columns in the training set, and we can take a look. So here's MNIST, hopefully pretty familiar to you already. It's 50,000 rows by 784 columns, and the y data looks something like this. The y shape is just 50,000 rows, and the minimum and maximum of the dependent variable is zero to nine. So hopefully that all looks pretty familiar. So let's add some tests. So n should be equal to the shape of y, which should be equal to 50,000. The number of columns should be equal to 28 times 28, because that's how many pixels there are in MNIST, and so forth. And we're just using that test_eq function that we created just above. So now we can plot it. Okay, so we've got a float tensor, and we pass that to imshow after casting it to a 28 by 28. .view is really important. I think we saw it a few times in part one, but get very familiar with it. This is how we reshape our 784-long vector into a 28 by 28 matrix that's suitable for plotting.
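The gzip-plus-pickle dance looks roughly like this. To keep it runnable without downloading anything, I round-trip a made-up tuple with the same nested shape instead of the real mnist.pkl.gz, and use plain lists instead of NumPy arrays:

```python
import gzip, io, pickle

# Stand-in data shaped like ((x_train, y_train), (x_valid, y_valid)):
# two fake training images and one fake validation image, 784 pixels each.
fake = (([[0.0] * 784] * 2, [5, 0]),
        ([[0.0] * 784] * 1, [4]))

buf = io.BytesIO()
with gzip.open(buf, "wb") as f:  # write a gzip-compressed pickle, like mnist.pkl.gz
    pickle.dump(fake, f)

buf.seek(0)
with gzip.open(buf, "rb") as f:  # read it back: gzip.open, then pickle.load
    ((x_train, y_train), (x_valid, y_valid)) = pickle.load(f)

print(len(x_train), len(x_train[0]))  # 2 784
```

With the real file, you'd pass the downloaded path to gzip.open instead of a buffer, and then map(tensor, (x_train, y_train, x_valid, y_valid)) to get the four PyTorch tensors.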
Okay, so there's our data, and let's start by creating a simple linear model. So for a linear model, we're gonna need to basically have something where y equals ax plus b, and so our a will be a bunch of weights. So it's gonna need to be a 784 by 10 matrix, because we've got 784 coming in and 10 going out. So that's gonna allow us to take in our independent variable and map it to something which we compare to our dependent variable. And then for our bias, we'll just start with 10 zeros. Okay, so if we're gonna do y equals ax plus b, then we're gonna need a matrix multiplication. Almost everything we do in deep learning is basically matrix multiplication or a variant thereof — affine functions, as we call them. So you wanna be very comfortable with matrix multiplication. This cool website, matrixmultiplication.xyz, shows us exactly what happens when we multiply these two matrices. We take the first row of the first matrix and the first column of the second matrix, multiply them element-wise, and then we add them up, and that gives us that one. And now you can see we've got two going on at the same time, so that gives us two more, and then two more, and then the final one. And that's our matrix multiplication. Okay, so we have to do that, right? So we've got a few loops going on. We've got the loop of this thing scrolling down here. We've got the loop of these two rows — they're really columns, so we flip them around — and then we've got the loop of the multiply and add. So we're gonna need three loops, and so here's our three loops. And notice this is not gonna work unless the number of columns here and the number of rows here are the same. So let's grab the number of rows and columns of a and the number of rows and columns of b, and make sure that ac equals br, just to double check.
And then let's create something of size ar by bc, because the size of the result is gonna be ar by bc, fill it with zeros, and then have our three loops. And then right in the middle, let's do that, okay? So right in the middle, the result in c[i,j] accumulates a[i,k] times b[k,j]. And this is the vast majority of what we're gonna be doing in deep learning. So get very, very comfortable with that equation, because we're gonna be seeing it in three or four different variants of notation and style in the next few weeks — in the next few minutes, even. And it's got a few interesting things going on. This i here appears also over here. This j here appears also over here, and then the k in the loop appears twice. And look, it's gotta be the same number in each place, because this is the bit where we're multiplying together the element-wise things. So there it is. So let's create a nice small version. Grab the first five rows of the validation set, we'll call that m1, and grab our weight matrix, we'll call that m2. And the sizes are five by 784, because we just grabbed the first five rows, and 784 by 10. So these match, as they should. And so now we can go ahead and do that matrix multiplication, and it's done. And t1.shape is, as you would expect, a five row by 10 column output. And it took about a second. So it took about a second for five rows. Our data set, MNIST, is 50,000 rows. So it's gonna take about 50,000 seconds to do a single matrix multiplication in Python. So imagine doing MNIST where every layer for every pass took about 10 hours. Not gonna work, right? So that's why we don't really write things in Python. When we say Python is too slow, we don't mean 20% too slow. We mean thousands of times too slow. So let's see if we can speed this up by 50,000 times. Because if we could do that, it might just be fast enough.
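The three-loop version just described can be sketched as follows, with the ar/ac/br/bc naming from the lesson:

```python
import torch

def matmul(a, b):
    ar, ac = a.shape          # rows x columns of a
    br, bc = b.shape
    assert ac == br           # inner dimensions must match
    c = torch.zeros(ar, bc)   # result is ar x bc
    for i in range(ar):           # loop over rows of a
        for j in range(bc):       # loop over columns of b
            for k in range(ac):   # loop over the inner dimension
                c[i, j] += a[i, k] * b[k, j]
    return c

m1 = torch.randn(5, 784)   # first five "rows of the validation set"
m2 = torch.randn(784, 10)  # the weight matrix
t1 = matmul(m1, m2)
print(t1.shape)  # torch.Size([5, 10])
```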
So the way we speed things up is we start in the innermost loop and we make each bit faster. And the way to make Python faster is to remove Python. And the way we remove Python is by passing our computation down to something that's written in something other than Python, like PyTorch. Because PyTorch, behind the scenes, is using a library called ATen, okay? And so we wanna get this going down to the ATen library. So the way we do that is to take advantage of something called element-wise operations. You've seen them before. For example, if I have two tensors, a and b, both of length three, I can add them together. And when I add them together, it simply adds together the corresponding items. So that's called element-wise addition. Or I could do less than, in which case it's going to do element-wise less than. So what percentage of a is less than the corresponding item of b? (a < b).float().mean(). We can do element-wise operations on things not just of rank one; we can do it on a rank two tensor, also known as a matrix. So here's our rank two tensor, m. Let's calculate the Frobenius norm. How many people know about the Frobenius norm? Right, almost nobody. And it looks kind of terrifying, right? But actually it's just this: the matrix times itself, dot sum, dot sqrt. So here's the first time we're gonna start trying to translate some equations into code to help us understand these equations. So this says: when you see something like A with two sets of double lines around it and an F underneath, that means we are calculating the Frobenius norm. So anytime you see this — and you will, it actually pops up semi-regularly in deep learning literature — what it actually means is this function. As you probably know, capital sigma means sum. And this says we're gonna sum over two for loops. The first for loop will be called i and will go from one to n.
And the second for loop will be called j and will also go from one to n. And in these nested for loops, we're gonna grab something out of a matrix A at that position, i,j. We're gonna square it. And then we're gonna add all of those together, and then we'll take the square root. Okay, which is that. Now, I have something to admit to you. I can't write LaTeX. And yet I did create this Jupyter Notebook. So it looks a lot like I created some LaTeX, which is certainly the impression I like to give people sometimes. But the way I actually write LaTeX is I find somebody else who wrote it and then I copy it. And so the way you do this most of the time is you Google for Frobenius norm, you find the Wikipedia page for Frobenius norm, you click edit next to the equation, and you copy and paste it. Okay, so that's a really good way to do it. And chuck dollar signs, or even two dollar signs, around it — two dollar signs make it a bit bigger. So that's way one to get equations. Method two is, if it's in a paper on arXiv — did you know, on arXiv you can click on "Other formats" in the top right and then download source? And that will actually give you the original TeX source, and then you can copy and paste their LaTeX, right? So I'll be showing you a bunch of equations during these lessons, and I can promise you one thing: I wrote none of them by hand. So this one was stolen from Wikipedia. All right, so you now know how to implement the Frobenius norm from scratch in PyTorch. You could also have written it, of course, as m.pow(2). But that would be illegal under our rules, right? We're not allowed to use pow yet, so that's why we did it that way. Okay, so that's just doing the element-wise multiplication of a rank two tensor with itself. One times one, two times two, three times three, et cetera. Okay, so that is enough information to replace this loop, right?
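The Frobenius norm calculation just described, written without pow per our rules, can be sketched as:

```python
import torch

m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

# ||A||_F = sqrt( sum_i sum_j a_ij^2 ), written as an element-wise
# multiply of the matrix with itself, then sum, then square root:
fro = (m * m).sum().sqrt()
print(fro)  # sqrt(285)
```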
Cause this loop is just going through the first row of a and the first column of b and doing an element-wise multiplication and sum. So our new version is gonna have two loops, not three. Here it is. So this is all the same, right? But now we've replaced the inner loop, and you'll see that basically it looks exactly the same as before, but where it used to say k it now says colon. So in PyTorch and NumPy, colon means the entirety of that axis, right? So — Rachel, help me remember the order of rows and columns when we talk about matrices — what's the song? Row by column, row by column. Yeah, so that's the song. So i is the row number. Okay, so this is row number i, the whole row. And this is column number j, the whole column. So multiply all of row i by all of column j, and that gives us back a rank one tensor, which we add up. Okay, that's exactly the same as what we had before. And so now that takes 1.45 milliseconds. We've removed one line of code and it's 178 times faster. Okay, so we successfully got rid of that inner loop. And so now this is running in C, right? We didn't really write Python here. We wrote kind of a Pythonic-ish thing that said please call this C code for us. And that made it 178 times faster. Let's check that it's right. We can't really check that it's equal, because floats sometimes change slightly depending on how you calculate them. So instead, let's create something called near, which calls torch.allclose to some tolerance. And then we'll create a test_near function that calls our test function using our near comparison. And let's see — yep, passes. Okay, so we've now got our matrix multiplication at 65 microseconds. Now we need to get rid of this loop, because now this is our innermost loop. And to do that, we're gonna have to use something called broadcasting. Who here is familiar with broadcasting? About half. Okay, that's what I figured.
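Before moving on to broadcasting, the two-loop version just described can be sketched as — the inner k loop becomes an element-wise multiply over all of row i and column j, followed by a sum:

```python
import torch

def matmul2(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # where it used to say k, it now says colon:
            # the whole of row i times the whole of column j, summed up
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c

m1 = torch.randn(5, 784)
m2 = torch.randn(784, 10)
t2 = matmul2(m1, m2)
print(t2.shape)  # torch.Size([5, 10])
```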
So broadcasting is about the most powerful tool we have in our toolbox for writing code in Python that runs at C speed. Or in fact, with PyTorch, if you put it on the GPU, it's gonna run at CUDA speed. It allows us to get rid of nearly all of our loops, as you'll see. The term broadcasting comes from NumPy, but the idea actually goes all the way back to APL from 1962. And it's a really, really powerful technique. A lot of people consider it a different way of programming, where we get rid of all of our for loops and replace them with these implicit, broadcast loops. In fact, you've seen broadcasting before. Remember our tensor a, which contains 10, 6, -4? If you say a greater than zero, then on the left-hand side you've got a rank one tensor; on the right-hand side you've got a scalar, and yet somehow it works. And the reason why is that this value zero is broadcast three times. It becomes zero, comma, zero, comma, zero, and then it does an element-wise comparison. So every time, for example, you've normalized a dataset by subtracting the mean and dividing by the standard deviation in kind of one line like this, you've actually been broadcasting. You're broadcasting a scalar to a tensor. So a plus one also broadcasts a scalar to a tensor. And the tensor doesn't have to be rank one. Here we can multiply our rank two tensor by two. So there's the simplest kind of broadcasting. And any time you do that, you're not operating at Python speed, you're operating at C or CUDA speed. So that's good. We can also broadcast a vector to a matrix. So here's a rank one tensor, c. And here's our previous rank two tensor, m. So m's shape is three by three; c's shape is three. And yet m plus c does something. What did it do? 10, 20, 30 plus one, two, three. 10, 20, 30 plus four, five, six. 10, 20, 30 plus seven, eight, nine. Ha! It's broadcast this row across each row of the matrix, and it's doing that at C speed, right?
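The element-wise operations and the broadcasting examples from the last couple of passages can be sketched together like this:

```python
import torch

a = torch.tensor([10., 6, -4])
b = torch.tensor([2., 8, 7])

print(a + b)                   # element-wise addition
print((a < b).float().mean())  # fraction of a less than b

# Broadcasting a scalar: 0 is stretched to [0, 0, 0] before comparing
print(a > 0)
print(a + 1)

m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])
c = torch.tensor([10., 20, 30])
print(m * 2)   # scalar broadcast across a rank-2 tensor
print(m + c)   # rank-1 c broadcast across each row of m
```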
So there's no loop, but it sure looks as if there was a loop. c plus m does exactly the same thing. So we can write c.expand_as(m), and it shows us what c would look like when broadcast to m: 10, 20, 30, 10, 20, 30, 10, 20, 30. So you can see m plus t is the same as c plus m, right? So basically it's acting as if it's creating this bigger rank two tensor. So this is pretty cool, because it now means that any time we need to do something between a vector and a matrix, we can do it at C speed with no loop. Now you might be worrying, though, that this looks pretty memory intensive if we're turning all of our rows into big matrices, but fear not, because you can look inside the actual memory used by PyTorch. So here t is a three by three matrix, but t.storage() tells us that actually it's only storing one copy of that data. t.shape tells us that t knows it's meant to be a three by three matrix, and t.stride() tells us that it knows that when it's going from column to column it should take one step through the storage, but when it goes from row to row it should take zero steps. And so that's how come it repeats 10, 20, 30, 10, 20, 30, 10, 20, 30, right? So this is a really powerful thing that appears in pretty much every linear algebra library you'll come across: the idea that you can actually create tensors that behave like higher rank things than they're actually stored as. So this is really neat. It basically means that this broadcasting functionality gives us C-like speed with no additional memory overhead. Okay, what if we wanted to take a column instead of a row? So in other words, a rank two tensor of shape three comma one. We can create a rank two tensor of shape three comma one from a rank one tensor by using the unsqueeze method. Unsqueeze adds an additional dimension of size one wherever we ask for it. So unsqueeze zero — let's check this out.
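The expand_as / storage / stride behavior just described can be verified directly:

```python
import torch

m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])
c = torch.tensor([10., 20, 30])

t = c.expand_as(m)
print(t)             # looks like three copies of c stacked up

# But no copying actually happened:
print(t.storage())   # still just the three numbers 10, 20, 30
print(t.shape)       # torch.Size([3, 3])
print(t.stride())    # (0, 1): 0 steps between rows, 1 between columns
```

The stride of (0, 1) is what makes the single stored row repeat down the matrix.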
Unsqueeze zero is of shape one comma three. It puts the new dimension in position zero. Unsqueeze one is of shape three comma one. It creates the new axis in position one. So unsqueeze zero looks a lot like c, right? But now rather than being a rank one tensor, it's a rank two tensor. See how it's got two square brackets around it? See how its size is one comma three? Perhaps more interestingly, c.unsqueeze(1) now looks like a column. It's also a rank two tensor, but it's three rows by one column. Why is this interesting? Well, actually, before we get to that, I'll just mention: writing .unsqueeze is kind of clunky. So PyTorch and NumPy have a neat trick, which is that you can index into an array with a special value, None. And None means insert a new axis here, please. So you can see that c[None, :] is exactly the same shape, one comma three, as c.unsqueeze(0). And c[:, None] is exactly the same shape as c.unsqueeze(1). So I hardly ever use unsqueeze unless I'm particularly trying to demonstrate something for teaching purposes. I pretty much always use None. Apart from anything else, I can add several additional axes this way; with unsqueeze, you'd have to go unsqueeze, unsqueeze, unsqueeze. So this is handy. So why did we do all that? The reason we did all that is because if we go c[:, None] — so in other words, we've turned it into a column, kind of a columnar shape, so it's now of shape three comma one — dot expand_as, it doesn't now say 10, 20, 30, 10, 20, 30, 10, 20, 30, but it says 10, 10, 10, 20, 20, 20, 30, 30, 30. So in other words, it's getting broadcast along columns instead of rows. So as you might expect, if I take that and add it to m, then I get the result of broadcasting the column. So it's now not 11, 22, 33, but 11, 12, 13. So everything makes more sense in Excel. Let's look. So here's broadcasting in Excel, right?
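The unsqueeze and None-indexing equivalence, plus the column broadcast just described, as a sketch:

```python
import torch

c = torch.tensor([10., 20, 30])

print(c.unsqueeze(0).shape)   # torch.Size([1, 3]) -- a row
print(c.unsqueeze(1).shape)   # torch.Size([3, 1]) -- a column

# None indexing does the same thing, and stacks up more easily:
print(c[None, :].shape)       # torch.Size([1, 3])
print(c[:, None].shape)       # torch.Size([3, 1])
print(c[None, :, None].shape) # torch.Size([1, 3, 1])

m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])
# The column version broadcasts along columns instead of rows:
print(c[:, None].expand_as(m))  # 10,10,10 / 20,20,20 / 30,30,30
print(m + c[:, None])           # 11,12,13 / 24,25,26 / 37,38,39
```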
Here is a one comma three shape rank two tensor. So we can use the ROWS and COLUMNS functions in Excel to get the rows and columns of this object. Here is a three by one rank two tensor — again, rows and columns. And here is a three by three rank two tensor; as you can see, rows by columns. So here's what happens if we broadcast this to be the shape of m, okay? And here is the result of that, c plus m. And here's what happens if we broadcast this to that shape. And here is the result of that addition. And there it is: 11, 12, 13, 24, 25, 26, right? Okay? So basically what's happening is, when we broadcast, it's taking the thing which has a unit axis and effectively copying that unit axis so it is as long as the larger tensor on that axis. But it doesn't really copy it, it just pretends as if it's been copied. So we can use that to get rid of our loop. So this was the loop we were trying to get rid of: going through each of range(bc). And so here it is. So now we are not going through that loop anymore. So now rather than setting c[i, j], we can set the entire row, c[i], right? This is the same as c[i, :], right? Anytime there's a trailing colon in NumPy or PyTorch, you can delete it, optionally — you don't have to. So before, we had a few of those. Let's see if we can find one. Here's one: comma colon. So I'm claiming we could have got rid of that. Let's see. Yep, still torch.Size([1, 3]), right? And a similar thing: anytime you see any number of colon commas at the start, you can replace them with a single ellipsis, which in this case doesn't save us anything because there's only one of these, but if you've got a really high-rank tensor, that can be super convenient — especially if you want to do something where the rank of the tensor could vary and you don't know how big it's gonna be ahead of time.
So we're gonna set the whole of row i — and we don't need that colon, though it doesn't matter if it's there. And we're gonna set it to the whole of row i of a, okay? And now that we've got row i of a, that is a rank one tensor. So let's turn it into a rank two tensor. See how this is minus one? Minus one always means the last dimension, right? So how else could we have written that? We could also have written it with the special value None. So this is now of length whatever the number of columns of a is, which is ac. So it's of shape ac comma one, right? So that is a rank two tensor, and b is also a rank two tensor — that's the entirety of our matrix, right? And so this is gonna get broadcast over this. It is exactly what we want; we want it to get rid of that loop. And so, because it broadcasts, it's actually gonna return a rank two tensor. And then that rank two tensor, we wanna sum it up over the rows. And so sum — you can give it a dimension argument to say which axis to sum over. So this one is kind of our most mind-bending broadcast of the lesson. So I'm gonna leave this as a bit of homework for you: to go back and convince yourself as to why this works. So maybe put it in Excel or do it on paper, if it's not already clear to you why this works. But this is sure handy, because before we were broadcasting that, we were at 1.39 milliseconds. After using that broadcasting, we're down to 250 microseconds. So at this point, we're now 3,200 times faster than Python. And it's not just speed. Once you get used to this style of coding, getting rid of these loops, I find really reduces a lot of errors in my code. It takes a while to get used to, but once you're used to it, it's a really comfortable way of programming. Once you get to higher rank tensors, this broadcasting can start getting a bit complicated.
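The single-loop version described in this passage can be sketched like this — a[i].unsqueeze(-1) and a[i][:, None] are the same thing:

```python
import torch

def matmul3(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # a[i][:, None] is rank 2, shape (ac, 1); it broadcasts against
        # b, shape (ac, bc), giving a rank-2 result which we sum over
        # dim 0 to produce the whole of row i at once
        c[i] = (a[i][:, None] * b).sum(dim=0)
    return c

m1 = torch.randn(5, 784)
m2 = torch.randn(784, 10)
t3 = matmul3(m1, m2)
print(t3.shape)  # torch.Size([5, 10])
```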
So what you need to do, instead of trying to keep it all in your head, is apply the simple broadcasting rules. Here are the rules; I've listed them here. In NumPy and PyTorch and TensorFlow, it's all the same rules. What we do is we compare the shapes element-wise. So let's look at a slightly interesting example. Here is our rank one tensor c, and let's insert a leading unit axis. So this is of shape one comma three — see how there's two square brackets? And here's the version with a trailing unit axis, so this is of shape three comma one. And we should take a look at that. So just to remind you, that looks like a column. What if we went c[None, :] times c[:, None]? What on earth is that? And so let's go back to Excel. Here's our row version. Here's our column version. What happens is, it says: okay, you wanna multiply this by this element-wise, right? This is not the at sign — this is asterisk, so element-wise multiplication. It broadcasts this to be the same number of rows as that, like so. And it broadcasts this to be the same number of columns as that, like so. And then it simply multiplies those together. That's it, right? So the rule that it's using — and you can do the same thing with greater than, right — the rule that it's using is: let's look at the two shapes, one comma three and three comma one, and see if they're compatible. They're compatible if, element-wise, they're either the same number or one of them is one. So in this case, one is compatible with three, because one of them is one. And three is compatible with one, because one of them is one. And so what happens is, if it's one, that dimension is broadcast to make it the same size as the bigger one. Okay, so three comma one became three comma three. So this one was copied three times down the rows and this one was copied three times across the columns. And then there's one more rule, which is that they don't even have to be the same rank, right?
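The row-times-column example just walked through can be sketched as — (1, 3) against (3, 1), both stretched to (3, 3), which is an outer product:

```python
import torch

c = torch.tensor([10., 20, 30])

row = c[None, :]   # shape (1, 3)
col = c[:, None]   # shape (3, 1)

# Shapes are compared element-wise; each 1 is stretched to match,
# so (1, 3) * (3, 1) -> (3, 3):
outer = row * col
print(outer)
# tensor([[100., 200., 300.],
#         [200., 400., 600.],
#         [300., 600., 900.]])

# Same rules apply to comparisons:
print(row > col)
```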
So something that we do a lot with image normalization is we normalize images by channel, right? So you might have an image which is 256 by 256 by three, and then you've got the per-channel mean, which is just a rank one tensor of size three. They're actually compatible, because what it does is, anywhere that there's a missing dimension, it inserts a one at the start — it inserts leading dimensions and makes them one. So that's why actually you can normalize by channel with no lines of code. Mind you, in PyTorch it's actually channel by height by width, so it's slightly different, but this is the basic idea. So this is super cool. We're gonna take a break, but we're getting pretty close. My goal was to make our Python code 50,000 times faster; we're up to 4,000 times faster. And the reason this is really important is because if we're gonna be doing our own stuff, building things that people haven't built before, we need to know how to write code that we can write quickly and concisely, but that operates fast enough that it's actually useful, right? And so this broadcasting trick is perhaps the most important trick to know about. So let's have a six minute break and I'll see you back here at eight o'clock. So, broadcasting. When I first started teaching deep learning here, and I asked how many people are familiar with broadcasting — this is back when we used to do it in Theano — almost no hands went up, so I used to kind of say this is like my secret magic trick. It's kind of really cool that now half of you have already heard of it, and it's kind of sad, because it's now not my secret magic trick; it's like, here's something half of you already knew. But for the other half of you — there's a reason that people are learning this quickly, and it's because it's super cool. Here's another magic trick. How many people here know Einstein summation notation? Okay, good, good, almost nobody.
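The per-channel normalization point can be sketched like this. Note the PyTorch caveat: with channel-by-height-by-width layout, the rank-1 stats need trailing unit axes (with height-by-width-by-channel, a plain rank-1 tensor of size three would broadcast directly, since missing dimensions are inserted at the front):

```python
import torch

# A fake image in PyTorch's channel x height x width layout
img = torch.rand(3, 256, 256)

# Per-channel mean and std are rank-1 tensors of size 3
mean = img.mean(dim=(1, 2))
std = img.std(dim=(1, 2))

# Trailing unit axes make (3,) broadcast against (3, 256, 256)
normed = (img - mean[:, None, None]) / std[:, None, None]
print(normed.mean(dim=(1, 2)))  # each close to 0
print(normed.std(dim=(1, 2)))   # each close to 1
```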
So it's not as cool as broadcasting, but it is still very, very cool. Let me show you. And this is a technique which — I don't think it was invented by Einstein; I think it was popularized by Einstein — as a way of dealing with these high-rank tensor reductions that he used in general relativity, I think. Here's the trick. This is the innermost part of our original matrix multiplication for loop, remember? And here's the version where we removed the innermost loop and replaced it with an element-wise product. And you'll notice that what happened was that the repeated k got replaced with a colon. Okay, so watch this. What if I get rid of the names of everything, move this to the end and put it after an arrow, keep getting rid of the names of everything, get rid of the commas between the indices, and replace the brackets with a comma between the inputs? Okay — and now I've just created Einstein summation notation. So Einstein summation notation is like a mini language. You put it inside a string, right? And what it says is — so there's an arrow, right? On the left of the arrow is the input and on the right of the arrow is the output. How many inputs do you have? Well, they're delimited by commas. So in this case, there's two inputs. What's the rank of each input? It's however many letters there are. So this is a rank two input, and this is another rank two input, and this is a rank two output. How big are the inputs? This one is of size i by k, this one is of size k by j, and the output is of size i by j. When you see the same letter appearing in different places, it's referring to the same size dimension. So this is of size i, and the output also has i rows. This has j columns, and the output also has j columns. All right, so we know how to go from the input shape to the output shape. What about the k?
You look for any place that a letter is repeated, and you do a dot product over that dimension. In other words, it's just like the way we replaced k with colon, okay? So this is going to create something of size i by j by doing dot products over these shared k's, which is matrix multiplication, okay? So that's how you write matrix multiplication with Einstein summation notation. And then all you do is go torch.einsum. If you go to the PyTorch einsum docs, or the docs of most of the major libraries, you can find all kinds of cool examples of einsum. You can use it for transpose, diagonalization, tracing, all kinds of things — batch-wise versions of just about everything. So for example, if PyTorch didn't have batch-wise matrix multiplication, I just created it: there's batch-wise matrix multiplication, right? So there's all kinds of things you can invent. And often it's quite handy if you need to put a transpose in somewhere, or tweak things to be a little bit different — you can use this. So that's Einstein summation notation. Here's matmul, written with einsum. And that's now taken us down to 57 microseconds. So we're now 16,000 times faster than Python. I will say something about einsum. It's a travesty that this exists, because we've got a little mini language inside Python, in a string. I mean, that's horrendous. You shouldn't be writing programming languages inside a string. This is as bad as a regex, you know? Regexes are also mini languages inside a string. You want your languages to be typed and have IntelliSense and be things that you can, you know, extend. This mini language — it's amazing, but there's so few things that it actually does, right? What I actually want to be able to do is create any kind of arbitrary combination of any axes and any operations and any reductions I like, in any order, in the actual language I'm writing in, right? So that's actually what APL does. That's actually what J and K do.
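The einsum matmul, plus the improvised batch-wise version mentioned above, can be sketched as:

```python
import torch

m1 = torch.randn(5, 784)
m2 = torch.randn(784, 10)

# "ik,kj->ij": two rank-2 inputs; the repeated k is the dimension
# the dot product runs over -- i.e. matrix multiplication
t = torch.einsum("ik,kj->ij", m1, m2)
assert torch.allclose(t, m1 @ m2, atol=1e-3)

# Batch-wise matrix multiplication, by adding a leading batch axis b:
a = torch.randn(16, 5, 7)
b = torch.randn(16, 7, 3)
bt = torch.einsum("bik,bkj->bij", a, b)
assert torch.allclose(bt, a @ b, atol=1e-3)
```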
These are the languages — J and K — that came out of APL, a series of languages that have been around for about 60 years, and everybody's pretty much failed to notice. My hope is that things like Swift and Julia will give us this: the ability to actually write stuff in actual Swift and actual Julia that we can run in an actual debugger and use an actual profiler on, and do arbitrary stuff that's really fast. And actually, Swift seems like it might go even quite a bit faster than einsum, in an even more flexible way, thanks to this new compiler infrastructure called MLIR, which builds off some really exciting recent research in the compiler world. It's been coming over the last few years, particularly out of a system called Halide — that's H-A-L-I-D-E — which is this super cool language that basically showed it's possible to create a language that can produce totally optimized linear algebra computations in a really flexible, convenient way. And since that came along, there's been all kinds of cool research using these techniques, like something called polyhedral compilation, which holds the promise that hopefully within the next couple of years we're gonna be able to write Swift code that runs as fast as the next thing I'm about to show you. Because the next thing I'm about to show you is the PyTorch operation called matmul. And matmul takes 18 microseconds, which is 50,000 times faster than Python. Why is it so fast? Well, if you think about what you're doing when you do a matrix multiply of something that's like 50,000 by 784 by 784 by 10 — these are things that aren't gonna fit in the cache in your CPU.
So if you do the standard thing of going down all the rows and across all the columns, by the time you've got to the end and you go back to exactly the same column again, it's forgotten the contents and has to go back to RAM and pull it in again, right? So if you're smart, what you do is you break your matrix up into smaller matrices and you do a little bit at a time. And that way everything's kind of in cache and it goes super fast. Now, normally to do that, you have to write assembly language code, particularly if you want to get it all running in your vector processor. And that's how you get these 18 microseconds. So currently, to get a fast matrix multiply, things like PyTorch don't even write it themselves. They basically push it off to something called a BLAS — a BLAS is a Basic Linear Algebra Subprograms library — where companies like Intel and AMD and NVIDIA write these things for you, right? So you can look up cuBLAS, for example — this is NVIDIA's version of BLAS — or you could look up MKL, and this is Intel's version, and so forth. And this is kind of awful, because the programmer is limited to the subset of things that your BLAS can handle. And to use it, you don't really get to write it in Python. You kind of have to write the one thing that happens to be turned into that preexisting BLAS code. So this is why we need to do better, right? And there are people working on this — there are people actually in Chris Lattner's team working on this. There's some really cool stuff, like there's something called tensor comprehensions, which originally came out of PyTorch, and I think they're now inside Chris's team at Google, where people are basically saying: hey, here are ways to compile these much more general things. And this is what we want as more advanced practitioners.
Anyway, for now, in PyTorch world, we're stuck at this level, which is to recognize there are some things where the built-in operation is, you know, three times faster than the best we can do in an even vaguely flexible way. And if we compare it to the actually flexible way, which is broadcasting, we had 254 microseconds — yeah, so still over 10 times better. Right, so wherever possible, today we wanna use operations that are predefined in our library, particularly for things that operate over lots of rows and columns — the things where this memory caching stuff is gonna be complicated — so keep an eye out for that. Matrix multiplication is so common and useful that it's actually got its own operator, which is @. These are actually calling the exact same code, so they're the exact same speed. @ is not actually just matrix multiplication. @ covers a much broader array of tensor reductions across different levels of axes. So it's worth checking out what matmul can do, because often it'll be able to handle things like batch-wise, or matrix versus vector. Don't think of it as being only something that can do rank two by rank two, because it's a little bit more flexible. Okay, so that's that. We have matrix multiplication, and so now we're allowed to use it. And so we're gonna use it to try to create a forward pass, which means we first need weight initialization, because remember, a model contains parameters which start out randomly initialized. And then we use the gradients to gradually update them with SGD. So let's do that. So here is notebook 02. Let's start by importing nb_01, and I just copied and pasted the three lines we used to grab the data, and I'm just gonna pop them into a function so we can use it to grab MNIST when we need it. And now that we know about broadcasting, let's create a normalization function that takes our tensor and subtracts the mean and divides by the standard deviation.
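The normalization function just described, and how it gets applied, can be sketched like this — with synthetic stand-ins for the real x_train and x_valid, which in the notebook come from the get_data() wrapper around the three MNIST-loading lines:

```python
import torch

def normalize(x, m, s):
    # subtract the mean and divide by the standard deviation
    return (x - m) / s

# Synthetic stand-ins for the MNIST training and validation inputs
x_train = torch.rand(1000, 784) * 0.3
x_valid = torch.rand(200, 784) * 0.3

train_mean, train_std = x_train.mean(), x_train.std()

# Crucially, the validation set is normalized with the *training*
# set's statistics, so both sets are transformed the same way:
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

print(x_train.mean(), x_train.std())  # close to 0 and 1
```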
So now let's grab our data and pop it into x_train, y_train, x_valid, y_valid. Let's grab the mean and standard deviation, and notice that they're not zero and one. Why would they be? But we want them to be zero and one. We're going to see a lot about why we want them to be zero and one over the next couple of lessons, but for now, just take my word for it. So that means we need to subtract the mean and divide by the standard deviation. But for the validation set, we don't subtract the validation set's own mean and divide by the validation set's own standard deviation; we use the training set's statistics. Because otherwise, those two data sets would be on totally different scales. So if the training set was mainly green frogs and the validation set was mainly red frogs, and we normalized the validation set with its own mean and variance, we would end up with them both having the same average coloration, and we wouldn't be able to tell the two apart. So that's an important thing to remember when normalizing: always make sure your validation and training sets are normalized in the same way. So after doing that, our mean is pretty close to zero and our standard deviation is very close to one, and it would be nice to have something to easily check that these are true. So let's create a test_near_zero function, and then test that the mean is near zero and that one minus the standard deviation is near zero. That's all good. Let's define n, m, and c the same way as before: the size of the training set, the number of columns, and the number of activations we're eventually going to need in our model, which is c. And let's try to create our model. 
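Here's a minimal sketch of that normalization logic. The random stand-in data below replaces MNIST just for illustration; the key line is that the validation set is normalized with the training set's statistics:

```python
import torch

def normalize(x, mean, std):
    return (x - mean) / std

torch.manual_seed(0)
x_train = torch.rand(1000, 784) * 0.3   # stand-in for MNIST pixel data
x_valid = torch.rand(200, 784) * 0.3

train_mean, train_std = x_train.mean(), x_train.std()
x_train = normalize(x_train, train_mean, train_std)
# note: the validation set uses the *training* set's mean and std,
# so both sets end up on the same scale
x_valid = normalize(x_valid, train_mean, train_std)
```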
So the model is going to have one hidden layer. Normally we would want the final output to have ten activations, because we would use cross entropy against those ten activations, but to simplify things for now, we're not going to use cross entropy; we're going to use mean squared error, which means we're going to have one activation. That makes no sense from a modeling point of view, and we'll fix it later, but it simplifies things for now. So let's create a simple neural net with a single hidden layer and a single output activation, which we're going to use with mean squared error. Let's pick a hidden size: the number of hidden units will be 50. So for our two layers, we're going to need two weight matrices and two bias vectors. Here are our two weight matrices, w1 and w2. They're normal random numbers of size m by nh, that is, the number of columns, 784, by the number of hidden units, and then this one is nh by 1. Our inputs are now mean zero, standard deviation one, and we want the inputs to the second layer to also be mean zero, standard deviation one. Well, how are we going to do that? Because if we just grab some normal random numbers, then define a function called lin (this is our linear layer, which is x @ w + b), and then create t, which is the activation of that linear layer applied to our validation set with our weights and biases, we get a mean of minus five and a standard deviation of 27, which is terrible. So I'm going to let you work through this at home, but once you actually look at what happens when you multiply those things together and add them up, as you do in matrix multiplication, you'll see that you're not going to end up with mean zero, standard deviation one. But if instead you divide by the square root of m, so root 784, then it's actually pretty damn good. 
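The effect being described is easy to demonstrate. This is a small sketch, with arbitrary sizes, comparing an unscaled init against one divided by root m:

```python
import torch

torch.manual_seed(42)
m, nh = 784, 50
x = torch.randn(10000, m)             # inputs with mean 0, std 1

w_bad  = torch.randn(m, nh)           # no scaling
w_good = torch.randn(m, nh) / m**0.5  # divide by sqrt(m)

t_bad  = x @ w_bad    # std blows up to roughly sqrt(784), about 28
t_good = x @ w_good   # std stays close to 1
```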
So this is a simplified version of something PyTorch calls Kaiming initialization, named after Kaiming He, who was the lead author of a paper that we're going to look at in a moment. So for the weights: randn gives you random numbers with a mean of zero and a standard deviation of one, so if you divide by root m, they'll have a mean of zero and a standard deviation of one over root m. So we can test those. In general, normal random numbers with mean zero and standard deviation of one over the root of whatever this is (here it's m, and here it's nh) will give you an output with mean zero, standard deviation one. Now, this may seem like a pretty minor issue, but as we're going to see in the next couple of lessons, it's the thing that matters when it comes to training neural nets. In the last few months, people have really been noticing how important this is. There are things like Fixup initialization, where these folks actually trained a 10,000-layer deep neural network with no normalization layers, just by doing careful initialization. So people are really spending a lot of time now thinking, okay, how we initialize things is really important. And we've had a lot of success with things like one cycle training and super convergence, which is all about what happens in those first few iterations, and it turns out that it's really all about initialization. So we're going to be spending a lot of time studying this in depth. The first thing I'm going to point out is that this is actually not how our first layer is defined. Our first layer actually has a ReLU on it. So first, let's define ReLU. ReLU just takes our data and replaces any negatives with zeros; that's all clamp_min means. Now, there are lots of ways I could have written this, but if you can do it with something that's a single function in PyTorch, it's almost always faster, because that thing's generally written in C for you. 
So try to find the thing that's as close to what you want as possible; there are a lot of functions in PyTorch. So that's a good way of implementing ReLU. And unfortunately, that does not have a mean of zero and a standard deviation of one. Why not? Well, there's my stylus. Okay, so we had some data that had a mean of zero and a standard deviation of one, and then we took everything that was smaller than zero and removed it. So that obviously no longer has a mean of zero, and it obviously now has about half the standard deviation that it used to have. So this was one of the fantastic insights in one of the most extraordinary papers of the last few years. It was the paper from the 2015 ImageNet winners, led by the person we just mentioned, Kaiming He, who at that time was at Microsoft Research. And it's full of great ideas. Reading papers from competition winners is a very, very good idea, because normal papers will have one tiny tweak that they spend pages and pages trying to justify why it should be accepted into NeurIPS, whereas competition winners have 20 good ideas and only time to mention them in passing. This paper introduced us to ResNets, PReLU layers, and Kaiming initialization, amongst other things. So here is section 2.2: initialization of filter weights for rectifiers. What's a rectifier? A rectifier is a rectified linear unit, and a rectifier network is any neural network with rectified linear units in it. This is only 2015, but it already reads like something from another age in so many ways, even the words "rectifier units" and "traditional sigmoid activation networks"; no one uses sigmoid activations anymore. So a lot's changed since 2015, and when you read these papers, you have to keep these things in mind. They describe what happens if you train very deep models, models with more than eight layers. So things have changed, right? 
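The claim that ReLU kills the mean and shrinks the standard deviation is easy to check numerically. A quick sketch (the sample size is arbitrary):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000)   # mean ~0, std ~1
r = x.clamp_min(0.)          # ReLU: everything below zero removed

# the mean is no longer zero (it's around 0.4 for a standard normal),
# and the std is well below 1 (around 0.58)
print(r.mean().item(), r.std().item())
```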
But anyway, they said that in the old days, people used to initialize these with random Gaussian distributions. A Gaussian distribution is just a fancy word for a normal, bell-shaped distribution. And when you do that, they tend not to train very well. And the reason why, they point out, or actually Glorot and Bengio pointed out, so let's look at that paper. So you'll see two initializations come up all the time. One is Kaiming or He initialization, which is this one. The other you'll see a lot is Glorot or Xavier initialization, named after Xavier Glorot. This is a really interesting paper to read. It's a slightly older one, from 2010, but massively influential. And one of the things you'll notice if you read it is that it's very readable, it's very practical, and the final result they come up with is incredibly simple. We're actually going to be re-implementing much of the stuff in this paper over the next couple of lessons. Basically, they describe one suggestion for how to initialize neural nets, and they suggest this particular approach, which is root six over the root of the number of input filters plus the number of output filters. And what happened was that Kaiming He and that team pointed out that this does not account for the impact of a ReLU, the thing that we just noticed. And this is a big problem: if your variance halves each layer and you have a massively deep network with, say, eight layers, then you've got a factor of one over two to the eighth. By the end, it's all gone. And if you want to be fancy like the Fixup people with 10,000 layers, forget it; your gradients have totally disappeared. So this is totally unacceptable. So they do something super genius smart: they replace the one on the top with a two on the top. Which is not to take anything away from this, it's a fantastic paper, but in the end, the thing they do is to stick a two on the top. 
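The two formulas being contrasted can be written down directly. This is just the arithmetic, with the layer sizes from this lesson plugged in for illustration:

```python
import math

fan_in, fan_out = 784, 50

# Glorot/Xavier: derived for symmetric activations like tanh,
# as a bound for a *uniform* init
glorot_bound = math.sqrt(6. / (fan_in + fan_out))

# Kaiming/He: the 2 on top compensates for ReLU throwing away
# half the variance at each layer
kaiming_std = math.sqrt(2. / fan_in)
```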
So we can do that by taking that exact equation we just used and sticking a two on the top. And if we do, then the result is much closer. It's not perfect; it actually varies quite a lot, because it's random. Sometimes it's quite close, sometimes it's further away, but it's certainly a lot better than it was. So that's good, and it's really worth reading. So more homework for this week is to read section 2.2 of the ResNet paper. What you'll see is that they describe what happens in the forward pass of a neural net, and they point out that for the conv layer, the response is y = Wx + b. Now, if you're concentrating, that might be confusing, because a conv layer isn't quite y = Wx + b; a conv layer has a convolution. But you remember in part one, I pointed out this neat article from Matt Kleinsmith, where he showed that convolutions actually are just matrix multiplications with a bunch of zeros and some tied weights. So that's basically all they're saying here. Sometimes there are these throwaway lines in papers that are actually quite deep and worth thinking about. So they point out that you can just think of this as a linear layer, and then they basically take you through, step by step, what happens to the variance of your network depending on the initialization. So just try to get to this point here; get as far as the backward propagation case. So you've got about, I don't know, six paragraphs to read. None of the math notation is weird. Maybe this one is, if you haven't seen it before: it's exactly the same as a capital sigma, but instead of doing a sum, you do a product. So this is a great way to warm up your paper-reading muscles: try to read this section. And then, if that's going well, you can keep going with the backward propagation case, because the forward pass does a matrix multiply. 
And as we'll see in a moment, the backward pass does a matrix multiply with the transpose of the matrix. So the backward pass is slightly different, but it's nearly the same. And then at the end of that, they eventually come up with their suggestion. Let's see if we can find it. Oh yeah, here it is. They suggest root two over nl, where nl is the number of input activations. Okay, so that's what we're using. That is called Kaiming initialization, and it gives us a pretty nice variance. It doesn't give us a very nice mean, though. And the reason it doesn't is because, as we saw, we deleted everything below the axis, so naturally our mean is now about half, not zero. I haven't seen anybody talk about this in the literature, but something I was trying over the last week is something kind of obvious, which is to replace ReLU with not just x.clamp_min(0.), but x.clamp_min(0.) - 0.5. And in my brief experiments, that seems to help. So there's another thing that you could try out, and see if it actually helps or if I'm just imagining things. It certainly returns you to the correct mean. Okay, so now that we have this formula, we can replace it with init.kaiming_normal_, according to our rules, because it's the same thing. And let's check that it does the same thing. It does. So again, we've got this mean of about half and a standard deviation a bit under one. You'll notice here I had to add something extra, which is mode='fan_out'. What does that mean? What it means is explained here: fan in or fan out. Fan in preserves the magnitude of the variance in the forward pass; fan out preserves the magnitudes in the backward pass. Basically, all it's saying is: are you dividing by root m or root nh? Because if you divide by root m, as you'll see in that part of the paper I was suggesting you read, that will keep the variance at one during the forward pass, but if you use nh, it will give you the right unit variance in the backward pass. 
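The shifted-ReLU experiment described above can be sketched in a couple of lines. This is Jeremy's tweak as described in the lesson, not an established technique, so treat it as an experiment to reproduce:

```python
import torch

def shifted_relu(x):
    # subtract 0.5 after clamping, to move the post-ReLU mean
    # back toward zero
    return x.clamp_min(0.) - 0.5

torch.manual_seed(0)
x = torch.randn(1_000_000)
plain_mean   = x.clamp_min(0.).mean()   # around 0.4
shifted_mean = shifted_relu(x).mean()   # much closer to zero
```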
So it's weird that I had to say fan out, because according to the documentation, that's for the backward pass, to keep the unit variance there. So why did I need it? Well, it's because our weight's shape is 784 by 50, but if you actually create a linear layer with PyTorch of the same dimensions, it creates it as 50 by 784. It's the opposite. So how can that possibly work? These are the kinds of things it's useful to know how to dig into. So how is this working? To find out, you have to look in the source code. You can either set up Visual Studio Code or something like that, so you can jump between things (that's a nice way to do it), or you can just do it here with question mark, question mark. And you can see that this is the forward function, and it calls something called F.linear. In PyTorch, capital F always refers to the torch.nn.functional module; it's used so often, everywhere, that they decided it's worth a single letter. So torch.nn.functional.linear is what it calls, and let's look at how that's defined: input.matmul(weight.t()). t() means transpose. Okay, so now we know that in PyTorch, a linear layer doesn't just do a matrix product; it does a matrix product with a transpose. So in other words, it's actually going to turn this into 784 by 50 and then do it. And that's why we had to give it the opposite information when we were doing it with our own linear layer, which doesn't have a transpose. The main reason I show you that is to show you how you can dig into the PyTorch source code and see exactly what's going on, because when you come across these kinds of questions, you want to be able to answer them yourself. Which then leads to the question: if this is how linear layers get initialized, what about convolutional layers? What does PyTorch do for those? So we could look inside torch.nn.Conv2d. 
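You can verify the transposed-weight behavior directly, without reading the source. A small sketch:

```python
import torch
from torch import nn

torch.manual_seed(0)
lin = nn.Linear(784, 50)
# PyTorch stores the weight as (out_features, in_features)...
assert lin.weight.shape == (50, 784)

# ...because F.linear computes input @ weight.t() (plus bias)
x = torch.randn(3, 784)
manual = x @ lin.weight.t() + lin.bias
assert torch.allclose(lin(x), manual, atol=1e-5)
```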
And when I looked at it, I noticed that it basically doesn't have any code; it just has documentation. All of the code actually gets passed down to something called _ConvNd. So you need to know how to find these things. If you go to the very bottom, you can find the file name it's in, and you'll see this is actually torch.nn.modules.conv. So we can find torch.nn.modules.conv._ConvNd, and here it is, and here's how it initializes things. It calls kaiming_uniform_, which is basically the same as kaiming_normal_, but uniform instead. But it has a special multiplier of math.sqrt(5), and that is not documented anywhere. I have no idea where it comes from, and in my experiments, it seems to work pretty badly, as you'll see. So it's useful to look inside the code. And when you're writing your own code: presumably somebody put this here for a reason. Wouldn't it have been nice if they had a URL above it, with a link to the paper they were implementing, so we could see what's going on? That's always a good idea, you know: put some comments in your code to let the next person know what the hell you're doing. So I have a strong feeling that particular thing isn't great, as you'll see. Okay, so we're going to try this thing of subtracting 0.5 from our ReLU. And this is pretty cool, right? We've already designed our own new activation function. Is it great? Is it terrible? I don't know, but it's the kind of level of tweak that's normal when people write papers; it's like a minor change to one line of code. It'll be interesting to see how much it helps. If I use it, then you can see here, yep, now I have a mean that's zero or thereabouts. And interestingly, I've also noticed it helps my variance a lot. Before, my variance was generally around 0.7 to 0.8, but now it's generally above 0.8. 
So it helps both, which makes sense as to why I think I'm seeing better results. So now we have ReLU, we have linear, we have init, so we can do a forward pass. We're now up to here, and here it is. Remember, in PyTorch a model can just be a function, and so here's our model: it's just a function that does one linear layer, one ReLU, and one more linear layer. Let's try running it. Okay, it takes eight milliseconds to run the model on the validation set, so it's plenty fast enough to train. It's looking good. Add an assert to make sure the shape seems sensible. So the next thing we need for our forward pass is a loss function, and as I said, we're going to simplify things for now by using mean squared error, even though that's obviously a dumb idea. Our model is returning something of size 10,000 by 1, but for mean squared error you would expect just a single vector of size 10,000, so I want to get rid of this unit axis. In PyTorch, the thing to add a unit axis, we've learned, is called unsqueeze; the thing to get rid of a unit axis is therefore called squeeze. So we just go output.squeeze() to get rid of that unit axis. But actually, now I think about it, that's lazy, because output.squeeze() gets rid of all unit axes, and we very commonly see on the fast.ai forums people saying their code's broken, and it's because they've got a squeeze and they've hit that one case where maybe they had a batch size of one. A 1 by 1 would get squeezed down to a scalar, and things would break. So rather than just calling squeeze, it's actually better to say which dimension you want to squeeze, which we could write as either 1 or -1; here it's the same thing. And this is going to be more resilient to that weird edge case of a batch size of one. Okay, so output minus target, squared, mean: that's mean squared error. And remember, in PyTorch, loss functions can just be functions. 
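Putting the pieces so far together, here's a sketch of the forward pass and the MSE loss with the safer squeeze(-1). The sizes and random data are stand-ins for the MNIST tensors in the notebook:

```python
import torch

torch.manual_seed(0)
m, nh = 784, 50
w1 = torch.randn(m, nh) / m**0.5;  b1 = torch.zeros(nh)
w2 = torch.randn(nh, 1) / nh**0.5; b2 = torch.zeros(1)

def lin(x, w, b): return x @ w + b
def relu(x):      return x.clamp_min(0.)

def model(xb):
    # linear -> ReLU -> linear, returning one activation per row
    return lin(relu(lin(xb, w1, b1)), w2, b2)

def mse(output, targ):
    # squeeze only the last axis, not *all* unit axes, so a
    # batch of size one doesn't collapse to a scalar
    return (output.squeeze(-1) - targ).pow(2).mean()

x = torch.randn(100, m)
y = torch.randn(100)
preds = model(x)        # shape (100, 1)
loss = mse(preds, y)    # a scalar
```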
For mean squared error, we need these to be floats, so let's convert them. Now we can calculate some predictions (that's the shape of our predictions), and we can calculate our mean squared error. So there we go: we've done a forward pass. So we're up to here. A forward pass on its own is useless; what we need is a backward pass, because that's the thing that tells us how to update our parameters. So we need gradients. Okay, how much do you want to know about matrix calculus? I don't know; it's up to you. But if you want to know everything about matrix calculus, I can point you to this excellent paper by Terence Parr and Jeremy Howard, which tells you everything about matrix calculus from scratch. It's a few weeks' work to get through, but it assumes absolutely nothing at all. Basically, Terence and I both felt like, oh, we don't know any of this stuff; let's learn all of it and tell other people. We wrote it with that in mind, and it will take you all the way up to knowing everything you need for deep learning. You can actually get away with a lot less, but if you're here, maybe it's worth it. But I'll tell you what you do need to know: the chain rule. Let me point something out. We start with some input, and we stick it through the first linear layer, then through ReLU, then through the second linear layer, then through MSE, and that gives us our loss. Or, to put it another way, we start with x and put it through the function lin1, then take the output of that and put it through the function relu, then take the output of that and put it through the function lin2, then take the output of that and put it through the function mse. 
And strictly speaking, MSE has a second argument, which is the actual target value, and we want the gradient of the output with respect to the input. So it's a function of a function of a function of a function. If we simplify that down a bit, we could just say, what if it's just y = f(u) and u = g(x)? So that's a function of a function, simplified a little bit. Then the derivative is dy/dx = dy/du times du/dx; that's the chain rule. If that doesn't look familiar to you, or you've forgotten it, go to Khan Academy; they have some great tutorials on the chain rule. But this is the thing we need to know, because once you know it, then all you need is the derivative of each bit on its own, and you just multiply them all together. And if you ever forget the chain rule, just cross-multiply: dy/du times du/dx, cross out the du's, and you get dy/dx. If you went to a fancy school, they would have told you not to do that; they said you can't treat calculus like this, because these are special magic small things. Actually, you can. There's a different way of treating calculus called the calculus of infinitesimals, where all of this just makes sense, and you suddenly realize you actually can do this exact thing. So any time you see a derivative, just remember that all it's actually doing is taking some function and saying: as you go across a little bit, how much do you go up? And then it's dividing that change in y by that change in x. That's literally what it is, where y and x are small numbers, and they behave very sensibly when you just think of them as a small change in y over a small change in x, as I just did when showing you the chain rule. So to do the chain rule, we're going to start with the very last function. The very last function on the outside was the loss function, the mean squared error. 
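Written out, the chain rule for our composed model looks like this (this is just the standard statement, applied to the lin1, relu, lin2, mse chain from the lesson):

```latex
% chain rule for y = f(u), u = g(x)
\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}

% applied to the whole model, loss = mse(lin2(relu(lin1(x))))
\frac{\partial\,\text{loss}}{\partial x}
  = \frac{\partial\,\text{mse}}{\partial\,\text{lin2}}
    \cdot \frac{\partial\,\text{lin2}}{\partial\,\text{relu}}
    \cdot \frac{\partial\,\text{relu}}{\partial\,\text{lin1}}
    \cdot \frac{\partial\,\text{lin1}}{\partial x}
```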
So we just do each bit separately. We start with the gradient of the loss with respect to, what should I say, the output of the previous layer. So, the output of the previous layer: the MSE is just input minus target, squared, and so the derivative of that is just two times input minus target, because the derivative of blah squared is two times blah. Okay, so that's it. Now I need to store that gradient somewhere. The thing is, for the chain rule, I'm going to need to multiply all these things together. So if I store it inside the .g attribute of the previous layer (because remember, the input of MSE is the same as the output of the previous layer), then I can quite comfortably refer to it later. So here, look: ReLU. Let's do ReLU. ReLU is this shape. What's the gradient here? Zero. What's the gradient here? One. So the gradient of ReLU is just (input > 0). But we need the chain rule, so we multiply this by the gradient of the next layer, which, remember, we stored away, so we can just grab it. So this is really cool. Same thing for the linear layer. The gradient, and this is where the matrix calculus comes in, the gradient of a matrix product is simply the matrix product with the transpose. You can either read all that stuff I showed you, or you can take my word for it. So here's the cool thing: here's the function that does the forward pass we've already seen, and then it goes backwards, calling each of the gradients in reverse order, because we know we need that for the chain rule. Notice that every time, we're passing in the result of the forward pass, and each step also has access, as we discussed, to the gradient of the next layer. This is called backpropagation. 
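Here's a sketch of those gradient functions and the combined forward-and-backward pass, in the spirit of the notebook code; the function and variable names are my labels for what the lesson describes:

```python
import torch

def mse_grad(inp, targ):
    # gradient of the loss with respect to the output of the previous
    # layer; the /n comes from the mean in the loss
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    # chain rule: local gradient (inp > 0) times the next layer's gradient
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    # gradient of a matrix product is a matrix product with the transpose
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)

def forward_and_backward(inp, targ, w1, b1, w2, b2):
    # forward pass, saving the intermediate results
    l1 = inp @ w1 + b1
    l2 = l1.clamp_min(0.)
    out = l2 @ w2 + b2
    loss = (out.squeeze(-1) - targ).pow(2).mean()  # not needed for grads
    # backward pass, in reverse order
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
    return loss
```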
So when people say, as they love to, that backpropagation is not just the chain rule, they're basically lying to you. Backpropagation is the chain rule, where we just save away all the intermediate calculations so we don't have to calculate them again. So this is a full forward and backward pass. One interesting thing here is this value, loss: we never actually use it, because the loss never actually appears in the gradients. Just by the way, you still probably want it, so you can print it out or whatever, but it's not something that appears in the gradients. So that's it. w1.g, w2.g, et cetera now contain all of our gradients, which we're going to use for the optimizer. And so let's cheat and use PyTorch autograd to check our results, because PyTorch can do this for us. Let's clone all of our weights and biases and input, and then turn on requires grad for all of them. requires_grad_() is how you take a PyTorch tensor and turn it into a magical autograd-ified PyTorch tensor. What it's now going to do is keep track of everything that gets calculated with this tensor. It basically keeps track of these steps, so that it can then do these things. It's not actually that magical; you could totally write it yourself. You just need to make sure that each time you do an operation, you remember what it was, so that you can then go back through them in reverse order. Okay, so now that we've done requires_grad_, we can do the forward pass like so, which gives us the loss. In PyTorch, you say loss.backward(), and now we can test. Remember, PyTorch doesn't store things in .g, it stores them in .grad, and we can test them, and all of our gradients were correct, or at least they're the same as PyTorch's. Okay, so that's pretty interesting, right? 
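The autograd check can be sketched on an even smaller example: one linear layer plus MSE, with a hand-derived gradient compared against what loss.backward() produces. The sizes here are arbitrary:

```python
import torch

torch.manual_seed(0)
x = torch.randn(10, 4)
y = torch.randn(10)
w = torch.randn(4, 1) * 0.1

# hand-computed gradient for out = x @ w, loss = mse(out, y)
out = x @ w
loss = (out.squeeze(-1) - y).pow(2).mean()
out_g = 2. * (out.squeeze(-1) - y).unsqueeze(-1) / out.shape[0]
w_g = x.t() @ out_g

# now let autograd do it: requires_grad_ turns on tracking
w2 = w.clone().requires_grad_(True)
loss2 = ((x @ w2).squeeze(-1) - y).pow(2).mean()
loss2.backward()

# PyTorch stores gradients in .grad rather than our .g
assert torch.allclose(w_g, w2.grad, atol=1e-5)
```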
I mean, that's an actual neural network that contains all the main pieces we're going to need, and we've written all these pieces from scratch, so there's nothing magical here. But let's do some core refactoring. I really love this refactoring, and it's massively inspired by, in fact very closely stolen from, the PyTorch API. Interestingly, I didn't have the PyTorch API in mind as I did it, but as I kept refactoring, I noticed, oh, I've just recreated the PyTorch API. That makes perfect sense. So let's take each of our layers, ReLU and linear, and create classes. And for the forward, let's use __call__ (dunder call). Do you remember that __call__ means we can treat the object as if it were a function? So if you call the class instance just with parentheses, it calls this function. Let's save the input, save the output, and return the output. And then for backward (remember, this was our backward pass), it's exactly the same as before, but we're going to save the result inside self.inp.g. So this is exactly the same code as we had before, but I've just moved the forward and backward into the same class. Here's linear: forward, exactly the same, but each time I'm saving the input, saving the output, and returning the output, and then here's our backward. One thing to notice: in the backward pass for linear, we don't just want the gradient of the outputs with respect to the inputs; we also need the gradient of the outputs with respect to the weights, and with respect to the biases. That's why we've got three lots of .g's going on here. So there's our linear layer, forward and backward. And then we've got our mean squared error. There's its forward, where we save away both the input and the target for use later, and there's its gradient, again the same as before, two times input minus target. 
So with this refactoring, we can now create our model. We can create a model class with something called .layers, containing a list of all of our layers. Notice I'm not using any PyTorch machinery; this is all from scratch. Let's define loss, and then let's define __call__, which is going to go through each layer and say x = l(x). That's how I do the function composition: just calling each function on the result of the previous one. And then at the very end, call self.loss on that. And then for backward, we do the exact opposite: we go self.loss.backward(), and then we go through the reversed layers and call backward on each one. And remember, the backward passes are going to save the gradients away inside .g. So with that, let's set all of our gradients to None, so we know we're not cheating. We can then create our model (this class Model) and call it as if it were a function, because we've defined __call__. Then we can call backward, and then we can check that our gradients are correct. So that's nice. One thing that's not nice is: holy crap, that took a long time. Let's run it. There we go, 3.4 seconds. That was really, really slow, so we'll come back to that. I don't like duplicate code, and there's a lot of duplicate code here: self.inp = inp, return self.out. That's messy, so let's get rid of it. What we can do is create a new class called Module, which basically does the self.inp = inp and return self.out for us. And now we're not going to use __call__ to implement our forward; instead, __call__ is going to call something called self.forward, which we initially set to raise an exception saying not implemented. And backward is going to call self.bwd, passing in the things we just saved. So now ReLU has something called forward, which just has that, and backward just has that. So we're basically back to where we were. 
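The whole refactoring can be sketched like this. It's close in spirit to what the lesson describes, though the exact class and argument names here are mine:

```python
import torch

class Module():
    def __call__(self, *args):
        # the boilerplate factored out: save inputs, save output, return it
        self.args = args
        self.out = self.forward(*args)
        return self.out
    def forward(self, *args): raise Exception('not implemented')
    def backward(self): self.bwd(self.out, *self.args)

class Relu(Module):
    def forward(self, inp): return inp.clamp_min(0.)
    def bwd(self, out, inp): inp.g = (inp > 0).float() * out.g

class Lin(Module):
    def __init__(self, w, b): self.w, self.b = w, b
    def forward(self, inp): return inp @ self.w + self.b
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        self.w.g = inp.t() @ out.g
        self.b.g = out.g.sum(0)

class Mse(Module):
    def forward(self, inp, targ):
        return (inp.squeeze(-1) - targ).pow(2).mean()
    def bwd(self, out, inp, targ):
        inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()
```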
So now look how neat that is. And we also realized that this thing we were doing (not this thing, this thing) to calculate the derivative of the output of the linear layer with respect to the weights, where we were doing an unsqueeze and an unsqueeze, is basically a big outer product and a sum. We could re-express that with einsum. And when we do, our code is now neater, and our 3.4 seconds is down to 143 milliseconds. So thank you again, einsum. So you'll see this now: look, model = Model(), loss = model(x, y), loss.backward(), and now the gradients are all there. That looks almost exactly like PyTorch, and so we can see why it's done this way. Why do we have to inherit from nn.Module? Why do we have to define forward? This is why: it lets PyTorch factor out all this duplicate stuff, so all we have to do is write the implementation. So I think that's pretty fun. And then once we'd done that, we thought more about it and were like, what are we doing with this einsum? And we realized that it's exactly the same as just doing the input transposed, matrix-multiplied by the output gradient. So we replaced the einsum with a matrix product, and that's 140 milliseconds. And now we've basically implemented nn.Linear and nn.Module, so let's use nn.Linear and nn.Module, because we're allowed to, that's the rules. Their forward pass is almost exactly the same speed as our forward pass, and their backward pass is about twice as fast. I'm guessing that's because we're calculating all of the gradients, and they only calculate the ones they need, but it's basically the same thing. Okay, so at this point, we're ready in the next lesson to do a training loop. We have a multi-layer, fully connected neural network, what the He paper would call a rectified network. We have matrix multiplication sorted. We have our forward and backward passes, all nicely refactored out into classes and a Module class. 
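The three equivalent ways of computing the weight gradient mentioned above (outer product and sum, einsum, plain matrix product with a transpose) can be checked side by side. The sizes here are just illustrative:

```python
import torch

torch.manual_seed(0)
inp = torch.randn(64, 784)    # a batch of inputs
out_g = torch.randn(64, 50)   # gradient flowing back from the next layer

# unsqueeze version: a big outer product per sample, then a sum
v1 = (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0)
# einsum version
v2 = torch.einsum('bi,bj->ij', inp, out_g)
# plain matrix product with a transpose
v3 = inp.t() @ out_g

assert torch.allclose(v1, v2, atol=1e-4)
assert torch.allclose(v2, v3, atol=1e-4)
```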
So in the next lesson, we'll see how far we can get. Hopefully we'll build a high-quality, fast ResNet. And we're also going to take a very deep dive into optimizers, callbacks, training loops, and normalization methods. Any questions before we go? No? That's great. Okay, thanks everybody. See you on the forums.