Hi all, and welcome to lesson 15. What we're going to endeavor to do today is create a convolutional autoencoder, and in the process we'll see why doing that well is a tricky thing to do. Time permitting, we'll begin to work on a deep learning framework to make life a lot easier. I'm not sure how far we'll get on that today, so let's see how we go and get straight into it.
Before we can create a convolutional autoencoder, we need to talk about convolutions: what are they, and what are they for? Broadly speaking, convolutions let us tell our neural network a little bit about the structure of the problem, which makes it a lot easier for the network to solve. In particular, the structure of our problem is that we're working with images. Images are laid out on a grid: a 2D grid for black and white, 3D for color, 4D for color video, and so on. There's a relationship between the pixels going across and the pixels going down. Nearby pixels tend to be similar to each other, and differences between pixels across those dimensions tend to have meaning. Patterns of pixels that appear in different places often represent the same thing: a cat in the top left is still a cat, even if it's in the bottom right.
This kind of prior information is naturally captured by a convolutional neural network, that is, a network that uses convolutions. Generally speaking, this is a good thing, because it means we can use fewer parameters and less computation: more of the information about the problem we're solving is encoded directly into our architecture. There are other architectures that don't encode that prior information as strongly, such as a multi-layer perceptron, which we've been looking at so far, or a transformer network, which we haven't looked at yet.
Those kinds of architectures give us more flexibility, and given enough time, compute, and data, they could potentially find things that CNNs would struggle to find. So we're not always going to use convolutional neural networks, but they're a pretty good starting point and certainly something important to understand. They're not just used for images, either. We can also take advantage of one-dimensional convolutions for language-based tasks, for instance. So convolutions come up a lot.
One thing you might notice in this notebook is that we are now importing stuff from miniai. miniai is the little library that we're starting to create, and we're creating it using nbdev. So we've got a miniai.training and a miniai.datasets. If we look, for example, at the datasets notebook, it starts with a directive saying that the default export module is called datasets, some of the cells have an export directive on them, and at the very bottom there's a call to nbdev's export. What that does is create a file called datasets.py containing the cells that we exported. Why is it called miniai.datasets? Because everything for nbdev is stored in settings.ini, and there's an entry there, lib_name, saying the library is called miniai.
You can't use this library until you install it. We haven't uploaded it to PyPI, that is, we haven't made it a pip-installable package on the public server. But you can install a local directory as if it were a Python module you'd installed from the internet. To do that, you say pip install in the usual way, but you add -e, which stands for editable, and that means: set up the given directory as a Python module. It can be any directory you like; I just put a dot to mean the current directory.
You'll see that goes ahead and actually installs my library, and after I've done that I can import things from it, as you see. This part is just the same as before: we grab our MNIST dataset, and we're going to create a convolutional neural network on it.
Before we do that, let's talk about what convolutions are. One of my favorite descriptions of convolutions comes from a student in, I think, our very first course, Matt Kleinsmith, who wrote a really nice Medium article, CNNs from different viewpoints, which I'm going to steal from. Here's the basic idea. Say that this is our image: a three by three image with nine pixels, labeled with capital letters from A to J. A convolution uses something called a kernel, and a kernel is just another tensor. In this case it's a two by two matrix, and we have alpha, beta, gamma, and delta as its four values.
One thing I'll mention about the Greek letters, and I think I have said this before: you want to be able to pronounce them. If you don't know how to read these and say their names, head over to Wikipedia or wherever and learn the names of all the Greek letters, because they come up all the time.
So what happens when we apply a convolution with this two by two kernel to this three by three image? It doesn't have to be an image, by the way. In this case it's just a rank-2 tensor, but it might represent an image. We take the kernel and overlay it on the first little two by two subgrid, like so, and we match color to color. So the output of this first two by two overlay is alpha times A, plus beta times B, plus gamma times D, plus delta times E. That yields some value, P.
And that ends up in the top left of a two by two output. For the top right of the output, we slide our kernel across, like a sliding window, and apply each of our coefficients to the respectively colored squares. Then ditto for the bottom left, and ditto for the bottom right. So we end up with these equations. P, as we discussed, is alpha A plus beta B plus gamma D plus delta E, plus some bias term. Q, the top right, starts with alpha times B, and so on: each time we multiply the pairs together and add them up. You can imagine we're basically flattening each window and the kernel out into rank-1 tensors, into vectors, and then doing a dot product. That's one way of thinking about what's happening as we slide this kernel over these windows. And this is called a convolution.
So let's try to create a convolution. Let's grab our training images, take a look at one, and create a three by three kernel. The word kernel appears a lot in computer science and math. We've already seen it used to mean a piece of code that we run on a GPU across lots of parallel virtual devices, potentially in a grid. There's a similar idea here: we've got a computation, in this case something like a dot product, occurring lots of times as it slides over a grid. But it's a bit different; it's another use of the word kernel. In this case, a kernel is going to be a rank-2 tensor. So let's create a kernel with these values in a three by three matrix and draw what it looks like. Not surprisingly, it just looks like a bunch of lines. So what would happen if we slide this, over just nine pixels at a time, over this 28 by 28 image?
Well, if we name the pixels of, say, the top-left three by three section a1 to a9, then we end up with negative a1, minus a2, minus a3, because the top three kernel values are all negative. The next three are just zero, so they won't do anything. And then plus a7, plus a8, plus a9. Why is that interesting? Let's try it here. I've grabbed just the first 13 rows and first 23 columns of our image, and I'm showing the actual numbers, using gray conditional formatting, or the equivalent in pandas, to shade them. So we're looking at just this top bit of the digit.
What happens if we take rows three, four, and five (remember the end of the slice is not inclusive) and columns 14, 15, and 16? What does that give us if we multiply it by this kernel? It gives us a fairly large positive value, because the three pixels that we have negatives on, the top row, are all zero, and the three that we have positives on are all close to one. What about the same columns, but for rows seven, eight, and nine? Here the top is all positive and the bottom is all zero, so we're going to get a lot of negative terms. And not surprisingly, that's exactly what we see when we do this kind of dot product equivalent, which in NumPy is just an element-wise multiplication followed by a sum. That's quite a large negative number.
So perhaps you're seeing what this is doing, and maybe you got a hint from the name of the tensor we created: it's something that finds top edges. This window is a top edge, so the result is positive, and this one is a bottom edge, so it's negative.
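To make the top-edge idea concrete, here is a minimal sketch in PyTorch. It does not use the MNIST digit from the notebook; instead it assumes a synthetic image (a bright horizontal bar on a dark background), and the names img, top_window, and bottom_window are made up for illustration.

```python
import torch

# Top-edge kernel: negative top row, zero middle row, positive bottom row.
top_edge = torch.tensor([[-1., -1., -1.],
                         [ 0.,  0.,  0.],
                         [ 1.,  1.,  1.]])

# Synthetic stand-in for the digit (an assumption, not the MNIST image):
# a bright horizontal bar occupying rows 4..6 on a dark background.
img = torch.zeros(10, 10)
img[4:7, 2:8] = 1.0

# Window straddling the top of the bar: dark row above, bright rows below.
top_window = img[3:6, 3:6]
# Window straddling the bottom of the bar: bright rows above, dark row below.
bottom_window = img[5:8, 3:6]

# The "dot product equivalent": element-wise multiply, then sum.
top_response = (top_window * top_edge).sum()
bottom_response = (bottom_window * top_edge).sum()

print(top_response.item(), bottom_response.item())
```

With this synthetic bar the top-edge window gives a positive response and the bottom-edge window gives a negative one, which is the same pattern as the two windows examined on the digit.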
We'd like to apply this kernel to every single three by three section in the image. We can do that by creating a little apply_kernel function that takes a particular row, a particular column, and a particular tensor as a kernel, and does that multiply-and-sum we just saw. For example, we can replicate the earlier result by calling apply_kernel, where the row and column we pass give the center of that three by three area. And there's that same number, 2.97.
Now we can apply the kernel to every one of the three by three windows in this 28 by 28 image. It slides over the image like this red bit in the animation, except that we've got a 28 by 28 input, not just a five by five input. To get all of the coordinates, let's first simplify to the five by five case. We can create a list comprehension: take i through every value in range(5), and for each of those take j through every value in range(5). If we just look at the tuple (i, j), you can see we get a list of lists containing all of those coordinates. This is a list comprehension inside a list comprehension, which may be surprising or confusing when you first see it, but it's a really helpful idiom, and I certainly recommend getting used to it.
What we're going to do now is not just create this tuple, but call apply_kernel for each of those coordinates. We go through a range from 1 to 27, which is really 1 to 26 because the 27 is exclusive, and for each of those we go through 1 to 26 again and call apply_kernel. That gives us the result of applying the convolutional kernel at every one of those coordinates. And there's the result: as we hoped, it's highlighting the top edges. You might find it surprising that it's that easy to do this kind of image processing.
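The sliding-window loop just described can be sketched roughly as follows. This is an illustration under assumptions: the image is a synthetic 28 by 28 rectangle rather than the MNIST digit, and this apply_kernel takes the image as an explicit argument, which may differ from the signature used in the notebook.

```python
import torch

top_edge = torch.tensor([[-1., -1., -1.],
                         [ 0.,  0.,  0.],
                         [ 1.,  1.,  1.]])

# Synthetic 28x28 image (an assumption): a bright rectangle on a dark background.
img = torch.zeros(28, 28)
img[10:18, 8:20] = 1.0

def apply_kernel(img, row, col, kernel):
    # (row, col) is the centre of the 3x3 window being looked at.
    window = img[row - 1:row + 2, col - 1:col + 2]
    return (window * kernel).sum().item()

# The nested-comprehension idiom on a small range: a list of lists of coordinates.
coords = [[(i, j) for j in range(5)] for i in range(5)]

# Centres 1..26 inclusive, so the 3x3 window always fits inside the image.
rng = range(1, 27)
result = torch.tensor([[apply_kernel(img, i, j, top_edge) for j in rng] for i in rng])

print(result.shape)  # a 26x26 grid of responses
```

The result is positive along the top of the rectangle, negative along the bottom, and zero in flat regions. It is also slow, since every window is handled by Python code.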
We're literally just doing an element-wise multiplication and a sum for each window. So that is called a convolution.
We can do another convolution, this time with a left-edge tensor. As you can see, it looks just like a rotated, or I guess transposed, version of our top-edge tensor. Here's what it looks like. This time we apply the left-edge kernel. I was about to say we're passing in a function, but it's not a function, it's just a tensor. So we pass the left-edge tensor into the same list comprehension inside a list comprehension, and this time we get back the left edges: it's highlighting all of the left edges in the digit. So this is basically what's happening: a small kernel is looped over an image, creating these outputs.
You'll see that in the process we are losing the outermost pixels of our image. We'll learn how to fix that later. For now, just notice that as we put our three by three kernel through this five by five input, there are only three places we can put it going across, not five, because the kernel needs to fit inside the edge.
So that's a convolution. If you remember back to the Zeiler and Fergus pictures from lesson one, you might recognize that the first layer of a convolutional network is often looking for edges and gradients and things like that. This is how it does it. Then convolutions on top of convolutions, with nonlinear activations between them, can combine those into curves or corners and so on.
Okay, so how do we do this quickly? Because currently this is going to be super slow, doing it in Python.
One of the very earliest, probably the earliest, publicly available general-purpose GPU-accelerated deep learning libraries I saw was called Caffe, created by Yangqing Jia. He actually described how Caffe went about implementing a fast convolution on a GPU. Basically he said: I had two months to do it, and I had to finish my thesis. There was some other code out there, from Krizhevsky, who you might have come across. He and Hinton set up a little startup which Google bought, and that became the start of Google's deep learning, of Google Brain basically. Krizhevsky had all this fancy stuff in his library, but Yangqing Jia said: I didn't know how to do all that stuff, but I already know how to multiply matrices, so maybe I can convert a convolution into a matrix multiplication. That approach is known as im2col: a way of converting a convolution into a matrix multiply.
I suspect Yangqing Jia may have accidentally reinvented it, because I believe it had already been around for a while at the point he was writing his thesis. This paper is the place I believe it was first described, in 2006, which is a while ago. This figure is from that paper. What they describe is: say you are putting this two by two kernel over this three by three bit of an image, so this window needs to be matched against the kernel. You can unroll this window, one, two, one, one, to here: one, two, one, one. And you can unroll the kernel, one, one, two, two, to here: one, one, two, two.
Once they've been flattened out and moved in that way, you do exactly the same thing for the next patch: two, zero, one, three. You flatten it out and put it here: two, zero, one, three. If you take the kernels and flatten them out in this format too, then you end up with a matrix multiply. If you multiply this matrix by this matrix, you get the output that you want from the convolution. So this is a way of unrolling your kernels and your input features into matrices such that when you do the matrix multiply, you get the right answer. It's a nifty trick, and it's called im2col.
I guess we're cheating a little bit here. Implementing im2col is kind of boring; it's just a bunch of copying and tensor manipulation, so I haven't done it. Instead I've linked to a NumPy implementation, which is here, and part of it is this get_indices function, which is here. As you can see, it's a bit tedious, with repeats and tiles and reshapes and whatnot. I'm not going to call it homework, but if you want to practice your tensor indexing and manipulation skills, try creating a PyTorch version from scratch. I've got to admit I didn't bother. Instead, I used the one that's built into PyTorch, where it's called unfold.
PyTorch expects there to be a batch axis and a channel dimension, so we add two leading unit dimensions to our image. Then we can unfold our input for a three by three kernel, and that gives us a 9 by 676 input. Then we take our kernel and flatten it out into a vector: view changes the shape, and -1 just says dump everything into this dimension. So that creates a length-9 vector. Now we can do the matrix multiply, just like they've done in the paper, of the kernel matrix, which is our weights, by the unrolled input features.
That gives us a length-676 vector, which we can view as 26 by 26, and we get back, as we hoped, our left-edge result. So this is how we can create a better implementation of convolutions more or less from scratch. The reason I'm allowed to cheat here is that we did actually create convolutions from scratch. We're just not always creating the GPU-optimized versions from scratch, which was never something I promised. So I think that's fair, and it's cool that we can hack out a GPU-optimized version in the same way that the original deep learning library did.
If we use apply_kernel, it takes nearly nine milliseconds. If we use unfold with a matrix multiply, it takes about 20 microseconds. That's about 400 times faster, which is pretty cool. Of course, we don't have to use unfold and matrix multiply, because PyTorch has conv2d. We can run that, and interestingly it's about the same speed, at least on CPU. The unfold version would work on GPU just as well. I'm not sure this will always be the case. It's a pretty small image here, and I haven't experimented a whole lot to see where there's a big difference in speed between them. Obviously I always just use F.conv2d, but if you need to do some trickier convolution, with some weird thing around channels or dimensions, you can always try this unfold trick. It's nice to know it's there.
We can do the same thing for diagonal edges: here's our diagonal-edge kernel, and here's the other diagonal. If we grab the first 16 images, we can do a convolution on our whole batch with all of our kernels at once. That's a nice optimized thing we can do. You end up with your 26 by 26 grid, your four kernels, and your 16 images, and that's summarized here.
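As a rough check of the unfold trick, the sketch below compares unfold plus a matrix multiply against F.conv2d, then runs a batch of images through several kernels at once. The inputs are random tensors rather than MNIST, and the four kernels are illustrative guesses at edge detectors, not necessarily the exact values used in the notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = torch.rand(28, 28)                      # random stand-in for one image

left_edge = torch.tensor([[-1., 0., 1.]] * 3)  # transposed top-edge kernel (assumed values)

inp = img[None, None]                         # add batch and channel axes: (1, 1, 28, 28)
inp_unf = F.unfold(inp, (3, 3))[0]            # every 3x3 patch as a column: (9, 676)
w = left_edge.view(-1)                        # kernel flattened to a length-9 vector
out_unf = (w @ inp_unf).view(26, 26)          # matrix multiply, then back to a grid

out_conv = F.conv2d(inp, left_edge[None, None])[0, 0]
print(torch.allclose(out_unf, out_conv, atol=1e-5))

# A whole batch with several kernels at once.
xb = torch.rand(16, 1, 28, 28)
top_edge = left_edge.T.contiguous()
diag1 = torch.tensor([[0., -1., 1.], [-1., 1., 0.], [1., 0., 0.]])
diag2 = torch.tensor([[1., -1., 0.], [0., 1., -1.], [0., 0., 1.]])
kernels = torch.stack([left_edge, top_edge, diag1, diag2])[:, None]  # (4, 1, 3, 3)
batch_out = F.conv2d(xb, kernels)
print(batch_out.shape)                        # images x kernels x height x width
```

The output axes are batch, kernel, height, width, which matches the 16 images, four kernels, and 26 by 26 grid just described.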
That's generally how we get good GPU acceleration: we apply a bunch of kernels to a bunch of images all at once, across all of their pixels. And here's what we get when we look at our various kernels for a particular image: left edge, top edge, and then diagonal top-left and top-right. So those are optimized convolutions, and they work just as well on CPU or GPU. Obviously, GPU will be faster if you have one.
Now, how do we deal with the problem that we're losing one pixel on each side? We can add something called padding. With padding, rather than starting our window here, we start it over here, and it would be up one as well. For these three cells on the left, which fall outside the image, we take the input to be zero. There are other options: we could assume they're the same as the pixel next to them, for instance. But the simplest, and the one we normally use, is to assume they're zero.
That's called one-pixel padding. Let's say instead we did two-pixel padding, with a five by five input and a four by four kernel. The gray is our kernel. Then we start way up here in the corner, and you can see what happens as we slide the kernel over: those are all the spots it takes. The dotted-line area is the area we're effectively going through, but all of these white bits are treated as zero. The green is the output size we end up with, which is six by six for a five by five input. I should mention that kernels with even-numbered edge sizes are not used very often. We normally use odd-numbered kernels.
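The padding arithmetic can be checked with a small sketch. The helper conv_out_size is a hypothetical name introduced here for the standard per-dimension output-size formula; the tensors are random placeholders.

```python
import torch
import torch.nn.functional as F

def conv_out_size(n, ks, pad=0, stride=1):
    # Output size along one dimension for input size n, kernel size ks,
    # zero padding `pad` on each side, and the given stride.
    return (n + 2 * pad - ks) // stride + 1

x28 = torch.rand(1, 1, 28, 28)
k3 = torch.rand(1, 1, 3, 3)
no_pad = F.conv2d(x28, k3)              # loses one pixel on each side
one_pad = F.conv2d(x28, k3, padding=1)  # one-pixel zero padding

x5 = torch.rand(1, 1, 5, 5)
k4 = torch.rand(1, 1, 4, 4)
two_pad = F.conv2d(x5, k4, padding=2)   # the 5x5 input, 4x4 kernel, two-pixel case

print(no_pad.shape[-2:], one_pad.shape[-2:], two_pad.shape[-2:])
print(conv_out_size(28, 3), conv_out_size(28, 3, pad=1), conv_out_size(5, 4, pad=2))
```

F.conv2d pads with zeros, which is the simplest option mentioned above. Other padding schemes need a separate padding step before the convolution.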
If you use, for example, a three by three kernel and one pixel of padding, you get back the same size you started with. If you use five by five with two pixels of padding, you also end up with the same size you started with. So kernels with odd-numbered edge sizes are easier to deal with if you want to end up with the same size you started with. As it says here, if you've got an odd-sized ks by ks kernel, then ks truncating-divide 2, which is what // means, gives you the right amount of padding.
Another trick is that you don't always have to move your window across by one each time. You can move it by a different amount, and the amount you move it by is called the stride. For example, here's a case of stride two with padding one: we start out here, then we jump across two, then jump across two again, and then we go to the next row. That's called a stride-2 convolution. Stride-2 convolutions are handy because they reduce the grid size of your input by a factor of two in each dimension, and that's something we want to do a lot. With an autoencoder, we want to do that. And in fact, most classification architectures do exactly that: they keep on reducing the grid size by a factor of two, again and again, using stride-2 convolutions with padding of one. So that's strides and padding.
Let's go ahead and create a convnet using these approaches. We get the size of our training set, and this is all the same as before: number of categories (the number of digits) and the size of our hidden layer. Previously, with our sequential linear models, our MLPs, we went from the number of pixels to the number of hidden units, then a ReLU, then from the number of hidden units to the number of outputs. So here's the equivalent with convolutions.
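A direct translation of the MLP into convolutions might look something like the sketch below. The name naive_cnn and the hidden size nh = 50 are assumptions for illustration, and the batch is random data. The first part also shows a stride-2, padding-1 convolution halving the grid.

```python
import torch
from torch import nn

xb = torch.rand(16, 1, 28, 28)   # random stand-in for a mini-batch of 16 images
nh = 50                          # hidden size (an assumed value)

# A stride-2 convolution with padding 1 halves the grid: 28x28 -> 14x14.
down = nn.Conv2d(1, 4, kernel_size=3, stride=2, padding=1)
print(down(xb).shape)

# Naive "MLP with the linear layers swapped for convs": no stride, so the grid never shrinks.
naive_cnn = nn.Sequential(
    nn.Conv2d(1, nh, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(nh, 10, kernel_size=3, padding=1),
)
naive_out = naive_cnn(xb)
print(naive_out.shape)           # still has a 28x28 grid per output channel
```

The naive model's output keeps the full 28 by 28 grid for each of its 10 output channels, rather than one row of 10 numbers per image.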
The problem is that you can't just do that, because the output is now not 10 probabilities for each item in our batch, but 10 probabilities for each item in our batch for each of the 28 by 28 pixels, since we don't have a stride or anything. So you can't use the same simple approach that we had for the MLP. We have to be a bit more careful.
To make life easier, let's create a little conv function that does a conv2d with a stride of two, optionally followed by an activation. If act is true, we add a ReLU activation. So this returns either a conv2d, or a little sequential containing a conv2d followed by a ReLU. Now we can create a CNN from scratch as a sequential model. Since act is true by default, the first layer takes our 28 by 28 image, starting with one channel, and creates an output with four channels. This is the number of inputs, and this is the number of filters. Sometimes we say filters to describe the number of channels our convolution has; that's the number of outputs. It's very similar to the idea of the number of outputs in a linear layer, except this is the number of outputs of your convolution.
What I like to do when I create stuff like this is add a little comment to remind myself what my grid size is after each layer. I had a 28 by 28 input and put it through a stride-2 conv, so the output will be 14 by 14. Then we do the same thing again, but this time going from a four-channel input to an eight-channel output (7 by 7), and then from eight to 16. By this point we're down to four by four, then down to two by two, and finally down to one by one. On the very last layer we don't add an activation, and it creates 10 outputs. Since we're now down to one by one, we can just call flatten, and that removes those unnecessary unit axes.
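Here is one way the conv helper and the sequential model described above could be written. Treat it as a sketch: the parameter names (ni, nf, ks, act) and the 16-to-16 channel layer that takes the grid from four by four to two by two are assumptions, since the description only names the 1-to-4, 4-to-8, 8-to-16, and final 10-output layers.

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=True):
    # Stride-2 conv with padding ks//2, optionally followed by a ReLU.
    res = nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)
    if act:
        res = nn.Sequential(res, nn.ReLU())
    return res

simple_cnn = nn.Sequential(
    conv(1, 4),               # 14x14
    conv(4, 8),               # 7x7
    conv(8, 16),              # 4x4
    conv(16, 16),             # 2x2 (channel count here is an assumption)
    conv(16, 10, act=False),  # 1x1, no activation on the last layer
    nn.Flatten(),             # drop the unit height and width axes
)

xb = torch.rand(16, 1, 28, 28)   # random stand-in for a mini-batch
out = simple_cnn(xb)
print(out.shape)
```

Because the grid-size comments depend on a 28 by 28 input, a different input size would not end at one by one and the flattened output would no longer have 10 columns.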
If we pop our mini-batch through that, we end up with exactly what we want: a 16 by 10. For each of our 16 images, we've got 10 probabilities, one for each possible digit. So we take our training set and make it into 28 by 28 images, do the same thing for the validation set, and create two datasets, one for each: a training dataset and a validation dataset.
Now we're going to train this on the GPU. If you've got an Apple Silicon Mac, you've got a device called MPS, which uses your Mac's GPU. If you've got NVIDIA, you can use CUDA, which uses your NVIDIA GPU. CUDA is 10 times or more, possibly much more, faster than a Mac, so you definitely want to use NVIDIA if you can. But if you're just running on a Mac laptop or whatever, you can use MPS. So you need to know which device to use: CUDA or MPS? You can check torch.backends.mps.is_available to see if you're running on a Mac with MPS, and torch.cuda.is_available to see if you've got an NVIDIA GPU, in which case you've got CUDA. If you've got neither, of course, you'll have to use the CPU.
I've created a little function here, to_device, which takes a tensor, or a dictionary or list of tensors, and a device to move it to, and it goes through and moves everything onto that device; if it's a dictionary, it moves the values onto that device. It's a handy little function. Then we can create a custom collate function, which calls the PyTorch default collation function and then puts those tensors onto our device. With that, we've got enough to train this neural net on the GPU. We created this get_dls function in the last lesson, so we use that, passing in the datasets we just created and our collation function. We create our optimizer using our CNN's parameters.
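The device handling just described might look roughly like this. It is a sketch, not the library's actual code: the names def_device and collate_device are assumptions, and the batch is made of dummy tensors.

```python
import torch
from collections.abc import Mapping
from torch.utils.data import default_collate

# Pick MPS on Apple Silicon, CUDA on NVIDIA, otherwise fall back to the CPU.
def_device = ('mps' if torch.backends.mps.is_available()
              else 'cuda' if torch.cuda.is_available()
              else 'cpu')

def to_device(x, device=def_device):
    # Move a tensor, or every tensor inside a dict, list, or tuple, onto `device`.
    if isinstance(x, torch.Tensor):
        return x.to(device)
    if isinstance(x, Mapping):
        return {k: to_device(v, device) for k, v in x.items()}
    return type(x)(to_device(o, device) for o in x)

def collate_device(batch):
    # Default PyTorch collation, followed by a move to the chosen device.
    return to_device(default_collate(batch))

# Dummy (image, label) samples standing in for dataset items.
samples = [(torch.zeros(1, 28, 28), torch.tensor(3)) for _ in range(4)]
xb, yb = collate_device(samples)
print(xb.shape, yb.shape, xb.device)
```

A function like collate_device can be passed as the collate_fn argument of a PyTorch DataLoader, so every batch arrives already on the right device.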
And then we call fit, which, remember, we also created in the last lesson. And it's done. What I did then was reduce the learning rate by a factor of four and run it again, and eventually I got to a fairly similar accuracy to what we got with our MLP. So we've got a convolutional network working, which I think is pretty encouraging. It's also nice that we didn't have to write much code to train it. We were able to use code we had already built: the Dataset class that we made, the get_dls function, and the fit function. Because those things are written in a fairly general way, they work just as well for a convnet as they did for an MLP. Nothing had to change.
Notice I had to put the model on the device as well. That goes through and puts all of the tensors in the model onto the MPS or CUDA device, if appropriate.
If we've got a batch size of 64 and one channel of 28 by 28, then our axes are batch, channel, height, width. Normally this is referred to as NCHW. Generally, when you see N in a paper used this way, it refers to the batch size: N for number, that's the mnemonic, the number of items in the batch. C is the number of channels, then height by width: NCHW. TensorFlow doesn't use that. TensorFlow uses NHWC, which we generally call channels last, since channels are at the end, and NCHW we normally call channels first. Of course, it's not actually channels first, it's channels second, but we ignore the batch bit. In some models, particularly some more modern models, it turns out that channels last is faster, so PyTorch has recently added support for channels last, and you'll see it being used more and more.
All right, a couple of comments and questions from our chat.
The first is Sam Watkins pointing out that we've actually had a bit of a win here, which is that the number of parameters in our CNN is pretty small by comparison. In the MLP version, the number of parameters is basically the size of this matrix, m times nh, plus the number in this one, which is nh times 10. I'm ignoring the bias there, of course. Something we probably should do at some point is create something that calculates the number of parameters automatically. Let's see, what would be a good way to do that? Maybe np.product. There we go. We can calculate this automatically with a little list comprehension. So there's the number of parameters in each of the different layers, both bias and weights. Then we can turn that into a tensor and sum it up, using PyTorch. So that's the number in our MLP, and then the number in our simple CNN. That's pretty cool: we've gone down from about 40,000 to about 5,000, and got about the same accuracy. Oh, thank you, Jonathan. Jonathan's reminding me that there's a better way than np.product(o.shape), which is just to say o.numel(), number of elements. Same thing. Very nice.
One person asked a very good question: I thought convolutional neural networks could handle any sized image? Actually, no, this convolutional network cannot handle any sized image. It only handles images that, once they go through these stride-2 convs, end up at one by one, because otherwise you can't flatten the result and end up with 16 by 10. We will learn how to create convnets that can handle any sized input, but there's nothing about a convnet that means it necessarily handles any sized input.
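The parameter comparison can be reproduced with a short sketch. The layer sizes are assumptions: an MLP with 784 inputs and nh = 50 hidden units, and a CNN with channel sizes 1, 4, 8, 16, 16, 10. The helper name n_params is made up for illustration.

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=True):
    # Stride-2 conv with padding ks//2, optionally followed by a ReLU.
    res = nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)
    return nn.Sequential(res, nn.ReLU()) if act else res

m, nh = 28 * 28, 50   # input pixels and hidden size (assumed values)
mlp = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, 10))
simple_cnn = nn.Sequential(
    conv(1, 4), conv(4, 8), conv(8, 16), conv(16, 16),
    conv(16, 10, act=False), nn.Flatten(),
)

def n_params(model):
    # numel() gives the number of elements in each weight or bias tensor.
    return sum(p.numel() for p in model.parameters())

mlp_params = n_params(mlp)
cnn_params = n_params(simple_cnn)
print(mlp_params, cnn_params)
```

Under these assumed sizes the MLP has a little under 40,000 parameters and the CNN a little over 5,000, in line with the rough figures quoted above.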
Let's briefly finish this section off by talking about the idea of a receptive field. First, consider this one-input-channel, four-output-channel, three by three kernel, just to show you what we're doing here. simple_cnn is the model we created. Remember, it's a sequential model containing sequential models, because that's how our conv function works. So simple_cnn[0] is our first layer, which contains both the conv and the ReLU, and simple_cnn[0][0] is the actual conv. If we grab that and call it conv1, its weight is four by one by three by three: number of outputs, number of input channels, and height by width of the kernel. Then it's got its bias as well. So that's how we can deconstruct what's going on with the weight tensors, the parameters, inside a convolution.
Now I'm going to switch over to Excel. In the lesson notes on the course website, or on the forum, you'll find an Excel workbook. Oh, I've just been reminded that there is a nice trick we can do here, and I do want to show it, because I love this trick. Oh, I just deleted everything though. Let's put it all back. Here we go. The trick is that you don't actually need the square brackets. With square brackets it's a list comprehension; without them it's called a generator. Oh, no, you can't use it there. Maybe that only works with NumPy. No, that doesn't work either. So much for that. I'm kind of curious now: maybe torch.sum, or just sum? Okay, well, I don't want to use Python's sum. That's interesting. I feel like all of them should handle generators, but there you go.
Okay, so open up the conv example spreadsheet. What you'll see on the conv example worksheet is something that looks a lot like the number seven.
And indeed this is the number seven that I got straight from MNIST. Let's see. Okay, so you can see over here, we have a number seven. This is a number seven from MNIST that I have copied into Excel. And then you can see over here, we've got a top edge kernel being applied. And over here, we've got a right edge kernel being applied. This might be surprising you, because you might be thinking, wait a second, Jeremy, Microsoft Excel doesn't do convolutional neural networks. Well, actually it does. So if I zoom in in Excel, you'll see these are in fact numbers with conditional formatting applied to a bunch of spreadsheet cells. And so what I did was I copied the actual pixel values into Excel and then applied conditional formatting. And so now you can see what the digit is actually made of. So you can see here, I've created our top edge filter. And here I've created our left edge filter. And so here I am applying that filter to that window. And so here you can see it looks a lot like NumPy. It's just a sum product. And you might not be aware of this, but in Excel, you can actually do broadcasting. You have to hit Apple-Shift-Enter or Control-Shift-Enter, and it puts these little curly brackets around it. It's called an array formula. It basically lets you do broadcasting, or simple broadcasting, in Excel. And so this is how I created this top edge filtered version in Excel. And the left edge version is exactly the same, just a different kernel. And as you can see, if I click on it, it's applying this filter to this input area and so forth. Okay, so then I just arbitrarily picked some different values here. And so something to notice now in my second layer, so here's conv1, here's conv2: it's got a bit more work to do. We actually need two filters, because we need to add together this bit here with this kernel applied and this bit here with this kernel applied. So you actually need one set of three by three for each input.
And also I want two separate outputs. So I actually end up needing a two by two by three by three weight matrix, or weight tensor, I should say, which you might remember is exactly what we had in PyTorch. We had a rank-4 tensor. So if I have a look at this one, you see exactly the same thing. This input is using this kernel applied to here and this kernel applied to here. So that's important to remember, that you have these rank-4 tensors. And so then rather than doing a stride-2 conv, I did something else, which is actually a bit out of favor nowadays, but it's another option, which is to do something called max pooling to reduce my dimensionality. So you can see here I've got 28 by 28. I've reduced it down here to 14 by 14. And the way I did it was simply to take the max of each little two by two area. Okay, so that's all that's been done there. So that's called max pooling. And so a convolution followed by max pooling has the same effect as a stride-2 conv, not mathematically identical, but the same effect, which is that it does a convolution and reduces the grid size by two on each dimension. Okay, so then how do we create a single output if we don't keep doing this until we get to one by one, which I'm too lazy to do in Excel? Well, one approach, and again, this is a little bit out of favor as well, but one approach is we can take every one of these, we've now got 14 by 14, and apply a dense layer to it. And so what I've done here is, imagine this has basically all been flattened out into a vector. And so here we've got the sum product of this by this plus the sum product of this by this, and that gives us a single number. And so that is how we could then optimize that in order to optimize our weight matrices. Now, in the more modern approach, we don't use this kind of dense layer much anymore, though it still appears a bit.
The main place that you see this used is in a network called VGG, which is very old now, I think it might be 2013 or something, but it's actually still used. And that's because for certain things, like something called style transfer or in general perceptual losses, people still find VGG seems to work better. So you still actually see this approach nowadays sometimes. The more common approach nowadays, however, is we take the penultimate layer and we just simply take the average of all of the activations. So the Excel way of doing it would be to literally say average of the penultimate layer, and that is called global average pooling. Everything has to have a fancy word, a fancy phrase, but that's all it is. Take the average, it's called global average pooling, or you could take the max, and that would be global max pooling. So anyway, the main reason I wanted to show you this was to do something which I think is pretty interesting, which is, I'm just gonna zoom out a little bit here. Let's take something in our max pool here, and I'm gonna say Trace Precedents to show you, here it is, the area that it's coming from. Okay, so it's coming from these four numbers. Now, if I trace precedents again, asking what's actually impacting this, obviously the kernel's impacting it, and then you can see that the input area here is a bit bigger. And then if I trace precedents again, you can see the input area is bigger still. So this number here is calculated from all of these numbers in the input. This area in the input is called the receptive field of this unit. And so the receptive field in this case is one, two, three, four, five, six by six, right? And that means that a pixel way up here in the top right has literally no ability to impact that activation. It's not part of its receptive field. If you have a whole bunch of stride-2 convs, each time you have one, the receptive field is gonna get roughly twice as big.
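The growth of the receptive field described above can be checked with a small calculation. This is a sketch under stated assumptions: the function name `receptive_field` is hypothetical, padding is ignored, and the spreadsheet is assumed to be two 3x3 stride-1 convolutions followed by a 2x2 max pool with stride 2.

```python
def receptive_field(layers):
    """Side length of the input area seen by one output unit.

    layers is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # one input pixel sees itself
    jump = 1  # distance, in input pixels, between neighbouring units of the current layer
    for ks, stride in layers:
        rf += (ks - 1) * jump  # each extra kernel tap reaches one more "jump" outwards
        jump *= stride         # a strided layer spreads its units further apart
    return rf

# Assumed spreadsheet layout: conv 3x3, conv 3x3, then 2x2 max pool with stride 2.
excel_layers = [(3, 1), (3, 1), (2, 2)]
print(receptive_field(excel_layers))

# A stack of 3x3 stride-2 convs: the receptive field roughly doubles per layer.
sizes = [receptive_field([(3, 2)] * depth) for depth in range(1, 6)]
print(sizes)
```

The first call reproduces the six by six area found with the precedent arrows. In the second, each added stride-2 layer slightly more than doubles the side length.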
So the receptive field at the end of a deep network is actually very large. But the inputs closest to the middle of the receptive field have the biggest say in the output, because they implicitly appear the most often in all of these dot products that are inside this convolutional window. So the receptive field is not just a single binary on-off thing. Certainly all the stuff that's not got precedents here is not part of it at all. But the closer to the center of the receptive field, the more impact it's gonna have, the more ability it's got to change this number. So the receptive field is a really important concept, and, yeah, playing around with Excel's precedent arrows I think is a nice way to see that, at least in my opinion. And apart from anything else, it's great fun creating a convolutional neural network in Excel. I thought so anyway. Okay, so let's take a seven minute break. I'll see you back after that to talk about a convolutional autoencoder. All right, okay, welcome back. We're gonna have a look now at the autoencoder notebook. So we're just gonna import all of our usual stuff, and we've got one more of our own modules to import now as well. And this time we're gonna switch to a different dataset, which is the Fashion MNIST dataset. We can take advantage of the stuff that we did in 05_datasets and the Hugging Face stuff to load it. So we've seen this a little bit before, back in our datasets notebook here. And we never actually built any models with it. So let's first of all do that. So this is just going to convert each image into a tensor. And that's gonna be an inplace transform. Remember, we created this decorator. And so we can call with_transform on the dataset dictionary. This is all stuff we've done before. And so here we have our example of a sneaker, all right. And we will create our collation function, collate_dict, for that dataset.
That's something you should remind yourself of: we built that ourselves in the datasets notebook. And let's actually make our collate function something that does to_device, which we wrote in our last notebook. And we'll create a little data loaders function here, which is going to go through each item in the dataset dictionary and create a data loader for it, and give us a dictionary of data loaders. Okay. So now we've got a data loader for training and a data loader for validation. So we can grab the X and Y batch by just calling next on that iterator, as we've done before. Let's look at each of these in turn, actually. We've done all this before, but it was a couple of weeks ago. So just to remind you, we can get the names of the features. And so we can then create an itemgetter for our Ys. We'll call that the label getter. We can apply that to our labels to get the titles of everything in our mini batch. And we can then call our show_images that we created with that mini batch, with those titles. And here we have our Fashion MNIST mini batch. Okay, so let's create a classifier. And we're just going to use exactly the same code, copied and pasted from the previous notebook. So here is our sequential model, and we are going to grab the parameters of the CNN, and the CNN I've actually moved over to the device. The default device was what we created in our last notebook. And as you can see, it's fitting. Now our first problem is it's fitting very slowly, which is kind of annoying. So why is it running so slowly? Let's have a look at our dataset. So when it's finally finished, let's take a look at an item from the dataset. Actually, let's go all the way back to the dataset dictionary. So before it gets transformed, the dataset dictionary. And let's grab the training part of that, and let's grab one item. And actually we can see here the problem.
For MNIST, we had all of the data loaded into memory in a single big tensor. But this Hugging Face one is created in a much more normal way, which is that each image is a totally separate PNG image. It's not all pre-converted into a single thing. Why is that a problem? Well, the reason it's a problem is that our data loader is spending all of its time decoding these PNGs. So while I'm training, I can type htop, and you can see that basically my CPU is 100% used. Now that's a bit weird, because I've actually got 64 CPUs. Why is it using just one of them? That's the first problem. But why does it matter that it's using 100% CPU? Well, let's run it again so you can see. Why does it matter that our CPU is at 100%, and why is it making it so slow? Well, the reason why is, if we look at nvidia-smi, that will monitor our GPU's utilization. I've got three GPUs, so I say to choose just the zero index one. And you'll see this column here, SM. This stands for streaming multiprocessor. It's the equivalent of CPU usage. And generally we're only using about 1% of our one GPU. So no wonder it's so slow. So the first thing we wanna do then is try to make things faster. Now to make things faster, we wanna be using more than one CPU to decode our PNGs. And as it turns out, that's actually pretty easy to do. You just have to add an extra argument to your data loaders, which is here, num_workers. And so I can say use eight CPUs, for example. Now if I recreate the data loaders and then try to get the next batch, uh-oh, now I've got an error. And the error is rather quirky. And what it's saying is, oh, you're now trying to use multiple processes. And generally in Python and in PyTorch, when you use multiple processes, things start to get complicated. And one of the things that absolutely just doesn't work is you can't actually have your data loader put things onto the GPU in your separate processes. It just doesn't work.
So the reason for this error is actually the fact that we used a collate function that put things on the device. That's incompatible, unfortunately, with using multiple workers. So that's a problem. And the answer to that problem, sadly, is that we would have to actually rewrite our fit function entirely. So there's annoying thing number one. And we don't want to be rewriting our fit function again and again. We want to have a single fit function. So, okay, there's a problem that we're gonna have to think about. Problem number two is that this is not very accurate: 87%. Well, I mean, is it accurate? It's easy enough to find out. There's a really nice website called Papers with Code, and it will show you a little leaderboard. And we can see whether we're any good. And the answer is we're not very good at all. So these papers had 96%, 94%, 92%. So, yeah, we're not looking great. So how do we improve that? There's a lot of things we could try, but pretty much all of them are going to involve modifying our fit function again, and in really complicated ways. So we've still got a bit of an issue there. Let's put that aside, because what we actually wanted to do is create an autoencoder. So to remind you about what an autoencoder is, and we're gonna be able to go into a bit more detail now: we're gonna start with our input image, which is gonna be 28 by 28. So it's a number three, right? And it's 28 by 28. And we're gonna put it through, for example, a stride-2 conv. And that's going to have an output of 14 by 14. And we can have more channels. So this is 28 by 28 by 1; let's do 14 by 14 by 2. So we've reduced the height and width by two but added an extra channel. So overall, this is a 2x decrease in the number of activations. And then we could do another stride-2 conv, and that would give us a seven by seven. And again, we can choose however many channels we want, but let's say we choose four.
So now, compared to our original, we've got a times four reduction. And so we could do that a few times, or we could just stay there. And so this is compressing. And so then what we could do is somehow have a convolutional layer, or group of layers, which does a convolution and also increases the size. There is actually something called a transposed convolution, which I'll leave you to look up if you're interested, which can do that. It's also known, rather weirdly, as a stride one-half convolution. But there's actually a really simple way to do this, which is to say, let's say you've got a bunch of pixels. Let's say we've got a three by three of pixels that looks like this: one, zero, one, one, say. We could make that into a six by six very easily, which is we could simply copy that pixel there into the first four, copy that pixel there into these four, and then copy this pixel here into these four. And so we're simply turning each pixel into four pixels. And so this is called nearest neighbor upsampling. Now that's not a convolution, that's just copying, but what we could then do is apply a stride-1 convolution to that. And that would allow us to double the grid size with a convolution, and that's what we're gonna do. So our autoencoder is gonna need a deconvolutional layer, and that's gonna contain two layers: nearest neighbor upsampling with a scale factor of two, followed by a Conv2d with a stride of one. Okay, and you can see for padding I just put kernel size slash slash two, so that's a truncating division, because that always works for any odd sized kernel. As before, we will have an optional activation function, and then we will create a sequential using star layers. So that's gonna pass in each layer as a separate argument, which is what sequential expects. Okay, so let's write a new fit function.
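The deconvolutional layer described above can be sketched roughly as follows. The function name `deconv` and its defaults are assumptions for illustration, and the ReLU is an assumed choice of activation; the lecture only says the activation is optional.

```python
import torch
from torch import nn

def deconv(ni, nf, ks=3, act=True):
    # Nearest neighbor upsampling doubles the grid; the stride-1 conv keeps that size
    # because padding = ks // 2 preserves height and width for any odd kernel size.
    layers = [
        nn.UpsamplingNearest2d(scale_factor=2),
        nn.Conv2d(ni, nf, kernel_size=ks, stride=1, padding=ks // 2),
    ]
    if act:
        layers.append(nn.ReLU())  # assumed activation
    return nn.Sequential(*layers)

# A tiny 2x2 "image" so the copying is easy to see: each pixel becomes a 2x2 block.
x = torch.tensor([[1.0, 0.0], [1.0, 1.0]]).reshape(1, 1, 2, 2)
up = nn.UpsamplingNearest2d(scale_factor=2)(x)
print(up[0, 0])

# The full layer doubles height and width and can change the number of channels.
y = deconv(1, 4)(x)
print(y.shape)
```

`nn.Upsample(scale_factor=2, mode="nearest")` should behave the same way for this purpose. The upsampling step has no parameters, so all the learning happens in the convolution that follows it.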
It goes through each epoch; I've basically just copied it over from our previous one, but I've pulled out eval into a separate function. It's basically doing the same thing. Okay, so here is our autoencoder. It's a bit tricky, because I wanted to go down three times to get to a four by four by eight, but starting at 28 by 28 you can't divide that by two three times and get an integer. So what I first do is I zero pad, so add padding of two on each side, to get a 32 by 32 input. So if I then do a conv with a two channel output, that gives us 16 by 16 by two, and then again to get an eight by eight by four, and then again to get a four by four by eight. So this is doing an 8x compression. And then we can call deconv to do exactly the same thing in reverse, the final one with no activation, and then we can truncate off those two pixels at the edge. Slightly surprisingly, in PyTorch you pass negative two to ZeroPad2d to crop off the final two pixels. And then we'll add a sigmoid, which will force everything to go between zero and one, which of course is what we need. And then we will use MSE loss to compare those pixels to our input pixels. And so a big difference we've got here now is that our loss function is being applied to the output of the model and the input itself. We don't have yb here, we have xb. So we're trying to recreate our original, and again this is a bit annoying, that we have to create our own fit function. Anyway, so we can now see what the MSE loss is, and it's not gonna be particularly human readable, but it's a number we can watch to see if it goes down. And so then we can do our SGD with the parameters of our autoencoder, with MSE loss, and call that fit function we just wrote. And I won't wait for it to run, because as you can see it's really slow, for reasons we've discussed. I've run it before, and what we want is to see that the original, which is here, gets recreated. And the answer is, oh, not really.
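The padding and cropping step mentioned above can be checked in isolation. This is a small sketch rather than the notebook's model: it only shows that `nn.ZeroPad2d` accepts a negative value to crop, and traces the grid sizes for three 3x3 stride-2 convs with padding 1, which is an assumed kernel configuration.

```python
import torch
from torch import nn

pad = nn.ZeroPad2d(2)    # add two pixels of zeros on every side: 28x28 -> 32x32
crop = nn.ZeroPad2d(-2)  # negative padding removes two pixels from every side

xb = torch.rand(16, 1, 28, 28)
padded = pad(xb)
restored = crop(padded)
print(padded.shape, restored.shape)

def conv_out(size, ks=3, stride=2, padding=1):
    # Standard output-size formula for a convolution along one dimension.
    return (size + 2 * padding - ks) // stride + 1

# Starting from 32, three stride-2 convs give whole-number halvings.
sizes = [32]
for _ in range(3):
    sizes.append(conv_out(sizes[-1]))
print(sizes)
```

Starting from 28 instead, the same three convs give 14, 7 and then 4. Doubling back up from 4 three times gives 32, not 28, which is why the model pads on the way in and crops on the way out.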
I mean, they're roughly the same things, but there's no point having an autoencoder which can't even recreate the originals. The idea would be that if these looked almost identical to these, then we'd say, wow, this is a fantastic network, compressing things by eight times. So I found this very fiddly to try and get to work at all. Something that I discovered can get it to start training is to start with a really low learning rate for a few epochs and then increase the learning rate after a few epochs. I mean, at least it gets it to train and show something vaguely sensible, but, let's see, yeah, it still looks pretty crummy. This one here I actually got by switching to Adam, and I removed the tricky bit. I removed these two as well. But yeah, I couldn't get this to recreate anything very reasonable in any reasonable amount of time. And, you know, why is this not working very well? There's so many reasons it could be. You know, do we need a better optimizer? Do we need a better architecture? Do we need to use a variational autoencoder? There's a thousand things we could try, but doing it like this is gonna drive us crazy. We need to be able to really rapidly try things, and all kinds of different things. And so what I often see, you know, in projects or on Kaggle or whatever, is people's code looks kind of like this. It's all manual. And then their iteration speed is too slow. So we're not gonna keep doing stuff manually anymore. This is where we take a halt and we say, okay, let's build up a framework that we can use to rapidly try things and to understand when things are working and when things aren't working. So we're gonna start creating a learner. So what is a learner? Basically, the idea is this learner is gonna be something that we build which will allow us to try anything that we can imagine very quickly.
And we will build, on top of that learner, things that will allow us to introspect what's going on inside a model, and will allow us to do multi-process CUDA to go fast. It will allow us to add things like data augmentation. It will allow us to try a wide variety of architectures quickly, and so forth. So that's gonna be the idea. And of course, we're gonna create it from scratch. And so let's start with Fashion MNIST as before. And let's create a DataLoaders class, which is gonna look a bit like what we had before, where we're just going to pass in, this couldn't be simpler, right? We're just gonna pass in two data loaders and store them away. And I'm gonna create a class method for building it from a dataset dictionary. And what that's gonna do is call DataLoader on each of the dataset dictionary items with our batch size and instantiate our class. So if you haven't seen a class method before, it's what allows us to say DataLoaders dot something in order to construct this. We could have put this in __init__ just as well, but we'll be building more complex data loaders things later. So I thought we might start by getting the basic structure right. So this is all pretty much the same as what we've had before. I'm not doing anything on the device here because, as we know, that didn't really work. Okay. Oh, this is an old thing. I don't need to_cpu anymore. So we're gonna use to_device, which I think came from, there we go. So here's an example of a very simple learner that fits on one screen. And this is basically gonna replace our fit function. So a learner is gonna be something that is going to train, or learn, a particular model using a particular set of data loaders, a particular loss function, some particular learning rate and some particular optimizer, or some particular optimization function. Now, normally most people would store each of these away separately by writing self.model equals model, blah, blah, blah.
And as I think we've talked about before, that's a huge amount of boilerplate. It's more stuff that you can get wrong, and it's more stuff that you have to read to understand the code, and I don't like that kind of repetition. So instead, we just call fastcore's store_attr to do that all in one line. Okay, so the basic idea with a class is to think about what information it's gonna need. So you pass that all to the constructor and store it away. And then in our fit function, we've got the basic stuff that we have for keeping track of accuracy. So this will only work for classification, where we can use accuracy. Put the model on our device, create the optimizer, store how many epochs we're going through. Then for each epoch, we'll call the one epoch function. And in the one epoch function, we're gonna either do training or evaluation. So we pass in true if we're training and false if we're evaluating. And they're basically almost the same. We set the model to training mode or not. We then decide whether to use the validation set or the training set based on whether we're training. And then we go through each batch in the data loader and call one batch. And one batch is then the thing which is going to put our batch onto the device, call our model, and call our loss function. And then if we're training, do our backward step, our optimizer step, and our zero grad. And then finally calculate our metrics, or our stats. And so here's where we calculate our metrics. So that's basically what we have there. So let's go back to using an MLP. We call fit and away it goes. There's an error here, pointed out by Kevin. Thank you: self.model.to. One thing I guess we could try now is we think that maybe we can use more than one process. So let's try that. Oh, it's so fast, I didn't even see it. There it goes. You can see all four CPUs being used at once. Bang, it's done. Okay, so that's pretty great.
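The fit, one epoch, one batch structure described above can be sketched as below. This is a condensed illustration under several assumptions, not the notebook's learner: the class name `SimpleLearner` is hypothetical, attributes are stored by hand rather than with `store_attr`, everything stays on the CPU, and the data is a synthetic two-class problem made up for the example.

```python
import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class SimpleLearner:
    def __init__(self, model, train_dl, valid_dl, loss_func, lr, opt_func=optim.SGD):
        self.model, self.train_dl, self.valid_dl = model, train_dl, valid_dl
        self.loss_func, self.lr, self.opt_func = loss_func, lr, opt_func

    def one_batch(self, xb, yb, train):
        preds = self.model(xb)
        loss = self.loss_func(preds, yb)
        if train:
            loss.backward()       # backward step
            self.opt.step()       # optimizer step
            self.opt.zero_grad()  # zero the gradients for the next batch
        n = len(xb)
        # Keep sums so batches of different sizes are weighted correctly.
        self.losses.append(loss.item() * n)
        self.accs.append((preds.argmax(dim=1) == yb).float().sum().item())
        self.ns.append(n)

    def one_epoch(self, train):
        self.model.train(train)  # training mode or evaluation mode
        dl = self.train_dl if train else self.valid_dl
        self.losses, self.accs, self.ns = [], [], []
        for xb, yb in dl:
            self.one_batch(xb, yb, train)
        n = sum(self.ns)
        return sum(self.losses) / n, sum(self.accs) / n

    def fit(self, n_epochs):
        self.opt = self.opt_func(self.model.parameters(), lr=self.lr)
        history = []
        for _ in range(n_epochs):
            self.one_epoch(True)
            with torch.no_grad():
                history.append(self.one_epoch(False))  # (valid loss, valid accuracy)
        return history

# Synthetic data, for illustration only: the class depends on the sign of feature 0.
torch.manual_seed(0)
x = torch.randn(256, 4)
y = (x[:, 0] > 0).long()
train_dl = DataLoader(TensorDataset(x[:192], y[:192]), batch_size=32, shuffle=True)
valid_dl = DataLoader(TensorDataset(x[192:], y[192:]), batch_size=64)

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
learn = SimpleLearner(model, train_dl, valid_dl, F.cross_entropy, lr=0.2)
history = learn.fit(3)
for epoch, (loss, acc) in enumerate(history):
    print(epoch, round(loss, 3), round(acc, 3))
```

Because the batch is only handled inside `one_batch`, the data loaders themselves never touch the device. In a version that moves each batch to the GPU at that point, the data loaders can use worker processes.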
Let's see how fast it looks here. All right, lovely. Okay, so that's a good sign. We've got a learner that can fit things, but it's not very flexible. It's not gonna help us, for example, with our autoencoder, because there's no way of changing which things are used for predicting with or for calculating with. We can't use it for anything except things that involve accuracy with binary classification. Sorry, multi-class classification. It's not flexible at all, but it's a start. And so I wanted to basically put this all on one screen so you can see what the basic learner looks like. All right, so how do we do things other than multi-class accuracy? I decided to create a metric class, and basically a metric class is something where we are going to define subclasses of it that calculate particular metrics. So for example, here I've got a subclass of Metric called Accuracy. So if you haven't done subclasses before, you can basically think of this as saying, please copy and paste all the code from here into here for me, but the bit that says def calc, replace it with this version. So in fact, this would be identical to copying and pasting this whole thing, typing Accuracy here, and replacing the definition of calc with that. That's what is happening here when we do subclassing. So it's basically copying and pasting all that code in there for us. It's actually more powerful than that. There's more we can do with it, but in this case, this is all that's happening with this subclassing. And this is called, actually I'll leave that, that's fine. Okay, so the accuracy metric is here, and then this is our really basic metric, which we're gonna use just for loss. And so what happens is, let's for example create an accuracy metric object. We're basically gonna add in mini batches of data. So for example, here's a mini batch of inputs and predictions. Here's another mini batch of inputs and predictions.
And then we're gonna call dot value, and it will calculate the accuracy. Now dot value is a neat little thing. It doesn't require parentheses after it, because it's called a property. And so a property is something that just calculates automatically without having to put parentheses. That's all a property is, well, a property getter anyway. And so they look like this: you give it a name. And so each time we call add, we are going to be storing that input and that target, and also the number of items in the mini batch, optionally. For now that's just always gonna be one. And you can see here that we then call dot calc, which is gonna call the accuracy calc. So it just sees how often they're equal. And then we're going to append that calculation to the list of values. And we're also gonna append to the list of ns, in this case just one. And so then to calculate the value, we just do that. So that's all that's happening for accuracy. And then for loss, we can just use Metric directly, because Metric directly will just calculate the average of whatever it's passed. So we can say, oh, add the number 0.6. So the target's optional. And we're saying this is a mini batch of size 32, so that's gonna be the n. And then add the value 0.9 with a mini batch size of two, and then get the value. And as you can see, that's exactly the same as the weighted average of 0.6 and 0.9 with weights of 32 and two. So we've created a metric class. And so that's something that we can use to create any metric we like just by overriding calc. Or we could create totally new things from scratch, as long as they have an add and a value. Okay, so we're now going to change our learner. And what we're gonna do is keep the same basic structure. So there's gonna be fit. It's gonna go through each epoch. It's gonna call one epoch, passing in true and false for training and validation. One epoch is going to go through each batch in the data loader and call one batch.
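The metric class described above can be sketched in plain Python as follows. This is an illustrative stand-in that uses lists rather than tensors; the class and method names follow the lecture's description (`add`, `value`, `calc`), while the example batches are made up.

```python
class Metric:
    def __init__(self):
        self.reset()

    def reset(self):
        self.vals, self.ns = [], []

    def add(self, inp, targ=None, n=1):
        # Store the per-batch result and the weight (batch size) it should carry.
        self.last = self.calc(inp, targ)
        self.vals.append(self.last)
        self.ns.append(n)

    @property
    def value(self):
        # Weighted average of the stored results; read as metric.value, no parentheses.
        return sum(v * n for v, n in zip(self.vals, self.ns)) / sum(self.ns)

    def calc(self, inps, targs):
        # The base class just passes the input through, which suits a loss value.
        return inps

class Accuracy(Metric):
    def calc(self, inps, targs):
        # Fraction of positions where prediction and target agree.
        return sum(i == t for i, t in zip(inps, targs)) / len(inps)

acc = Accuracy()
acc.add([0, 1, 2, 0, 1, 2], [0, 1, 1, 2, 1, 0])  # 3 of 6 correct
acc.add([1, 1, 2, 0, 1], [0, 1, 1, 2, 1])        # 2 of 5 correct
print(acc.value)

loss = Metric()
loss.add(0.6, n=32)
loss.add(0.9, n=2)
print(loss.value, (0.6 * 32 + 0.9 * 2) / (32 + 2))
```

Only `calc` changes between the two classes, which is the point of the subclassing explanation. Anything else that offers `add` and `value` could be used in the same place.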
One batch is going to do the prediction and get the loss. And if it's training, it's gonna do the backward step and zero grad. But there's a few other things going on. So let's take a look. Well, actually, let's just look at it in use first. So when we use it, we're gonna be creating a learner with the model, data loaders, loss function, learning rate, and some callbacks, which we'll learn about in a moment. And we call fit and it's gonna do our thing. And look, we're gonna have charts and stuff. All right, so the basic idea is gonna look very similar. So when we construct it, we're gonna be passing in exactly the same things as before, but we've got one extra thing, callbacks, which we'll see in a moment. We store the attributes as before. And we're gonna be doing some stuff with the callbacks. So when we call fit for this number of epochs, we're gonna store away how many epochs we're gonna do. We're also gonna store away the actual range that we're going to loop through as self.epochs. So here's that looping through self.epochs. We're gonna create the optimizer using the optimizer function and the parameters. And then we're gonna call _fit. Now, what on earth is _fit? Why didn't we just copy and paste this into here? Why do this? It's because we've created this special decorator, with callbacks. What does that do? So it's up here: with callbacks. With callbacks is a class. It's gonna just store one thing, which is the name. In this case, the name is fit. And what it's gonna do is, now this is the decorator, right? So when we call it, remember, decorators get passed a function. So it's gonna get passed this whole function. And that's gonna be called f, right? So dunder call, remember, is what happens when an object is treated as if it's a function. So it's gonna get passed this function. So this function is _fit. And so what we wanna do is return a different function.
It's going to, of course, call the function that we were asked to call, using the arguments and keyword arguments we were asked to use. But before it calls that function, it's going to call a special method called callback, passing in the string before_ plus the name, in this case before_fit. After it's completed, it's gonna call that method called callback again, passing in the string after_fit. And it's gonna wrap the whole thing in a try/except block. And it's going to be looking for an exception called CancelFitException. And if it gets one, it's not gonna complain. So let me explain what's going on with all of those things. Let's look at an example of a callback. So for example, here is a callback called DeviceCB, device callback. And before_fit will be called automatically before that _fit method is called. And it's going to put the model onto our device, CUDA or MPS if we have one, otherwise it'll just be on CPU. So what's gonna happen here? So we're gonna call fit. It's gonna go through these lines of code. It's then gonna call _fit. _fit is not this function. _fit is this function, where f is this function. So it's going to call our learner's callback method, passing in before_fit. And callback is defined here. What's callback gonna do? It's gonna be passed the string before_fit. It's going to then go through each of our callbacks, sorted based on their order. And you can see here, our callbacks can have an order. And it's going to look at that callback and try to get an attribute called before_fit. And it will find one. And so then it's going to call that method. Now if that method doesn't exist, if it doesn't appear at all, then getattr will return this instead. Identity is a function just here. This is an identity function. All it does is, whatever arguments it gets passed, it returns them. And if it's not passed any arguments, it just returns.
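The decorator and callback dispatch described above can be sketched in plain Python. This is a simplified illustration under assumptions, not the course's implementation: the names `with_callbacks`, `TinyLearner`, `RecordCB` and `StopCB` are hypothetical, and only the fit exception is handled.

```python
from operator import attrgetter

class CancelFitException(Exception):
    pass

def identity(*args):
    # Returns whatever it is given; with no arguments it returns None.
    if not args:
        return None
    x, *rest = args
    return (x, *rest) if rest else x

class with_callbacks:
    def __init__(self, nm):
        self.nm = nm  # e.g. "fit"

    def __call__(self, f):
        # Called once, at decoration time, with the function being wrapped.
        def _f(o, *args, **kwargs):
            try:
                o.callback(f"before_{self.nm}")
                f(o, *args, **kwargs)
                o.callback(f"after_{self.nm}")
            except CancelFitException:
                pass  # a callback asked to stop; do not complain
        return _f

class Callback:
    order = 0

events = []

class RecordCB(Callback):
    order = 1
    def before_fit(self): events.append("before_fit")
    def after_fit(self): events.append("after_fit")

class StopCB(Callback):
    order = 0  # lower order runs first
    def before_fit(self): raise CancelFitException()
    # No after_fit here: getattr falls back to identity for missing methods.

class TinyLearner:
    def __init__(self, cbs):
        self.cbs = cbs

    def callback(self, method_nm):
        for cb in sorted(self.cbs, key=attrgetter("order")):
            getattr(cb, method_nm, identity)()

    @with_callbacks("fit")
    def _fit(self):
        events.append("fit body")

    def fit(self):
        self._fit()

TinyLearner([RecordCB()]).fit()
print(events)

events.clear()
TinyLearner([RecordCB(), StopCB()]).fit()
print(events)
```

In the second run `StopCB` sorts first and raises before anything is recorded, so the body of `_fit` never executes and no error reaches the caller.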
So there's a lot of Python going on here. And that is why we did that foundations lesson. And so for people who haven't done a lot of this Python, there's gonna be a lot of stuff to experiment with and learn about. And so do ask on the forums if any of these bits get confusing, but the best way to learn about these things is to open up this Jupyter Notebook and try to create really simple versions of things, right? So for example, let's try identity. How exactly does identity work? I can call it with nothing and I get nothing back. I can call it with one, and I get back one. I could call it with 'a', and I get back 'a'. I could call it with 'a' and one, and get back 'a' and one. And how is it doing that exactly? So remember, we can add a breakpoint, and this would be a great time to really test your debugging skills. Okay, so remember in our debugger, we can hit h to find out what the commands are, but you really should do a tutorial on the debugger if you're not familiar with it. And then we can step through each one. So I can now print args. And there's actually a trick which I like, which is that args is actually a command, funnily enough, which will just tell you the arguments to any function regardless of what they're called, which is kind of nice. And so then we can step through by pressing n, and after this, we can check, okay, what is x now? And what is args now, right? So remember to really experiment with these things. So anyway, we're going to talk about this a lot more in the next lesson. But before that, if you're not familiar with try/except blocks, you know, spend some time practicing them. If you're not familiar with decorators, well, we've seen them before, so go back and look at them again really carefully. If you're not familiar with the debugger, practice with that. If you haven't spent much time with getattr, remind yourself about that. So try to get yourself really familiar and comfortable, as much as possible, with the pieces.
Because if you're not comfortable with the pieces, then the way we put the pieces together is going to be confusing. There's actually something in the theory of education called cognitive load theory. And basically, cognitive load theory says that if you're trying to learn something, but your cognitive load is really high because of lots of other things going on at the same time, you're not going to learn it. So it's going to be hard for you to learn this framework that we're building if you have too much cognitive load from things like, what the hell's a decorator, or what the hell's getattr, or what does sorted do, or what's partial, all these things. Now, I actually spent quite a bit of time trying to make this as simple as possible, but also as flexible as it needs to be for the rest of the course. And this is as simple as I could get it. So these are things that you actually do have to learn. But in doing so, you're going to be able to write some really powerful and general code yourself. So hopefully you'll find this a really valuable and mind-expanding exercise in bringing high-level software engineering skills to your data science work. Okay, so with that, this looks like a good place to leave it, and I look forward to seeing you next time. Bye.