Hi everybody, and welcome back. This is lesson 10 of Practical Deep Learning for Coders. It's the second lesson in part two, which is where we're going from deep learning foundations to Stable Diffusion. Before we dive back into our notebook, let's first take a look at some of the interesting work that students in the course have done over the last week. I'm just going to show a small sample of what's on the forum, so check out the "Share your work here" thread on the forum for many, many more examples. So Peru did something interesting, which is to create a bunch of images by doing a linear interpolation (in detail it's actually spherical linear interpolation, but it doesn't matter) between two different latent noise starting points for an auto picture, and then showed all the intermediate results. That came out pretty nice. And then did something similar starting with an old car prompt and going to a modern Ferrari prompt. I can't remember exactly what the prompts were, but you can see how, as it goes through that latent space, it's actually changing the image that's coming out. I think that's really cool. And then I love the way Namrata took that to another level, in a way, by starting with a dinosaur and turning it into a bird. And this is a very cool intermediate picture of one of the steps along the way: the dino bird. I love it. Dino chick. Fantastic. So much creativity on the forums. I love this. John Richmond took his daughter's dog and turned it gradually into a unicorn. And I thought this one along the way actually came out very, very nicely. I think this is adorable. And I suspect that John has won the dad of the year, or maybe dad of the week, award this week for this fantastic project. And Maureen did something very interesting: she took Jono's parrot image from his lesson and tried bringing it across to various different painters' styles.
And so her question was, anyone want to guess the artists in the prompts? So I'm just going to let you pause it before I move on if you want to try to guess. And there they are. Most of them pretty obvious, I guess. I think it's so funny that Frida Kahlo appears in all of her paintings, so the parrot's actually turned into Frida Kahlo. All right, not all of her paintings, but all of her famous ones. So the very idea of a Frida Kahlo painting without her in it is so unheard of that the parrot's turned into Frida Kahlo. And I like this Jackson Pollock. It's still got the parrot going on there. So that's a really lovely one, Maureen. Thank you. And this is a good reminder to make sure that you check out the other two lesson videos. She was working with Jono's Stable Diffusion lesson, so be sure to check that out if you haven't yet. It is available on the course webpage and on the forums and has lots of cool stuff that you can work with, including this parrot. And then the other one to remind you about is the video that Waseem and Tanishq did on the math of diffusion. And I do want to read out what Alex said about this, because I'm sure a number of you feel the same way: my first reaction on seeing something with the title Math of Diffusion was to assume that, oh, that's just something for all the smart people who have PhDs in maths on the course, and it'll probably be completely incomprehensible. But of course, it's not that at all. So be sure to check this out even if you don't think of yourself as a math person. I think it's some nice background that you may find useful. It's certainly not necessary, but I think it's useful to start to dig into some of the math at this point. One particularly interesting project that's been happening during the week is from Jason Antic, who is a bit of a legend around here.
Many of you will remember him as being the guy that created DeOldify and actually worked closely with us on our research, which together turned into NoGAN and Decrappify and other things, and created lots of papers. And Jason has kindly joined our little research team working on the stuff for these lessons and on developing a kind of fastai approach to Stable Diffusion. And he took the idea that I raised last week, which is that maybe we should be using classic optimizers rather than differential equation solvers. And he actually made it work incredibly well already, within a week. These faces were generated on a single GPU in a few hours from scratch by using classic deep learning optimizers, which is an unheard-of speed to get this quality of image. And we think that this research direction is looking extremely promising. So really great news there. And thank you, Jason, for this fantastic progress. So maybe we'll do a quick reminder of what we looked at last week. Last week I used a bit of a mega OneNote hand-drawn thing. I thought this week I might just turn it into some slides that we can use. So the basic idea, if you remember, is that if we're doing handwritten digits, for example, we'd start with a number seven. This would be one of the ones with a stroke through it that some countries use. And then we add some noise to it. And the seven plus the noise together would equal this noisy seven. What we then do is present this noisy seven as an input to a U-Net, and we have it try to predict which pixels are noise, basically; that is, predict the noise. So the U-Net tries to predict the noise from the noisy number. It then compares its prediction to the actual noise, and that gives it a loss, which it can use to update the weights in the U-Net. And that's basically how the main bit of Stable Diffusion, the U-Net, is created.
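As a rough illustration of the training setup just reviewed, here is a minimal sketch in plain Python. It only shows how one training example and its loss are formed; the names `add_noise`, `training_loss`, `zero_model` and `oracle_model` are made up for this sketch, the "image" is a tiny list of numbers standing in for pixels, and a real U-Net would be a trained neural network updated by gradient descent rather than either of these stand-in functions.

```python
import random


def add_noise(image, noise):
    """Noisy image = clean image + noise, pixel by pixel."""
    return [p + n for p, n in zip(image, noise)]


def mse(pred, target):
    """Mean squared error between two equally sized lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def training_loss(model, image, rng):
    """Loss for one example: how well does the model predict the added noise?"""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = add_noise(image, noise)
    predicted_noise = model(noisy)
    return mse(predicted_noise, noise)


# Hypothetical stand-in for the pixels of a handwritten seven.
seven = [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]


def zero_model(noisy):
    """A useless model that always predicts 'no noise'."""
    return [0.0] * len(noisy)


def oracle_model(noisy):
    """A cheating model that knows the clean image, so it recovers the noise."""
    return [x - p for x, p in zip(noisy, seven)]


print(training_loss(zero_model, seven, random.Random(0)))
print(training_loss(oracle_model, seven, random.Random(0)))
```

The oracle gets a loss of essentially zero and the zero model does not, which is the signal training would use to push a real model's weights toward better noise predictions.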
To make it easier for the U-Net, we can also pass in an embedding of the actual digit, the actual number seven. So for example, a one-hot encoded vector, which goes through an embedding layer. And the nice thing about that, to remind you, is that if we do this, then later on we can actually generate specific digits by saying I want a number seven or I want a number five, and it knows what they look like. I've skipped over the VAE latents piece here, which we talked about last week. To remind you, that's just a computational shortcut. It makes it faster. So we don't need to include that in this picture, because it just means we can preprocess things into that latent space with the VAE first, if we wish. So that's what the U-Net does. Now, to remind you, we want to handle things that are more interesting than just the number seven. We want to handle things where we can say, for example, "a graceful swan" or "a scene from Hitchcock". And the way we do that is we turn these sentences into embeddings as well. We do that by trying to create embeddings of these sentences that are as similar as possible to the embeddings of the photos or images that they are connected with. And to remind you, the way that was done originally, as part of this thing called CLIP, was to download from the internet lots of examples of images and find their alt tags. So for each one, we then have the image and its alt tag. So here's the graceful swan and its alt tag. And then we build two models: an image encoder that turns each image into some feature vector, and a text encoder that turns each piece of text into a bunch of features.
And then we create a loss function that says that the features for the text "a graceful swan" should be as close as possible to the features for the picture of a graceful swan, and specifically we take the dot product. And then we add up all the green ones, because these are the ones that we want to match, and we subtract all the red ones, because those are the ones we don't want to match. These are where the text doesn't match the image. And so that's the contrastive loss, which gives us the CL in CLIP. So that's a review of some stuff we did last week. With this, we now have a text encoder, to which we can say "a graceful swan" and it will spit out some embeddings. And those are the embeddings that we can feed into our U-Net during training. Now, we haven't been doing any of that training ourselves, except for some fine-tuning, because it takes a very long time on a lot of computers. Instead we take pre-trained models and do inference. And the way we do inference is we put in an example of the thing that we want, that we have an embedding for. So let's say we're doing handwritten digits, and we put some random noise into the U-Net. Then it spits out a prediction of which bits of noise you could remove to leave behind a picture of the number three. Initially, it's going to do quite a bad job of that. So we subtract just a little bit of that noise from the image to make it a little bit less noisy, and we do it again. And we do it a bunch of times. So here's what that looks like. I think somebody here did a smiling picture of Jeremy Howard or something, if I remember correctly. And if we print out the noise at step zero and at step six and at step 12, you can see the first signs of a face starting to appear. Definitely a face appearing here at 18 and 24. By step 30, it's looking much more like a face. By 42, it's getting there. It's just got a few little blemishes to fix up. And here we are.
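The inference loop reviewed above (start from noise, predict the noise, subtract a little of it, repeat) can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: `denoise` and `toy_predictor` are invented names, the "image" is a three-number list, the predictor cheats by knowing the target, and a fixed `step_size` replaces the per-step noise schedule that a real sampler uses.

```python
import random


def denoise(predict_noise, latents, num_steps, step_size):
    """Repeatedly remove a small fraction of the predicted noise."""
    for step in range(num_steps):
        noise_pred = predict_noise(latents, step)
        latents = [x - step_size * n for x, n in zip(latents, noise_pred)]
    return latents


# Hypothetical clean "image" that the toy predictor is steering toward.
target = [0.0, 1.0, 0.5]


def toy_predictor(latents, step):
    """Stand-in for a trained U-Net: treats everything that is not the target as noise."""
    return [x - t for x, t in zip(latents, target)]


rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in target]
result = denoise(toy_predictor, start, num_steps=60, step_size=0.1)
print(start)
print(result)
```

Each pass removes a tenth of the remaining error, so after 60 passes the result sits very close to the target, which mirrors how the face gradually emerges from the noise over the steps shown.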
I think I've slightly messed up my indexes here, because it should finish at 60, not 54, but such is life. Rather rosy red lips too, I would have to say. So remember, in the early days this took a thousand steps, and now there are some shortcuts to make it take 60 steps. And this is what the process looks like. The reason this doesn't look like normal noise is that now we are actually doing the VAE latents thing. And noisy latents don't look like Gaussian noise. They look like, well, they look like this. This is what happens when you decode those noisy latents. Now you might remember last week I complained that things are moving too quickly, and there were a couple of papers that had come out the day before and made everything entirely out of date. So Jono and I and the team have actually now had time to read those papers. And I thought now would be a good time to start going through some papers for the first time. So what we're actually going to do is show how these papers have taken the required number of steps to go through this process down from 60 steps to four steps, which is pretty amazing. So let's talk about that. The paper specifically is this one: Progressive Distillation for Fast Sampling of Diffusion Models. It's only been a week, so I haven't had much of a chance to try to explain this before. So apologies in advance if this is awkward, but hopefully it's going to make some sense. We're going to start with this process, which is gradually denoising images. And actually I wonder if we can copy it. Okay, so how are we going to get this down from 60 steps to four steps? The basic idea is that we're going to do a process called distillation, which I have no idea how to spell, but hopefully that's close enough that you get the idea. Distillation is a process which is pretty common in deep learning.
And the basic idea of distillation is that you take something called a teacher network, which is some neural network that already knows how to do something, but it might be slow and big. And the teacher network is then used by a student network, which tries to learn how to do the same thing, but faster or with less memory. In this case, we want ours to be faster. We want to do fewer steps. And the way we can do this conceptually is actually, in my opinion, reasonably straightforward. When I look at this, I think, wow, neural nets are really amazing. So given neural nets are really amazing, why is it taking 18 steps to go from there to there? That seems like something that you should be able to do in one step. The fact that it's taking 18 steps, and originally, of course, that was hundreds and hundreds of steps, is just a side effect of the math of how this thing was originally developed, this idea of a diffusion process. But the idea in this paper is something that I think I might have even mentioned in the last lesson. It's something we were thinking of doing ourselves before this paper beat us to it, which is to say: what if we train a new model, where the model takes this image as input and puts it through some other U-Net, B. And then that spits out some result. And what we do is we take that result and we compare it to this image, the thing we actually want. Because the nice thing now, which we've never really had before, is that for each intermediate output we have the desired goal that we're trying to get to. And so we could compare those two just using, whatever, mean squared error (I keep on forgetting to change my pen), mean squared error.
And so then if we keep doing this for lots and lots of images and lots and lots of pairs in exactly this way, this U-Net is hopefully going to learn to take these incomplete images and turn them into complete images. And that is exactly what this paper does. It just says, okay, now that we've got all these examples showing what step 36 should turn into at step 54, let's just feed those examples into a model. And that works. And you'd kind of expect it to work, because you can see that a human would be able to look at this, and if they were a competent artist, they could turn that into a well-finished product. So you would expect that a computer could as well. There are some little tweaks around how it makes this work, which I will briefly describe, because we need to be able to go from step one through to step 10 through to step 20 and so forth. And the way that it does this is actually quite clever. They take their teacher model. Remember, the teacher model is one that has already been trained. So the teacher model already is a complete Stable Diffusion model. That's finished. We take that as a given, and we put in our image. Well, actually it's noise. We put in our noise, and we put it through two time steps. And then we train our U-Net B, or whatever you want to call it, to try to go directly from the noise to time step number two. And that's pretty easy for it to do. So this thing here, remember, is called the student model. They then say, okay, let's now take that student model and treat that as the new teacher. So they now take their noise and they run it through the student model twice, once and then again, and they get out something at the end. And then they create a new student, which is a copy of the previous student, and it learns to go directly from the noise to two goes of the student model.
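To make the "student learns two teacher steps, then becomes the teacher" idea concrete, here is a deliberately tiny sketch. It is not the paper's training procedure: the "model" is a single multiplication, the student has one weight, and it is fitted with a closed-form least-squares formula instead of gradient descent on a U-Net (and the paper's detail of initializing each student as a copy of the previous one is not modeled). All function names are hypothetical.

```python
def two_teacher_steps(teacher_step, x):
    """The student's training target: the teacher applied twice."""
    return teacher_step(teacher_step(x))


def distill(teacher_step, samples):
    """Fit a one-weight student x -> weight * x to two teacher steps (least squares)."""
    targets = [two_teacher_steps(teacher_step, x) for x in samples]
    weight = sum(x * y for x, y in zip(samples, targets)) / sum(x * x for x in samples)

    def student_step(x):
        return weight * x

    return student_step


def original_teacher(x):
    """Toy 'denoising step': shrink the input by 10%."""
    return 0.9 * x


samples = [1.0, -2.0, 0.5, 3.0]
model = original_teacher
steps_per_call = 1
for _ in range(3):
    # The newest student becomes the teacher for the next round.
    model = distill(model, samples)
    steps_per_call *= 2

print(steps_per_call, model(1.0), 0.9 ** 8)
```

After three rounds a single call to the final student matches eight calls to the original teacher, which is the doubling pattern (one to two, two to four, four to eight) that the paper exploits.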
And you won't be surprised to hear they now take that new student model and use it to do two goes. And then they copy that to become the next student model. And so they're doing it again and again and again. And each time they're basically doubling the amount of work. So it goes one to two. Effectively, it's then going two to four and then four to eight. And that's basically what they're doing, and they're doing it for multiple different time steps. So the single student model is learning both to do these initial steps, trying to jump multiple steps at a time, and to do these later steps, multiple steps at a time. And that's it, believe it or not. So this is a paper that came out last week, and that's how it works. Now, I mentioned that there were actually two papers. The second one is called On Distillation of Guided Diffusion Models. These came out at basically the same time, if I remember correctly, even though they build on each other, and they're from the same teams. The trick in this second paper is that they say, okay, this is all very well, but we don't just want to create random pictures. We want to be able to do guidance, right? And I hope you remember from last week that we used something called classifier-free guided diffusion models, for which, because I'm lazy, we will just use an acronym: CFGD. You may recall how this goes. Let's say we want a cute puppy. We put the prompt "cute puppy" into our CLIP text encoder, and it spits out an embedding. And (let's ignore the VAE latents business) we put that into our U-Net. But we also put the empty prompt into our CLIP text encoder. We concatenate these two things together so that out the other side we get back two things. We get back the image of the cute puppy, and we get back the image of some arbitrary thing. Could be anything.
And then we effectively do something very much like taking the weighted average of these two things, to combine them. And then we use that for the next stage of our diffusion process. Now, what this paper does is it says, this is all pretty awkward. We end up having to process two images instead of one. And for different levels of guided diffusion, we have to do it multiple different times. It's all pretty annoying. How do we skip it? And based on the description of how we did it before, you may be able to guess. What we do is exactly the same student-teacher distillation we did before, but this time we pass in, in addition, the guidance. So again, we've got the entire Stable Diffusion model, the teacher model, available for us. And we are doing actual CFGD, classifier-free guided diffusion, to create our guided diffusion puppy pictures. And we're doing it for a range of different guidance scales. So you might be doing two and 7.5 and 12 and whatever, right? And those now become inputs to our student model. So the student model now has additional inputs. It's getting the noise, as always. It's getting the caption, or the prompt, I guess I should say, as always. But it's now also getting the guidance scale. And so it's learning how all of these things are handled by the teacher model: what does it do after a few steps each time? So it's exactly the same thing as before, but now it's learning to use the classifier-free guided diffusion as well. OK, so there's quite a lot going on there. And if it's a bit confusing, that's OK. It is a bit confusing. What I would recommend is you check out the extra information from Jono, who has a whole video on this. And one of the cool things about this video is that it's actually a paper walkthrough. Part of this course is that hopefully we're going to start reading papers together. Reading papers is extremely intimidating and overwhelming for all of us all of the time.
At least for me, it never gets any better. There's a lot of math. And by watching somebody like Jono, who's an expert at this stuff, read through a paper, you'll get a sense of how he is skipping over lots of the math to focus on, in this case, the really important thing, which is the actual algorithm. And when you actually look at the algorithm, you start to realize it's basically all stuff, nearly all stuff, maybe all stuff, that you did in primary school or secondary school. So we've got division, sampling from a normal distribution (so high school), subtraction, division, division, multiplication, right? Oh, OK, we've got a log there. But basically, there's not too much going on. And then once you turn this into code, of course, it becomes even more understandable if you're somebody who's more familiar with code, like me. So yeah, definitely check out Jono's video on this. Now, another paper came out about three hours ago, and I just had to show it to you because I think it's amazing. This is definitely the first video about this paper, because it only came out a few hours ago. But check this out. This is a paper called Imagic. With this algorithm, you can pass in an input image. So this is just a photo you've taken or downloaded off the internet. And then you pass in some text saying "a bird spreading wings". And what it's going to try to do is take this exact bird in this exact pose and leave everything as similar as possible, but adjust it just enough so that the prompt is now matched. So here we take this little guy here, and we say, actually, what we want is for this to be a person giving the thumbs up. And this is what it produces. And you can see everything else is very, very similar to the previous picture. So this dog is not sitting. But if we put in the prompt "a sitting dog", it turns it into a sitting dog, leaving everything else as similar as possible.
So here's an example of a waterfall. And then you say it's a children's drawing of a waterfall, and now it's become a children's drawing. So lots of people in the YouTube chat are going, oh my god, this is amazing, which it absolutely is. And that's why we're going to show you how it works. And one of the really amazing things is you're going to realize that you understand how it works already. Just to show you some more examples: here's the dog image. Here's the sitting dog, the jumping dog, the dog playing with a toy, the jumping dog holding a frisbee. OK, and here's this guy again, giving the thumbs up, with crossed arms, in a greeting pose with namaste hands, holding a cup. So that's pretty amazing. So I had to show you how this works. I'm not going to go into too much detail, but I think we can get the idea pretty well. So what we do is, again, we start with a fully pre-trained, ready-to-go generative model, like a Stable Diffusion model. That's what this is talking about here: pre-trained diffusion model. In the paper, they actually use a model called Imagen, but none of the details, as far as I can see, depend in any way on what the model is. It should work just fine for Stable Diffusion. And we take the text "a photo of a bird spreading wings". OK, so that's our target. And we create an embedding from that, using, for example, our CLIP encoder, as usual. We then pass it through our pre-trained diffusion model, and we see what it creates. And it doesn't create something that's actually like our bird. So then what they do is they fine-tune this embedding. So this is kind of like textual inversion. They fine-tune the embedding to try to make the diffusion model output something that's as similar as possible to the input image. And so you can see here, they're saying, oh, we're moving our embedding a little bit. They don't do this for very long. They just want to move it a little bit in the right direction. And then they lock that in place.
And they say, OK, now let's fine-tune the entire diffusion model end to end, including the VAE. Actually, with Imagen, they have a super-resolution model, but same idea. So we fine-tune the entire model end to end. And now the embedding, this optimized embedding we created, we hold in place. We don't change that at all. That's now frozen. And we try to make it so that the diffusion model now spits out something as close as possible to our bird. So you fine-tune that for a few epochs. And so you've now got something that takes this embedding that we fine-tuned, goes through a fine-tuned model, and spits out our bird. And then finally, the original target embedding we actually wanted is "a photo of a bird spreading its wings". We ended up with this slightly different embedding. And we take the weighted average of the two. That's the interpolate step: the weighted average of the two. And we pass that through this fine-tuned diffusion model. And we're done. And so that's pretty amazing. This would not take, I don't think, a particularly long time or require any particularly special hardware. It's the kind of thing I expect people will be doing in the coming days and weeks. It's very interesting, because the ability to take any photo of a person or whatever and literally change what the person's doing is societally very important, and really means that anybody, I guess, can now generate believable photos of things that never actually existed. I see Jono in the chat saying that it took about eight minutes to do it for Imagen on TPUs. Although Imagen's quite a slow, big model, the TPUs they used were the latest TPUs. So maybe it's an hour or something for Stable Diffusion on TPUs. All right. So that is a lot of fun. With that, let's go back to our notebook where we left it last time. We had looked at some applications that we can play with in this diffusion-nbs repo, in the Stable Diffusion notebook.
And what we've got now (and to remind you, when I say we, it's mainly actually Pedro, Patrick, and Suraj, so Hugging Face folks, with just a little bit of help from me) is that they now dig into the pipeline to pull it all apart step by step, so you can see exactly what happens. The first thing I was just going to mention is that this is how you can create those gradual denoising pictures. And this is thanks to something called a callback. So you can say here, when you go through the pipeline, every 12 steps call this function. And as you can see, it's going to call it with i and t and the latents. And so then we can just make an image and stick it on the end of an array. And that's all that's happening here. All right. So this is how you can start to interact with a pipeline without rewriting it yourself from scratch. But now what we're going to do is actually build it from scratch. So you don't have to use a callback, because you'll be able to change it yourself. So let's take a look inside the pipeline: what exactly is going on? What's going on in the pipeline is all of the steps that we saw in last week's OneNote notes that I drew, and it's going to be all the code. We're not going to show the code of how each step is implemented. So for example, the CLIP text model we talked about, the thing that takes a prompt as input and creates an embedding, we just take as a given. So we download it. OpenAI has trained one called clip-vit-large-patch14. So we just say from_pretrained, and Hugging Face Transformers will download and create that model for us. Ditto for the tokenizer. And ditto for the autoencoder, and ditto for the U-Net. So there they all are. We can just grab them. So we just take that all as a given. These are the three models that we've talked about: the text encoder (the CLIP encoder), the VAE, and the U-Net. So there they are.
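For reference, loading those pre-trained pieces with `from_pretrained` looks roughly like the sketch below. The CLIP name comes from the lesson; the Stable Diffusion repository ID, the `subfolder` layout, and the `load_components` helper are assumptions made for this sketch and may differ from what the notebook actually uses (some repositories also require accepting a license or logging in first). The library imports are placed inside the function so that defining it does not require `transformers` or `diffusers` to be installed or anything to be downloaded.

```python
# Model identifiers. CLIP_ID is the model named in the lesson;
# SD_ID is an assumed example of a Stable Diffusion repository.
CLIP_ID = "openai/clip-vit-large-patch14"
SD_ID = "CompVis/stable-diffusion-v1-4"


def load_components(device="cuda"):
    """Download (on first use) and return tokenizer, text encoder, VAE and U-Net."""
    # Imported lazily: these are heavy third-party libraries.
    from transformers import CLIPTextModel, CLIPTokenizer
    from diffusers import AutoencoderKL, UNet2DConditionModel

    tokenizer = CLIPTokenizer.from_pretrained(CLIP_ID)
    text_encoder = CLIPTextModel.from_pretrained(CLIP_ID).to(device)
    vae = AutoencoderKL.from_pretrained(SD_ID, subfolder="vae").to(device)
    unet = UNet2DConditionModel.from_pretrained(SD_ID, subfolder="unet").to(device)
    return tokenizer, text_encoder, vae, unet
```

Calling `load_components()` on a machine with a GPU would return the four objects used in the rest of the walkthrough; pass `device="cpu"` to try it without one, at the cost of speed.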
So given that we now have those, the next thing we need is that thing that converts time steps into the amount of noise. Remember that graph we drew. And so we can, again, use something that Hugging Face (well, actually in this case Katherine Crowson) has already provided, which is a scheduler. This is basically something that shows us that connection. So we use that scheduler, and we say how much noise we're using. We have to make sure that that matches, so we just use these numbers that we're given. OK. So now, to create our photograph of an astronaut riding a horse again, in 70 steps with a 7.5 guidance scale and a batch size of one, step number one is to take our prompt and tokenize it. We looked at that in part one of the course, so check that out if you can't remember what tokenizing does. But it's basically just splitting it into words, or subword units if they're long and unusual words. So this will be the start-of-sentence token. And this will be "a photograph of an astronaut", et cetera. And then you can see the same token is repeated again and again at the end. That's just the padding to say we're all done. And the reason for that is that GPUs and TPUs really like to do lots of things at once. So we make everything the same length by padding. That may sound like a lot of wasted work, which it kind of is. But a GPU would rather do lots of things at the same time on exactly the same sized input. So this is why we have all this padding. You can see here that if we decode that number, it's the end-of-text marker, just padding really in this case. As well as getting the input IDs, which are just lookups into a vocabulary, there's also a mask, which just tells it which ones are actual words as opposed to padding, which is not very interesting. So we can now take those input IDs, put them on the GPU, and run them through the CLIP encoder.
And so for a batch size of one, so you've got one image, that gives us back a 77 by 768 tensor, because we've got 77 tokens here and each one of those creates a 768-long vector. So these are the embeddings for "a photograph of an astronaut riding a horse" that come from CLIP. Remember, everything's pre-trained. So that's all done for us. We're just doing inference. And remember, for the classifier-free guidance we also need the embeddings for the empty string. So we do exactly the same thing. Then we just concatenate those two together, because this is just a trick to get the GPU to do both at the same time, because we like the GPU to do as many things at once as possible. And so now we create our noise. Because we're doing it with a VAE, we can call it latents, but it's just noise really. I wonder if you'd still call it that without the VAE. Maybe you would; I'd have to think about that. So that's just normally distributed random numbers, of size one, which is our batch size, and so on. And the reason that we've got this "divided by eight" here is that that's what the VAE does. It allows us to create things that are eight times smaller by height and width, and then it's going to expand it up again for us later. That's why this is so much faster. After we put it on the GPU, you'll see a lot of this .half(). This is converting things into what's called half precision, or FP16. Details don't matter too much. It's just making it half as big in memory by using less precision. Modern GPUs are much, much faster if we do that. So you'll see that a lot. If you use something like fastai, you don't have to worry about it; all this stuff is done for you. And we'll see that later, as we rebuild this with much, much less code later in the course. So we'll be building our own kind of framework from scratch, which you'll then be able to maintain and work with yourself. Okay.
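The "divided by eight" arithmetic can be checked with a few lines of plain Python. This is only a shape calculation, not the notebook's code: the 512 by 512 image size and the four latent channels are assumptions about a typical Stable Diffusion setup, and `latent_shape` is a name invented for this sketch.

```python
import math


def latent_shape(batch_size, height, width, channels=4, downscale=8):
    """Shape of the noise tensor when the VAE shrinks height and width by `downscale`."""
    if height % downscale or width % downscale:
        raise ValueError("height and width must be multiples of the downscale factor")
    return (batch_size, channels, height // downscale, width // downscale)


pixel_shape = (1, 3, 512, 512)       # one RGB image (assumed size)
latents = latent_shape(1, 512, 512)  # the noise we actually start from

print(latents)
print(math.prod(pixel_shape) // math.prod(latents))  # how many times fewer numbers
```

Under these assumptions the U-Net works on 48 times fewer values than it would in pixel space, which is a large part of why working in the VAE's latent space is so much faster.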
So we have to say we want to do 70 steps. Something that's very important (we won't worry too much about the details right now) is what you see here: we take our random noise and we scale it. And that's because, depending on what stage you're up to, you need to make sure that you have the right amount of variance, basically. Otherwise you're going to get activations and gradients that go out of control. This is something we're going to be talking about a huge amount during this course, and we'll show you lots of tricks to handle that kind of thing automatically. Unfortunately, at the moment in the Stable Diffusion world, this is all done in ways that are, in my opinion, too tied to the details of the model. I think we'll be able to improve it as the course goes on, but for now we'll stick with how everybody else is doing it. This is how they do it. So we're going to be jumping through. Normally it would take 1,000 time steps, but because we're using a fancy scheduler, we get to skip from 999 to 984, 984 to 970, and so forth. So we're going down about 14 time steps each time. And remember, this is a very, very unfortunate word. They're not time steps at all. In fact, they're not even integers. It's just a measure of how much noise we are adding each time. And you find out how much noise by looking it up on this graph. That's all "time step" means. It's not a step of time. And it's a real shame that that word is used, because it's incredibly confusing. This is much more helpful: this is the actual amount of noise at each one of those iterations. So here you can see the amount of noise for each of those time steps. And we're going to be going backwards. As you can see, we start at 999. So we'll start with lots of noise, and then we'll be using less and less noise. So we go through the 70 time steps in a for loop, concatenating our two noise bits together, because we've got the classifier-free and the prompt versions.
Do our scaling, calculate our predictions from the U-Net. And notice here we're passing in the time step as well as our prompt. That's going to return two things, starting with the unconditional prediction, so that's the one for the empty string. Remember, one of the two things we passed in was the empty string, so we concatenated them together. And so after they come out of the U-Net, we can pull them apart again. So dot chunk just means pull them apart into two separate variables. And then we can do the guidance scale that we talked about last week. And so now we can do that update where we take a little bit of the noise and remove it to give us our new latents. So that's the loop. And so at the end of all that, we decode it in the VAE. The paper that created this VAE tells us that we have to divide it by this number to scale it correctly. And once we've done that, that gives us a number which is between negative one and one. Python Imaging Library expects something between zero and one, so that's what we do here to make it between zero and one and enforce that to be true. Put that back on the CPU. Make sure that the order of the dimensions is the same as what Python Imaging Library expects. And then finally convert it up to between zero and 255 as an int, which is actually what PIL really wants. And there's our picture. So there's all the steps. The way I normally build code, and I use notebooks for everything, is I do things step by step by step. And then I tend to copy them and I use Shift-M. I don't know if you've seen that, but what Shift-M does is take two cells and combine them, like that. And so I basically combined some of the cells together and I removed a bunch of the prose, so you can see the entire thing on one screen. And what I was trying to do here is get to the point where I've got something which I can very quickly do experiments with.
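The final rescaling steps described above can be sketched on a single value. This is only an illustration using plain Python floats in place of tensors; the helper name to_pixel is hypothetical and is not from the notebook.

```python
def to_pixel(v):
    """Map one decoder output value in [-1, 1] to an integer pixel in [0, 255]."""
    unit = v / 2 + 0.5                # shift the range from [-1, 1] to [0, 1]
    unit = min(max(unit, 0.0), 1.0)   # clamp, to enforce the [0, 1] range
    return round(unit * 255)          # scale up to the 0..255 ints that PIL wants

# Values outside [-1, 1] are clamped rather than wrapped around.
print([to_pixel(v) for v in (-1.0, 1.0, 3.2, -5.0)])   # → [0, 255, 255, 0]
```

The clamp matters because the decoder is not guaranteed to stay inside the nominal range, and an out-of-range value converted straight to an 8-bit integer would wrap.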
So maybe I want to try some different approach to classifier-free guidance. Maybe I want to add some callbacks, and so forth. So I like to have all of my important code be able to fit onto my screen at once. And you can see now I've got the whole thing on my screen, so I can keep it all in my head. One thing I was playing around with was trying to understand the actual classifier-free guidance equation, in terms of how it works. Computer scientists and software engineers tend to write things with long words as variable names. Mathematicians tend to use short names, normally just letters. For me, when I want to play around with stuff like that, I turn stuff back into letters. And that's because I actually pulled out OneNote and started jotting down this equation and playing around with it to understand how it behaves. So it's not better or worse, it just depends on what you're doing. So here I said, okay, g is the guidance scale. And then rather than having the unconditional and text embeddings, I just call them u and t. And now I've got this all down into an equation which I can write down in a notebook and play with and understand exactly how it works. So that's something I find really helpful for working with this kind of code: turning it into a form that I can manipulate algebraically more easily. I also try to make it look as much like the paper that I'm implementing as possible. Anyhow, that's that code. So then I copied all this again, and I actually did it for two prompts this time. I thought this was fun: oil painting of an astronaut riding a horse in the style of Grant Wood. Just to remind you, Grant Wood looks like this. Not obviously astronaut material, which I thought would make it particularly interesting. Although he does have horses. Can't see one here, but some of his pictures have horses.
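With the short names g, u and t, the guidance step is commonly written as u + g(t - u). Here is a minimal sketch of that form on toy lists. The exact line in the notebook is not quoted in the lesson, so treat this as the standard formulation rather than a copy of it; the function name guide is hypothetical.

```python
def guide(u, t, g):
    """Classifier-free guidance: start at u and move g times the way toward t."""
    return [ui + g * (ti - ui) for ui, ti in zip(u, t)]

u = [0.0, 1.0, -2.0]   # toy unconditional prediction (the empty-string prompt)
t = [1.0, 1.0, 0.0]    # toy text-conditioned prediction

print(guide(u, t, 0))     # → [0.0, 1.0, -2.0]   g = 0 ignores the prompt
print(guide(u, t, 1))     # → [1.0, 1.0, 0.0]    g = 1 is just the text prediction
print(guide(u, t, 7.5))   # → [7.5, 1.0, 13.0]   larger g exaggerates the difference
```

Writing it this way makes the behavior easy to reason about on paper: wherever u and t agree nothing changes, and everywhere else the difference is scaled by g.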
So because I did two prompts, I got back two pictures. So here's the Grant Wood one. I don't know what's going on in his back here, but I think it's quite nice. So yeah, I then copied that whole thing again and merged it all together and put it into functions. I took the little bit which creates an image and put that into a function. I took the bit which does the tokenizing and text encoding and put that into a function. And so now all of the code necessary to do the whole thing from top to bottom fits in these two cells, which makes it much easier for me to see exactly what's going on. So you can see I've got the text embeddings. I've got the unconditional embeddings. I've got the embeddings which concatenate the two together. An optional random seed. My latents. And then the loop itself. And you'll also see something I do which is a bit different to a lot of software engineering: I often write longer lines, because I try to have each line be mathematically one thing that I want to be able to think about as a whole. So yeah, there are some differences between the way I find numerical programming works well and the way I would write more traditional software engineering code. And again, this is partly a personal preference, but it's something I find works well for me. So we're now at a point where we've got three functions that easily fit on the screen and do everything. So I can now just say make samples and display each image. And so this is something for you to experiment with. What I specifically suggest as homework is to try picking one of the extra tricks we learned about, like image-to-image or negative prompts. Negative prompts would be a nice easy one: see if you can implement negative prompts in your version of this. Or try doing image-to-image; that wouldn't be too hard either. Another one you can try is adding callbacks.
And the nice thing is that you've then got code which you fully understand, because you know what all the lines do. And you don't need to wait for the diffusers folks to update the library to do this. For example, callbacks were only added about a week ago, so until then you couldn't do callbacks. Well, now you don't have to wait for the diffusers team to add something. The code's all here for you to play with. So that's my recommendation as a bit of homework for this week. Okay, so that brings us to the end of our rapid overview of Stable Diffusion and some very recent papers that very significantly develop Stable Diffusion. I hope that's given you a good sense of the very high level, slightly hand-wavy version of all this, and that you can actually get started playing with some fun code. What we're going to be doing next is going right back to the start, learning how to multiply two matrices together effectively, and then gradually building from there until we've got to the point where we've rebuilt all this from scratch and we understand why things work the way they do, how to debug problems, how to improve performance, and how to implement new research papers as well. So that's going to be very exciting. We're going to have a break, and I will see you back here in 10 minutes. Okay, welcome back everybody. I'm really excited about the next part of this. It's going to require some serious tenacity and a certain amount of patience, but I think you're going to learn a lot. A lot of folks I've spoken to have said that previous iterations of this part of the course were the best course they've ever done, and this one's going to be dramatically better than any previous version we've done. So hopefully you'll find that the hard work and patience pay off.
We're working now through the course22p2 repo, so 2022 course part 2, and the notebooks are ordered, so we'll start with notebook number one. Okay, so the goal is to get to Stable Diffusion from the foundations, which means we have to define what the foundations are. I've decided to define them as follows. We're allowed to use Python. We're allowed to use the Python standard library, so that's all the stuff that comes with Python by default. We're allowed to use Matplotlib, because I couldn't be bothered creating my own plotting library. And we're allowed to use Jupyter notebooks and nbdev, which is something that creates modules from notebooks. So basically what we're going to try to do is rebuild everything starting from this foundation. Now, to be clear, we are also allowed to use other libraries once we have re-implemented them correctly. So if we re-implement something from NumPy or from PyTorch or whatever, we're then allowed to use the NumPy or PyTorch or whatever version. Sometimes we'll be creating things that haven't been created before, and that's then going to be building our own library, and we're going to be calling that library miniai. So we're going to be building our own little framework as we go. So for example, here are some imports, and these imports all come from the Python standard library except for these two. Now, one challenge we have is that the models we use in Stable Diffusion were trained on millions of dollars' worth of equipment for months, and we don't have the time or money for that. So another trick we're going to use is to create identical but smaller versions of them, and once we've got them working we'll then be allowed to use the big pre-trained versions. So that's the basic idea. So we're going to have to end up with our own VAE, our own U-Net, our own CLIP encoder, and so forth. To some degree I am assuming that you've completed part one of the course.
I will cover everything at least briefly, but if I cover something about deep learning too fast for you to know what's going on and you get lost, go back and watch part one, or go and Google for that term. For stuff that we haven't covered in part one, I will go over it very thoroughly and carefully. All right, so I'm going to assume that you know the basic idea, which is that we're going to need to be doing some matrix multiplication. So we're going to try to take a deep dive into matrix multiplication today. We're going to need some input data, and I quite like working with MNIST data. MNIST is handwritten digits. It's a classic data set. They're 28 by 28 pixel grayscale images, and we can download them from this URL. We use pathlib's Path object a lot. It's part of Python, and it basically takes a string and turns it into something that you can treat as a path. For example, you can use slash to mean this file inside this subdirectory. So this is how we create a Path object. Path objects have, for example, a mkdir method. I like to get everything set up, but I want to be able to rerun this cell lots of times and not have it give me errors if I run it more than once. If I run it a second time it still works, and in this case that's because I put exist_ok=True. Otherwise it would try to make the directory, it would already exist, and it would give an error. How do I know what parameters I can pass to mkdir? I just press Shift-Tab, and when I hit Shift-Tab it tells me what options there are. If I press it a few times it'll actually pop it down to the bottom of the screen to remind me, and I can press Escape to get rid of it. Or else you can just hit Tab inside the parentheses and it will list all the things you could type here, as you can see.
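Here is a small, self-contained sketch of the Path behavior just described. It works inside a throwaway temporary directory so it is safe to run anywhere; the directory and file names are placeholders, not necessarily the ones used in the notebook.

```python
from pathlib import Path
import tempfile

base = Path(tempfile.mkdtemp())        # throwaway location for this sketch
path_data = base / 'data'              # '/' joins path components
path_data.mkdir(exist_ok=True)         # creates the directory
path_data.mkdir(exist_ok=True)         # running it again is fine: no error
path_gz = path_data / 'mnist.pkl.gz'   # placeholder file name inside the directory

print(path_data.is_dir(), path_gz.exists())   # → True False
```

Without exist_ok=True, the second mkdir call would raise FileExistsError, which is exactly the rerun problem the flag avoids.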
So we need to grab this URL, and Python comes with something for doing that, which is the urllib library. That's part of Python, and it has something called urlretrieve. And something which I'm always a bit surprised is not more widely done is people reading the Python documentation, so you should do that a lot. So if I click on that, here is the documentation for urlretrieve, and I can find exactly what it can take and learn about exactly what it does. I read the Python docs for every single method I use, and I look at every single option that it takes, and then I practice with it. And the place to practice with it is inside Jupyter. So if I want this import on its own, I can hit Ctrl-Shift-hyphen and it's going to split it into two cells. Then I'll hit Alt-Enter, or Option-Enter, so I can create something underneath, and I can type urlretrieve, Shift-Tab, and there it all is. If I'm way down somewhere in the notebook and I have no idea where urlretrieve comes from, I can just hit Shift-Enter on it and it actually tells me exactly where it comes from. If I want to know more about it, I can just add a question mark and hit Shift-Enter, and it's going to give me the documentation. And best of all, with a second question mark it gives me the full source code, and you can see it's not a lot. You know, reading the source code of Python standard library stuff is often quite revealing. You can see exactly how they do it, and that's a great way to learn more about this. So in this case I'm just going to use very simple functionality, which is I'm going to give it the URL to retrieve and the file name to save it as. And again, I made it so I can run this multiple times, so it's only going to do the urlretrieve if the path doesn't exist. If I've already downloaded it, I don't want to download it again. So I run that cell. And notice that I can put an exclamation mark followed by a line of bash, and it actually runs this using bash. If you're using Windows this won't work, and I
would very, very strongly suggest that if you're using Windows, you use WSL. If you use WSL, all of these notebooks will work perfectly. So yeah, do that, or run it on Paperspace or Lambda Labs or something like that, Colab, etc. Okay, so this is a gzip file, and thankfully Python comes with a gzip module. Python comes with quite a lot, actually. So we can open a gzip file using gzip.open, and we can pass in the path, and we say we're going to read it as binary as opposed to text. Okay, so this is called a context manager. It's a with clause, and what it's going to do is open up this gzip file, the gzip object will be called f, and then it runs everything inside the block, and when it's done it will close the file. With blocks can do all kinds of different things, but in general with blocks that involve files are going to close the file automatically for you. So we can now do that, and you can see it's opened up the gzip file. The gzip file contains what are called pickled objects. Pickled objects are basically Python objects that have been saved to disk. It's the main way that people in pure Python save stuff, and it's part of the standard library. So this is how we load from that file. Now, the file contains a tuple of tuples. When you put a tuple on the left-hand side of an equals sign, it's quite neat: it allows us to put the first tuple into x_train and y_train and the second into x_valid and y_valid. This trick here, where you put stuff like this on the left, is called destructuring, and it's a super handy way to make your code clear and concise. Lots of languages support that, including Python. So we've now got some data, and we can have a look at it. It's a bit tricky because we're not allowed to use NumPy according to our rules, but unfortunately this actually comes as NumPy, so I've turned it into a list. I've taken the first image and turned it into a list, and so we can look at a few examples of some values in that list. And here they are. So it looks
like they're numbers between 0 and 1. This is what I do when I learn about a new data set. When I started writing this notebook, what you see here, other than the prose, is what I actually did when I was working with this data. I wanted to know what it was, so I just grabbed a little bit of it and looked at it. So I've got a sense now of what it is. Now, interestingly, this image is a 784 long list. Oh dear, people are freaking out in the comments. No NumPy. Yeah, no NumPy. Do you see NumPy? No NumPy. Why 784? What is that? Well, that's because these are 28 by 28 images, so it's just a flat list here, 784 long. So how do I turn this 784 long thing into 28 by 28? I want a list of 28 lists of 28, basically, because we don't have matrices. So how do we do that? We're going to be learning a lot of cool stuff in Python here. Sorry, I can't help laughing at all the stuff in our chat. Oh dear, people are quite reasonably freaking out. That's okay, we'll get there, I promise. I hope. Otherwise I'll embarrass myself. All right, so how do I convert a 784 long list into 28 lists, each 28 long? I'm going to use something called chunks. First of all I'll show you what this thing does, and then I'll show you how it works. So vals is currently a list of 10 things. Now if I take vals and I pass it to chunks with 5, it creates 2 lists of 5. Here's list number 1, of 5 elements, and here's list number 2, of 5 elements. Hopefully you can see what it's doing: it's chunkifying this list, and this is the length of each chunk. Now, how did it do that?
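As a concrete reference for that behavior, here is a minimal sketch of a chunks helper written as a generator. The numbers in vals are stand-ins rather than real pixel values, and the 784-long list is synthetic.

```python
def chunks(x, sz):
    """Yield successive pieces of x, each sz long (the last one may be shorter)."""
    for i in range(0, len(x), sz):
        yield x[i:i + sz]

vals = list(range(10))         # stand-in for ten pixel values
print(list(chunks(vals, 5)))   # → [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

flat = [0.0] * 784             # stand-in for one flattened 28 by 28 image
img = list(chunks(flat, 28))
print(len(img), len(img[0]))   # → 28 28
```

Because chunks is a generator, nothing is sliced until something asks for the next piece, and wrapping it in list() is what runs it to completion.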
The way I did it is using a very, very useful thing in Python that far too many people don't know about, which is called yield. You can see here I've got a loop. It's going to go through from 0 up to the length of my list, and it's going to jump by 5 at a time, so in this case it's going to go 0, 5. And then, think of this as being like return for now: it's going to return the list from 0 up to 5, so it returns the first bit of the list. But yield doesn't just return. It kind of returns a bit, and then it continues, and it returns a bit more. Specifically, what yield does is create an iterator. An iterator is basically something, let's actually use it, that you can call next on a bunch of times. So let's try it and create one. Okay, so what is an iterator? It's something that I can call next on, and next basically says yield the next thing. So this should yield vals[0:5]. There it is, it did, right? There's vals[0:5]. Now if I run that again, it's going to give me a different answer, because it's now up to the second part of this loop. Now it returns the last 5. Okay, so this is what an iterator does. Now if you pass an iterator to Python's list, it runs through the entire iterator until it's finished and creates a list of the results. And what does finished look like? This is what finished looks like: if you call next and get StopIteration, that means you've run out. And that makes sense, right? There's nothing left in it. So all of that is to say we now have a way of taking a list and chunkifying it. So what if I now take my full image, image number 1, chunkify it into chunks 28 long, turn that into a list, and plot it? We have successfully created an image. So that's good. Now we are done, but there are other ways to create this iterator, and because iterators and generators, which are closely related, are so important, I wanted to show you more about how to do them in Python. Because if you understand this, you'll often
find that you can throw away huge pieces of enterprise software and basically replace them with an iterator. It lets you stream things one bit at a time. It doesn't store it all in memory. It's this really powerful thing, and once I show it to people they suddenly go, oh wow, we've been using all this third-party software and we could have just created a Python iterator. Python comes with a whole standard library module called itertools just to make it easier to work with iterators. I'll show you one example of something from itertools, which is islice. So let's grab our values again, these 10 values. We can take any list and turn it into an iterator by passing it to iter, and I'll call the result it, so I don't override Python's own iter, which I don't want to override. So this is now basically something that I can call next on. Actually, let's do this, I'll show you that I can call next on it. So if I now call next, it is giving me each item one at a time. Okay, so that's what converting it into an iterator does. islice converts it into a different kind of iterator. Let's call this maybe the islice iterator. And so you can see here what it did was it jumped... here we are. Ah yes, that's what would have been better: I should create the iterator and then call next a few times. Sorry, this is what I meant to do. It's now only returning the first five before it raises StopIteration. So what islice does is grab the first n things from an iterable, an iterable being something that you can iterate over. Why is that interesting? Because I can pass it to list, for example. And now if I pass it to list again, this iterator has already grabbed the first five things, so it's now up to thing number six. So if I call it again, I get the next five things, and if I call it again, then there's nothing left. And maybe you can see that we've already got chunks defined, but we can also do it with islice, and here's how we can do it. It's actually pretty
tricky. With iter in Python, you can pass it something like a list to create an iterator, or you can pass it, and this is a really important word, a callable. What's a callable? A callable is, generally speaking, a function. It's something that you can put parentheses after. It could even be a class. Anything you can put parentheses after; you can just think of it for now as a function. So we're going to pass it a function, and in this second form the function is going to be called until it returns this value here, which in this case is an empty list. And we just saw that list of islice will return an empty list when it's done. So this here is going to keep calling this function again and again and again, and we've seen exactly what happens because we've called it ourselves before, there it is, until it gets an empty list. So if we do it with 28, then we're going to get our image again. So we've now got two different ways of creating exactly the same thing. And if you've never used iterators before, now's a good time to pause the video and play with them. So for example, you could take this here. If you've not seen lambdas before, they're exactly the same as functions, but you can define them inline. So let's replace that with a function. Okay, so now I've turned it into a function, and then you can experiment with it. So let's create our iterator and call f on it. Well, not on it, just call f. And you can see there's the first 28, and each time I do it I'm getting another 28. Now, the first few rows are all empty, but finally, look, now I've got some values. Call it again, and see how each time I'm getting something else. Just calling it again and again, and those are the values in our iterator. So that gives you a sense of how you can use Jupyter to experiment. What you should do is, as soon as you hit something in my code that doesn't look familiar to you, I recommend pausing the video and experimenting with it in Jupyter. For example, most people probably have not used iter at all, and certainly
very few people have used this two-argument form. So hit Shift-Tab a few times, and now you've got it at the bottom. There's a description of what it is. Or find out more: Python iter. Here we are, go to the docs. Well, that's not the right bit of the docs. C API? Wow, crazy, that's terrible. Let's try searching here: iter. There we go, that's more like it. So now you've got links. So if it says, okay, it returns an iterator object, well, click on it and find out. This is really important to know. And here's that stop exception that we saw, the StopIteration exception. We saw next already. We can find out what an iterable is. And here's an example, and as you can see it's using exactly the same approach that we did, but here it's being used to read from a file. This is really cool: here's how to read from a file 64 bytes at a time until you get nothing, processing it as you go. So the docs of Python are quite fantastic, as long as you use them. If you don't use them, they're not very useful at all. And I see in the comments our local Haskell programmer appreciating this Haskell in Python. That's good. It's not quite Haskell, I'm afraid, but it's the closest we're going to come. All right, how are we going for time? Pretty good. So now that we've got image, which is a list of lists, where each list is 28 long, we can index into it. So we can say image 20. Let's do it: image 20, okay, is a list of 28 numbers. And then we could index into that. Okay, so we can index into it. Now, normally we don't like to do that for matrices. We would normally rather write it like this, as image[20,15]. So that means we're going to have to create our own class to make that work. To create a class in Python, you write class and then the name of it, and then you write some really weird things. The weird things you write have two underscores, a special word, and then two underscores. These things with two underscores on each side are called dunder methods, and they're all the special, magically named methods which have particular meanings to Python, and you just got to learn them
but they're all documented in the Python object model. Init, object model... yay, finally. Okay, so what you eventually find is, oh, it's called the data model, not the object model. And so this is basically where all the documentation is about absolutely everything. I can click on dunder init, and it tells you basically that this is the thing that constructs objects. So any time you want to create a class that you want to construct, it's going to store some stuff, in this case our image, and you have to define dunder init, written __init__. Python's slightly weird in that for every method you have to put self here, for reasons we probably don't really need to get into right now, and then any parameters. So we're going to be creating an image, passing in the thing to store, the xs. So we're going to be passing in the xs, and here we're just going to store them inside self. Once I've got this line of code, I've got something that knows how to store stuff, the xs, inside itself. So now I want to be able to call square bracket 20 comma 15. How do we do that? Well, basically, as part of the data model there's a special thing called dunder getitem, written __getitem__, and when you use square brackets on your object, that's what Python calls. It's going to pass across the 20 comma 15 here as indices. So we're now basically just going to return self.xs indexed with the first index and then the second index. So let's create that Matrix class and run that, and you can now see m[20,15] is the same. A quick note on ways in which my code is different to everybody else's, which it is. It's somewhat unusual to put definitions of methods on the same line as the signature like this. I do it quite a lot for one-liners. As I mentioned before, I find it really helps me to be able to see all the code I'm working with on the screen at once. A lot of the world's best programmers have actually had that approach as well. It seems to work quite well for some people that are extremely productive. It's not common in Python.
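The class described above can be sketched as follows, in the same one-line-method style. The image here is a synthetic 28 by 28 list of lists, so the values are placeholders; only the two dunder methods are the point.

```python
class Matrix:
    # __init__ stores the list of lists; __getitem__ receives (row, col) as a tuple.
    def __init__(self, xs): self.xs = xs
    def __getitem__(self, idxs): return self.xs[idxs[0]][idxs[1]]

# Synthetic image: the value at each position encodes its own row and column.
img = [[row * 28 + col for col in range(28)] for row in range(28)]
m = Matrix(img)

print(m[20, 15], img[20][15])   # → 575 575
```

When you write m[20, 15], Python packs the two indices into the tuple (20, 15) and passes that single object to __getitem__, which is why the method unpacks idxs[0] and idxs[1].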
Some people are quite against that one-line style, so if you're at work and your colleagues don't write Python this way, you probably shouldn't either. But if you can get away with it, I think it works quite well. Okay, so now that we've created something that lets us index into things like this, we're allowed to use PyTorch, or at least we're allowed to use this one feature in PyTorch. Okay, so we can now do that. And so now, to create a tensor, which is basically a lot like our matrix, we can pass a list into tensor to get back a tensor version of that list. Or perhaps more interestingly, we could pass in a list of lists. Maybe let's give this a name. That needs to be a list of lists, just like we had before for our image. In fact, let's do it for our image. Let's just pass in our image. There we go. And so now we should be able to say tens[20,15], and there we go. So we've successfully reinvented that. All right, so now we can convert all of our lists into tensors. There's a convenient way to do this, which is to use the map function in the Python standard library. So Shift-Tab: map takes a function and then some iterables, in this case one iterable, and it's going to apply this function to each of these four things and return those four things. And so then I can put four things on the left to receive those four things. So this is going to call tensor on x_train, tensor on y_train, and so forth. So this is converting all of these lists to tensors and storing them back under the same names. So you can see that x_train is now a tensor, which means it has a shape property. It has 50,000 images in it, which are each 784 long. And you can find out what kind of stuff it contains by calling dot type on it, so it contains floats. So this is the tensor class. We'll be using it a lot, so of course you should read its documentation. I don't love the PyTorch documentation. Some of it's good, some of it's not good. It's a bit all over the place. So here's tensor, but it's well worth scrolling through to get a sense of it. This is actually not bad,
right? It tells you how you can construct it. This is how I constructed one before, passing it lists of lists. You can also pass it NumPy arrays, you can change types, and so on and so forth. So, you know, it's well worth reading through. You're not going to look at every single method it has, but if you browse through it you'll get a general sense that tensors do just about everything you could think of for numeric programming. At some point you will want to know every single one of these, or at least be aware roughly of what exists, so you know what to search for in the docs. Otherwise you will end up recreating stuff from scratch, which is much, much slower than simply reading the documentation to find out it's there. So instead of calling chunks or islice, the thing that is roughly equivalent in a tensor is the reshape method. So to reshape our 50,000 by 784 thing, we want to turn it into 50,000 28 by 28 tensors. I could write here reshape to 50,000 by 28 by 28, but I don't need to, because I could just put minus one here and it can figure out that that must be 50,000, because it knows that I have 50,000 by 784 items. So minus one means just fill this with all the rest. Okay, now what does the word tensor mean? There's some very interesting history here, and I'll try not to get too far into it, because I'm a bit over-enthusiastic about this stuff, I must admit. I'm very, very interested in the history of tensor programming and array programming, and it basically goes back to a language called APL. APL is basically, originally, a mathematical notation that was developed in the mid to late 1950s. At first it was used as a notation for defining how certain new IBM systems would work, so it was all written out in this notation. It's kind of like a replacement for mathematical notation that was designed to be more consistent and more expressive. The guy who made it was
called Ken Iverson. In the early 60s, some implementations that actually allowed this notation to be executed on a computer appeared. Slightly confusingly, both the notation and the executable implementations are called APL. APL has been in constant development ever since that time, and today it is one of the world's most powerful programming languages. You can try it by going to Try APL. And why am I mentioning it here? Because of one of the things Ken Iverson did. He studied an area of physics called tensor analysis, and as he developed APL he basically said, oh, what if we took these ideas from tensor analysis and put them into a programming language? So in APL you can, and have been able to for some time, define a variable. And rather than saying equals, which is a terrible way to define things, really, mathematically, because that has a very different meaning most of the time in math, we use an arrow to define things. We can say, okay, A is going to be a tensor, like so. And then we can look at the contents of A, and we can do things like, oh, what if we do A times 3, or A minus 2, and so forth. And as you can see, what it's doing is taking all the contents of this tensor and multiplying them all by 3, or subtracting 2 from all of them. Or perhaps more fun, we could put a different tensor into B, and we can now do things like A divided by B, and you can see it's taking each element of A and dividing it by the corresponding element of B. Now this is very interesting, because now we don't have to write loops anymore. We can just express things directly. We can multiply things by scalars. This is called a rank 1 tensor, that is to say, in math we'd basically call it a vector. We can take 2 vectors and divide one by the other, and so forth. It's a really powerful idea. Funnily enough, APL didn't call them tensors, even though Ken Iverson said he got this idea from tensor analysis. APL called them arrays. NumPy, which was heavily influenced by APL, also calls
them arrays. For some reason PyTorch, which is very heavily influenced by APL, sorry, by NumPy, doesn't call them arrays, it calls them tensors. They're all the same thing: they are rectangular blocks of numbers. They can be one-dimensional, like a vector; they can be two-dimensional, like a matrix; they can be three-dimensional, which is like a bunch of stacked matrices, like a batch of matrices; and so forth.
If you are interested in APL, which I hope you are, we have a whole APL and array programming section on our forums, and also a whole set of notes on every single glyph in APL, which also covers all kinds of interesting mathematical concepts, like complex direction and magnitude, and all kinds of fun stuff like that. That's all totally optional, but people who do APL say that they feel like they've become a much better programmer in the process. You'll also find on the forums a set of 17 study sessions of an hour or two each, covering the entirety of the language, every single glyph. So that's where this stuff comes from.
So this batch of 50,000 28x28 images is what we call a rank 3 tensor in PyTorch. In NumPy we would call it an array with 3 dimensions. Those are the same thing. So what is the rank? The rank is just the number of dimensions. It's 50,000 images of 28 high by 28 wide, so there are 3 dimensions; that is the rank of the tensor. If we then pick out a particular image and look at its shape, we could call this a matrix, it's a 28x28 tensor, or we could call it a rank 2 tensor. A vector is a rank 1 tensor. In APL a scalar is a rank 0 tensor, and that's the way it should be. A lot of languages and libraries unfortunately don't think of it that way, so what a scalar is depends a bit on the language.
OK, so we can index into the 0th image, 20th row, 15th column to get back this same number. We can take x_train.shape, which is 50,000 by 784, and you can destructure it into n, which is the number of images, and c, which is the full number of
columns, for example. And we can also, well, this is actually part of the standard library, so we're allowed to use min. So we can find out in y_train what's the smallest number and what's the maximum number. They go from 0 to 9. You can see here it's not just the number 0, it's a scalar tensor 0. They act almost the same most of the time. Here's an example of a bit of y_train, so you can see these are the labels, these are our digits, and this is its shape, so there's just 50,000 of these labels. And since we're allowed to use min and max from the standard library, and they also exist in PyTorch, that means we're also allowed to use the .min() and .max() methods.
All right, so before we wrap up we're going to do one more thing, and I don't know what we would call it, kind of anti-cheating. According to our rules we're allowed to use random numbers, because there is a random number generator in the Python standard library, but we're going to do random numbers from scratch ourselves. The reason is that even though according to the rules we would be allowed to use the standard library one, it's actually extremely instructive to build our own random number generator from scratch. Well, at least I think so. Let's see what you think.
There is no way, normally, in software to create a random number. Unfortunately computers, you know, add, subtract, times, logic gates, stuff like that. So how does one create random numbers?
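Before turning to random numbers, the shape bookkeeping from earlier in this section can be sketched in plain Python. This is only an illustration of the arithmetic behind passing minus one to reshape and of rank as the number of dimensions; `infer_shape` is a hypothetical helper written for this note, not a PyTorch function.

```python
from math import prod

def infer_shape(n_items, shape):
    """Hypothetical helper: resolve a single -1 in `shape` so the sizes multiply to n_items."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be -1")
    known = prod(d for d in shape if d != -1)  # product of the sizes we were given
    if -1 in shape:
        if known == 0 or n_items % known != 0:
            raise ValueError(f"cannot fit {n_items} items into shape {shape}")
        # -1 means "fill this with all the rest"
        return tuple(n_items // known if d == -1 else d for d in shape)
    if known != n_items:
        raise ValueError(f"cannot fit {n_items} items into shape {shape}")
    return tuple(shape)

n, c = 50_000, 784                      # the destructured shape: images, columns
shape = infer_shape(n * c, (-1, 28, 28))
rank = len(shape)                       # rank is just the number of dimensions
print(shape, rank)                      # (50000, 28, 28) 3
```

In PyTorch the same request is written as a reshape call with -1 in the first position, and the library does this inference for you.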
Well, you could go to the Australian National University quantum random number generator. This looks at the quantum fluctuations of the vacuum and provides an API which will actually hook you in and return quantum random fluctuations of the vacuum. That's the most random thing I'm aware of, so that would be one way to get random numbers, and there's actually an API for that, so there's a bit of fun. You could do what Cloudflare does: Cloudflare has a huge wall full of lava lamps, and it uses the pixels of a camera looking at those lava lamps to generate random numbers. Intel nowadays actually has something in its chips which you can call, RDRAND, which returns random numbers on certain Intel chips from 2012.
All of these things are kind of slow; they can kind of get you one random number from time to time. We want some way of getting lots and lots of random numbers, and so what we do is use something called a pseudo-random number generator. A pseudo-random number generator is a mathematical function you can call lots of times, and each time you call it, it will give you a number that looks random. To show you what I mean by that, I'm going to run some code. I've created a function, which we'll look at in a moment, called rand, and if I call rand 50 times and plot it, there's no obvious relationship between one call and the next. That's one thing that I would expect to see from my random numbers: I would expect that each time I call rand the numbers would look quite different to each other. The second thing is that rand is meant to be returning uniformly distributed random numbers, and therefore if I call it lots and lots and lots of times and plot its histogram, I would expect to see exactly this, which is: there's a few from 0 to 0.1, there's a few from 0.1 to 0.2, there's a few from 0.2 to 0.3. It's a fairly evenly spread thing. These are the two key things I would expect to see: an even distribution of random numbers, and that there's no correlation, or no obvious correlation, from one to the
other. So we want to try and create a function that has these properties. We're not going to derive it from scratch. I'm just going to tell you that we have a function here called the Wichmann-Hill algorithm. This is actually what Python used to use, back before Python 2.3. The key reason we need to know about this is to understand really well the idea of random state. Random state is a global variable, or at least it can be; most of the time when we use it, we use it as a global variable. It's just basically one or more numbers.
So we're going to start with no random state at all. We're going to create a function called seed that we're going to pass something to, and I just smashed the keyboard to create this number, so this is my random number. You could get this from the ANU quantum vacuum generator, or from Cloudflare's lava lamps, or from your Intel chip's RDRAND, or in Python we pretty much always use the number 42. Any of those are fine. So you pass in some number, or you can pass in the current tick count in nanoseconds; there's various ways of getting some random starting point. If we pass it into seed, it's going to do a bunch of modular divisions and create a tuple of three things, and it's going to store them in this global state. So the random state now contains three numbers. OK, so why did we do that?
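Here is a minimal sketch of the seeding step just described, together with a generator that reads and updates the same global state. The multipliers and moduli are the published Wichmann-Hill constants; the divisors used in `seed`, the variable name `rnd_state`, and the seed value are assumptions made for this illustration rather than a copy of the lesson notebook.

```python
rnd_state = None  # global random state: a tuple of three integers once seeded

def seed(a):
    """Split one integer into three small non-zero integers and store them globally."""
    global rnd_state
    a, x = divmod(a, 30268)
    a, y = divmod(a, 30306)
    a, z = divmod(a, 30322)
    rnd_state = int(x) + 1, int(y) + 1, int(z) + 1

def rand():
    """Return a float in [0, 1) and replace the global state with the next one."""
    global rnd_state
    x, y, z = rnd_state            # pull out the current state
    x = (171 * x) % 30269          # three small multiply-and-modulo generators
    y = (172 * y) % 30307
    z = (170 * z) % 30323
    rnd_state = x, y, z            # store the new state for the next call
    return (x / 30269 + y / 30307 + z / 30323) % 1.0  # keep only the fractional part

seed(123456789)                    # arbitrary starting number, chosen for this sketch
first, second = rand(), rand()     # two consecutive calls give different numbers

# Rough uniformity check: count how many samples fall in each tenth of [0, 1)
samples = [rand() for _ in range(10_000)]
counts = [0] * 10
for s in samples:
    counts[int(s * 10)] += 1
print(first, second, counts)
```

Calling seed again with the same number restarts the sequence from the same place, which is exactly why the state matters.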
The reason we did that is because now this function, which takes our random state, unpacks it into three things, and does again a bunch of multiplications and modulos, and then sticks them together with various kinds of weights, modulo one, which is how you can pull out the decimal part, returns random numbers. But the key thing I want you to understand is that we pull out the random state at the start, we do some math thingies to it, and then we store the new random state. That means that each time I call this I'm going to get a different number. So this is a random number generator, and this is really important, because lots of people in the deep learning world screw this up, including me sometimes: you have to remember that random number generators rely on this state.
So let me show you where that will get you if you're not careful. If we use this special thing called fork, that creates a whole separate copy of this Python process. In one copy os.fork returns true, in the other copy it returns false, roughly speaking (strictly, it returns the child's process ID in the original and 0 in the copy). The true version is the original, non-copied one, called the parent, so this branch will only be called by the parent, and in my else here, this will only be called by the copy, which is called the child. In each one I'm calling rand. These are two different random numbers, right? Wrong, they're the same number. And why is that? That's because this process here and this process here are copies of each other, and therefore they each contain the same numbers in the random state.
This is something that comes up in deep learning all the time, because in deep learning we often do parallel processing, for example to generate lots of augmented images at the same time using multiple processes. fastai used to have a bug, in fact, where we failed to correctly initialize the random number generator separately in each process. And in fact to this day, at least as of October 2022, torch.rand itself by default fails to initialize the random number generator. That's
the same number. So you've got to be careful. Now, I have a feeling NumPy gets it right. Let's check. Is this how you do it? I don't quite remember which way. OK, NumPy also doesn't. How interesting. What about Python? Ah, look at that. So Python does actually remember to initialize the random stream in each fork. So this is something where, even if you've experimented in Python and you think everything's working well in your data loader or whatever, then you switch to PyTorch or NumPy and now suddenly everything's broken. This is why we've spent some time re-implementing the random number generator from scratch: partly because it's fun and interesting, and partly because it's important that you now understand that when you're calling any random number generator, with kind of the default versions in NumPy and PyTorch, this global state is going to be copied. So you've got to be a bit careful.
Now, I will mention our random number generator. OK, so this is called %timeit. Percent is a special Jupyter or IPython function, and %timeit runs a piece of Python code this many times. So to call it 10 times, well, actually it will do 7 runs and each one will be 10 times, and it will take the mean and standard deviation. Here I am going to generate random numbers 7,840 times and put them into 10-long chunks, and if I run that, it takes me 3 milliseconds per loop. If I run it using PyTorch, this is the exact same thing in PyTorch, it's going to take me 73 microseconds per loop. So as you can see, although we could use our version, we're not going to, because the PyTorch version is much, much faster.
This is how we can create a 784 by 10. And why would we want this? That's because this is the final layer of our neural net. If we're doing a linear classifier, our linear weights would need to be 784, because that's 28 by 28, by 10, because that's the number of possible outputs, the number of possible digits. All right, that is it. So quite the intense lesson, I think we can all agree,
should keep you busy for a week. Thanks very much for joining, and see you next time. Bye everybody.
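As a recap of the forking pitfall from this lesson, here is a minimal sketch for POSIX systems (it relies on os.fork, so it will not run on Windows). It uses an explicitly created random.Random(42) object as a stand-in for any generator whose state lives in ordinary process memory; as far as I know such an instance is not reseeded automatically on fork. The function name `draw_in_parent_and_child` and the reseeding fix are assumptions made for this illustration, not the lesson's code.

```python
import os
import random

def draw_in_parent_and_child(reseed_child=False):
    """Fork once and return (parent_value, child_value) drawn from the same generator."""
    rng = random.Random(42)              # generator state is plain process memory
    read_fd, write_fd = os.pipe()        # lets the child report its number back
    pid = os.fork()
    if pid == 0:                         # child: a copy of the parent, state included
        os.close(read_fd)
        if reseed_child:
            rng.seed(os.urandom(16))     # one possible fix: fresh entropy per process
        os.write(write_fd, repr(rng.random()).encode())
        os._exit(0)                      # exit at once so the child runs nothing else
    os.close(write_fd)                   # parent: read what the child drew
    with os.fdopen(read_fd) as pipe:
        child_value = float(pipe.read())
    os.waitpid(pid, 0)
    return rng.random(), child_value

same_parent, same_child = draw_in_parent_and_child()
fixed_parent, fixed_child = draw_in_parent_and_child(reseed_child=True)
print(same_parent == same_child)         # the copied state yields the same number
print(fixed_parent == fixed_child)       # reseeding in the child makes them differ
```

The design point is that whichever process is the copy has to re-initialize its generator itself; nothing about fork does that for state you own.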