Hi everybody, and welcome to Deep Learning Foundations to Stable Diffusion. Hopefully it's not too confusing that this is described here as lesson nine. That's because, strictly speaking, we treat this as part two of the Practical Deep Learning for Coders series. Part one had eight lessons, so this is lesson nine. But don't worry, you didn't miss anything — it's the first lesson of part two, which is called Deep Learning Foundations to Stable Diffusion. And maybe rather than calling it Practical Deep Learning for Coders, we should call this Impractical Deep Learning for Coders, in the sense that we are certainly not going to be spending all of our time seeing exactly how to do important things with deep learning. Instead, we'll be doing a whole lot of fun things — generative modeling fun things — and also a whole lot of understanding of details, which you won't necessarily need to know to use this stuff. But if you want to become a researcher, or if you want to put something into production which has some kind of complex customization requirements, stuff like that, then it's going to be very helpful to learn the details we'll be talking about. So here in lesson nine, there are going to be two parts. One is a quick-ish run-through of using Stable Diffusion, because we're all dying to play with it, right? And the other thing I'll be doing is describing in some detail what's going on — how is it working? There'll be a whole lot of hand-waving either way, because it's going to take us a few lessons to describe everything from scratch. But hopefully you'll come away from this lesson with at least a reasonable, intuitive understanding of how this is all working. Assumptions: well, I'm going to try to explain everything. But if you haven't done deep learning before, this is going to be very hard. 
But I will at least be trying to say this is roughly what's going on and where you can find out more. Having said that, I would strongly suggest doing part one before doing this course, unless you really want to throw yourself in the deep end and give yourself quite the test. If you haven't done part one of Practical Deep Learning for Coders, but you're reasonably comfortable with deep learning basics — you could write a basic SGD loop in Python, you know how to use ideally PyTorch (but TensorFlow is probably okay as well), and you know the basic idea of what an embedding is and could create one from scratch, stuff like that — you'll probably be fine. Generally speaking, for these courses I find most people tend to watch the videos a few times, and often the second time through, folks will pause and look up things they don't know and check things out. Generally speaking, we expect people to spend about 10 hours of work on each video. Having said that, some people spend a hell of a lot more and go very deep. Some people will spend a whole year on sabbatical studying Practical Deep Learning for Coders in order to really fully understand everything. So really it's up to you as to how deep you go. Okay, so with that said, let's jump into it. As I said, in the first part we're going to be playing around with Stable Diffusion, and I tried to prepare this as late as possible so it wouldn't be out of date. Unfortunately, as of 12 hours ago, it is now out of date. And this is one of the big issues with the bit I'm about to describe — how to play with Stable Diffusion and exactly how the details work — which is that it's moving so quickly that all of the details I'm going to describe to you today, and all of the software I'm going to show you today, will have changed by the time you watch this. So what is it we're up to now? It's the 11th of October. 
So if you're watching this in, like, December of 2022, or in 2023, the details will have changed. What's happened in the last 24 hours is that two papers have come out. What I was going to tell you today is that, for example, to run a Stable Diffusion generative model, the number of steps required has gone down from 1,000 to about 40 or 50. But as of last night, a paper has just come out saying it's now down to four, and is 256 times faster. And another paper has come out with a separate — I think orthogonal — approach which makes it another, let's see, 10 to 20 times faster. So things are very exciting; things are moving very quickly. Now, having said that, don't worry, because after this lesson we're going to be going from the foundations, which means we're going to be learning how all these things are built up, and those don't change much at all. In fact, a lot of what we'll be seeing is extremely similar to another course we did in 2019, because the foundations don't change. And once you know the foundations, when you see these kinds of details in these papers, you'll be like: oh, I see, they did all these things the same way as usual, and then they made this little change. That's why we do things from the foundations — so that you can keep up with the research, and do your own research, by taking advantage of this foundational knowledge which all these papers are building on top of. So anyway, I guess I should apologize that even as I record this, the notebook is now one day out of date. Now, in part one you might remember we saw this stuff from DALL-E 2 — illustrations of Twitter bios — which really are pretty cool. The cool thing is that we're now at a point where we can build this stuff ourselves and run this stuff ourselves. We won't actually be using that particular model, DALL-E 2; we'll be using a different model, Stable Diffusion, which has very similar kinds of outputs. But we can go even further now. 
So one of our wonderful alumni, Alon, recently started a new company called — I'm not sure how you say it — Strimmer, where you can use something we'll be learning about today called DreamBooth to put any object, person, whatever into an image. He was kind enough to do a quick DreamBooth run for me and created these various pictures of me using his service. So here's a fun service you can try. One crazy one he tried was me as a dwarf, which I've got to say actually worked pretty well — this half looks like me, I reckon, and the bottom bit is the dwarf version. So thank you, Alon, and congratulations on your great progress since completing the fast.ai course. I love it. Something that's a bit different about this course compared to previous ones is that this is no longer just a me thing. Because this is moving so quickly, I've needed a lot of help to even vaguely get up to date and stay up to date. So everything I'll be showing you today is very heavily influenced by extremely high levels of input from these amazing folks, all of whom are fast.ai alumni. Jonathan Whitaker, who I saw in our chat, was basically the first guy to create detailed educational material about Stable Diffusion, and has been in the generative model space for a long time — well, by Stable Diffusion standards, I guess. Wasim has been an extraordinary contributor to all things fast.ai. Pedro came to San Francisco the last time we did a part two course, in 2019, took what he learned there and made his amazing Camera+ software dramatically better, and had it highlighted by Apple for the extraordinary machine learning stuff he added. He's now at Hugging Face, working on the software we'll be using a lot: Diffusers. And then Tanishq, who everybody in the fast.ai community probably already knows, is now at Stability AI working on stable diffusion models; his expertise is particularly in medical applications. 
So really, folks from pretty much all the key groups around Stable Diffusion are working on this together. You'll also find that some of these folks have recorded additional videos going into more detail about some of these areas, which you'll find on the course website. So make sure you go to course.fast.ai to get all the information about all the materials you need to take full advantage of this. Every lesson has links to notebooks and to details and so forth. If you want to go even deeper, head over to forums.fast.ai, into the Part 2 2022 category, hit the "about the course" button, and you'll find that for every lesson there's a chat with even more stuff. So look at this carefully to see all the things that I and the community have provided to help you understand this video. And also check out the questions and answers underneath to see what people have talked about. The chats can get a bit overwhelming, so once they get big enough you'll see there's a "summarize" button you can click to see just the most-liked parts. That can be very helpful. Okay, so those are all important resources for getting the most out of this course. Now, compute. Completing part two requires quite a bit more compute than part one, and compute options are changing rapidly. To be honest, the main reason for that is the huge popularity of Stable Diffusion. Everybody has taken to using Colab for Stable Diffusion, and Colab's response has been to start charging by the hour for most usage. So you may well find, if you're a Colab user — and we still love Colab — that they stop giving you decent GPUs, and if you want to upgrade, they limit quite a lot how many hours you can use. At the moment, yeah, still try Colab; they're pretty good — for free you get some decent stuff. But I would strongly suggest also trying out Paperspace Gradient. 
You can pay, like, $9 a month to get some pretty good GPUs there at the moment, or pay a bit more to get even better ones. But again, this is all going to change a lot — maybe people will pile onto Paperspace Gradient and they'll have to change their pricing too, I don't know. So check course.fast.ai to find out what our current recommendations are. Lambda Labs and Jarvis Labs are also both good options. Jarvis was created by an alum of the course and has some really fantastic options at a very reasonable price; a lot of fast.ai students use them and love them. And also check out Lambda Labs, who are the most recent provider on this page, and are rapidly adding new features. The reason I particularly wanted to mention them is that, at least as I say this — which is early October 2022 — they're the cheapest provider of the kind of big GPUs you might want to use to run serious models. So they're absolutely well worth checking out. But as I say, this could all have changed by the time you watch this, so go and check course.fast.ai. Also, at the moment — in 2022 — GPU prices have come down a lot, and you may well want to consider buying your own machine at this point. Okay. So what we're now going to do is jump into the notebooks. There's a repo we've linked to called diffusion-nbs. This isn't the main course notebooks — it's not the from-the-foundations notebooks — it's just a couple of notebooks you might want to play with, a bit of fun stuff to try out. One of the interesting things here is that Jonathan Whitaker — who I tend to call Jono, so if I say Jono, that's who I'm referring to — has created this really interesting thing called SuggestedTools.md, which hopefully he'll keep up to date, so even if you come here later, this will still be current. Because he knows so much about this area, he's been able to pull out some of the best stuff out there for just starting to play. 
And I think it's actually important to play, because that way you can really understand what the capabilities and the constraints are. Then you can think about: well, what could you do with that? And what kind of research opportunities might there be? So I'd strongly suggest trying out these things. The community on the whole has moved towards making things available as Colab notebooks. So if I click, for example, on this one, Deforum — and they often have this kind of hacker aesthetic around them, which is kind of fun — what happens is they add lots and lots of features, and you can basically just fill in this stuff to try things, and they often have a few examples. You can go to the Runtime menu and choose "Change runtime type" to make sure it says GPU, check what kind of GPU you've got, and start running things. Now, a lot of the folks who use this stuff honestly have no idea what any of these settings mean. By the end of the course, you'll know what pretty much all of them mean, and that will help you make great outputs and things like this. But you can create great outputs just using more of an artisanal approach — there's lots of information online about what kinds of things to try. So anyway, check out this stuff from Jono. He also links to this fantastic resource from pharmapsychotic, which is a rather overwhelming list of things to play with. Again, maybe by the time you watch this, it will all have changed, but I just want you to know these kinds of things are out there — basically ready-to-go applications you can start playing with. So play a lot. What you'll find is that most of them, at least at the moment, expect you to input some text saying what you want to create a picture of. It turns out — and we'll learn in detail why — that it's not very easy to know what text to write, and that gives kind of interesting results. 
At the moment, it's quite an artisanal thing to understand what to write. The text you write is called the prompt, and the best way to learn about prompts is to look at other people's prompts and their outputs. Perhaps the best way to do that right now is Lexica, which has lots and lots of really interesting AI artworks. You can click on one and see what prompt was used. You'll see that generally you start with what you want a picture of and what the style is, and then the trick is to add a bunch of artists' names, or places where people put art, so that the algorithm will tend to create a piece which matches art that tends to have these kinds of words in its captions. So that's a really useful trick for getting good at this. You can even search for things — so, I don't know if they have teddy bears; let's try. There we go. Probably not that one... that's a pretty good teddy bear image. So you can get some sense of how to create nice teddy bear images. It's so cute — I know what I'm going to be showing my daughter tomorrow. And you can see the prompts often have similar kinds of stuff added to try to encourage the algorithm to give good outputs. Okay, so by the end of this course you'll understand why this is happening — why these kinds of prompts create these kinds of outputs — and also how you can go beyond just creating prompts to actually building really innovative new things with new data types. So let's take a look at the diffusion-nbs repo. The first thing we'll look at is stable diffusion. A couple of options here: you can clone this repo, which is linked from both course.fast.ai and the forum, and run it on, say, Paperspace Gradient or your own machine or whatever. Or you can head over to Colab, say GitHub, and paste in the link to it directly from GitHub. 
Okay, so I'm running it on my own machine, and this notebook was largely built thanks to the wonderful folks at Hugging Face. Hugging Face have a library called Diffusers. Any of you who have done part one of the course will be very familiar with Hugging Face — we used a lot of their libraries in part one. Diffusers is their library for doing Stable Diffusion and things like Stable Diffusion. These things are changing a lot, but at the moment this is our recommended library for this stuff, and it's what we'll be using in this course. Maybe by the time you watch this there'll be lots of other options, so again, keep an eye on course.fast.ai. In general, Hugging Face have done a really good job of being at, and staying at, the head of the pack around models for deep learning, so it would not be surprising if they continued to be the best option for quite a while. And the basic idea of any such library is going to look pretty similar. To get started playing with this, you will need to log in to Hugging Face. So go to Hugging Face, create a username and a password, and then log in here. Once you've done it once, it'll be saved on your computer, so you won't have to log in again. The thing we're going to be working with is pipelines — in particular, the Stable Diffusion pipeline. You might be using different pipelines by the time you watch this, but the basic idea of a pipeline is quite similar to what we call a Learner in fastai: it's got a whole bunch of things in it — a bunch of processing, and models, and inference — all happening automatically. And just like you can save a Learner in fastai, you can save a pipeline in Diffusers. Something you can do in pretty much all Hugging Face libraries that you can't do in fastai is you can then save a pipeline, or whatever, back up into the cloud, onto what Hugging Face call the Hub. 
So then when we say from_pretrained, it's a lot like how we create pre-trained Learners in fastai, but the thing you put here, if it's not a local path, is actually a Hugging Face repo. If we search Hugging Face for this, you can see this is what it's going to download. And you can save your own pipelines up to the Hub for other people to use, which I think is a very nice feature that helps the community build stuff. The first time you run this, it's going to download many gigabytes of data from the internet. This is one of the slight challenges with Colab: every time you use Colab, everything gets thrown away and you start from scratch, so it'll all have to be downloaded every time. If you use something like Paperspace — or particularly Lambda Labs — it will all be saved for you. Once you've downloaded everything, it saves a whole bunch of stuff into .cache in your home directory; that's where Hugging Face puts things. So now that we have a pipeline called pipe, we can treat it as if it's a function — that's pretty common for PyTorch-y stuff and fastai stuff, so hopefully you're very familiar with this — and we can pass it a prompt, which is just some text. That's going to return some images; since we're only passing one prompt, it's going to return one image, so we'll just index into .images. When we run it, it takes maybe 30 seconds or so and returns a photograph of an astronaut riding a horse. Every time you call a pipeline using the same random seed, you'll get the same image. You can set the random seed manually, so you could send it to somebody else and say: oh, this is a really cool astronaut riding a horse I found — try manual seed 1024 — and they'll get back this particular astronaut riding a horse. So that's the most basic way to get started, running on Colab or on your own machine. 
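As a rough sketch of what those Diffusers calls look like in code — this assumes the diffusers library is installed and a CUDA GPU is available, and the model id below is the one current around when this lesson was recorded, so it may well have changed by the time you read this:

```python
# Sketch of the Diffusers usage described above. The heavy pipeline work is
# inside a function (and diffusers is imported lazily) so nothing is
# downloaded just by reading or importing this file.
import torch

def generate(prompt, seed=1024, steps=50,
             model_id="CompVis/stable-diffusion-v1-4"):
    # Lazy import: from_pretrained downloads multiple gigabytes of weights
    # the first time, then caches them under ~/.cache.
    from diffusers import StableDiffusionPipeline
    pipe = StableDiffusionPipeline.from_pretrained(model_id).to("cuda")
    # Same seed -> same starting noise -> same image, which is why
    # "try manual seed 1024" is shareable with somebody else.
    g = torch.Generator("cuda").manual_seed(seed)
    return pipe(prompt, num_inference_steps=steps, generator=g).images[0]

# The seeding is what makes results reproducible: two generators with the
# same seed produce identical random noise.
g1 = torch.Generator("cpu").manual_seed(1024)
g2 = torch.Generator("cpu").manual_seed(1024)
same_noise = torch.equal(torch.randn(4, generator=g1),
                         torch.randn(4, generator=g2))
```

The function and its defaults here are illustrative, not the notebook's exact code, but the pipeline calls themselves are the standard Diffusers API.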
You can start creating images. As I said, it takes 30 seconds or so, and in this case it took 51 steps. What it's doing is very different from what we're used to with inference in fastai, where it's one step to classify something, for example. What it's doing in these 51 steps is — well, here's an example we're going to create ourselves later in the course, of generating handwritten digits; this is actually an image from a later notebook we'll be building. It basically starts with random noise, and at each step it tries to make the image slightly less noisy and slightly more like the thing we want. Going down here shows all the steps to create these first four digits, for example — or here, to create this first one. If you look closely, you can kind of see in this noise something that looks a bit like a one, and so it kind of decides to focus on that. That's how these diffusion models basically work. Remember, if you're having any trouble finding the materials we're looking at, go to course.fast.ai or the forum topic to see all the links. This repo is called diffusion-nbs, and the notebook is called — you can see it at the top — stable diffusion. Now, a question might be: well, why don't we just do it in one go? We can try to do it in one go, but it doesn't do a very good job; these models aren't — as I speak now, in October 2022 — smart enough to do it in one go. As I mentioned at the start, the fact that I'm doing it in 51 steps here is hopelessly out of date, because as of yesterday, apparently we can now do it in three to four steps. I'm not sure if that code is available yet, so by the time you see this, it might all be dramatically faster. But as I'll be describing, I'm pretty confident that understanding this basic concept is going to be very important, like, forever. So we'll talk about that. 
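The step-by-step denoising idea can be illustrated with a deliberately toy loop: pretend we have a perfect model that always knows exactly what the "noise" is — here, simply the difference between the current image and a fixed target. The model, target, and step size below are all invented for illustration; a real diffusion model estimates the noise with a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((28, 28))        # stand-in for "a handwritten digit"
x = rng.standard_normal((28, 28))    # start from pure random noise

def fake_noise_pred(img):
    # A pretend model: it "knows" the noise is whatever separates the
    # current image from the target. Real models only estimate this.
    return img - target

for step in range(51):               # mirrors the ~50 steps in the lesson
    x = x - 0.2 * fake_noise_pred(x) # remove a little of the predicted noise

# After enough small steps, the noise has been washed out and we've
# essentially landed on the target image.
max_err = np.abs(x - target).max()
```

The reason real models need many small steps rather than one big one is exactly the point made above: their noise predictions are imperfect, so a single giant jump "doesn't do a very good job".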
So if we do 16 steps instead of 51 steps, it looks a bit more like it, but it's still not amazing. Okay, so that's how you can get started. Now I'll show you a few things that you can tune. And I should remind you that most of the stuff I'm showing here was built by Pedro Cuenca and the other folks at Hugging Face, so huge thanks to them. There's no way I could have been as up to speed with all this detail without their help — they built this library, Diffusers, and have done a fantastic job of showing what you can do with it. So let's look at an example. We're just going to quickly define a little function here to create a grid of images; the details don't matter. What we do want to show is that you can take your prompt, which was "an astronaut riding a horse", and create four copies of it — "times", applied to a list, simply copies the list that many times, so here's a list of the exact same prompt four times. Then we're going to pass the prompts to the pipeline, using a new parameter called guidance scale. We'll be learning about guidance scale in detail later in the course, but basically what it says is: to what degree should we be focusing on the specific caption, versus just creating an image? We're going to try a few different guidance scales: 1, 3, 7, and 14. Generally, 7.5 is the default at this stage, I believe — that might have changed by the time you watch this. Each row here is a different guidance scale. You can see in the first row it hasn't really listened to us very much at all: these are very weird-looking things, and none of them really look like astronauts riding a horse. At a guidance scale of 3, they look more like things riding horses that might be astronaut-ish. And at 7.5, they certainly, on the whole, look like astronauts riding a horse. 
At 14 or 15, they certainly look like that, but they can get a little bit too abstract sometimes. I have a pretty strong feeling there are some slight problems with how this is coded, or with how the algorithm works, which I'll be looking at during this course — maybe by the time you see this, some of these will look a bit better. I think basically what's happening is that it's overshooting a bit at these high guidance scales. Anyway, the basic idea of what guidance is doing is this: for every single prompt, it actually creates two versions — one version of the image with the prompt "an astronaut riding a horse", and one version of the image with no prompt at all, so just some random thing. Then it takes, basically, the average of those two things. That's what guidance scale does, and you can think of the guidance scale as being a bit like a number that's used to weight the average. There's something very similar you can do where, again, you get the model to create two images, but rather than taking the average, you ask it to effectively subtract one from the other. Here's something Pedro did: using the prompt "a Labrador in the style of Vermeer", he said, well, what if we then subtract the model's output for just the caption "blue"? You can pass this thing called negative prompt to Diffusers, and what it will do is take the prompt — in this case "Labrador in the style of Vermeer" — effectively create a second image responding just to the prompt "blue", and effectively subtract one from the other. The details are slightly different from that, but that's the basic idea, and that way we get a Labrador in the style of Vermeer, minus the blue. So that's the basic idea of how to use negative prompts, and you can play with that — good fun. 
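Both the guidance-scale weighting and the negative-prompt subtraction just described boil down to simple arithmetic on two noise predictions per step. This is only the conceptual version — as noted, the exact formula inside Diffusers differs slightly — with random vectors standing in for real model outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for the model's two noise predictions at one denoising step:
cond = rng.standard_normal(8)    # prediction given "an astronaut riding a horse"
uncond = rng.standard_normal(8)  # prediction given the empty prompt

def guided(uncond_pred, cond_pred, g):
    # Classifier-free guidance: start from the unconditional prediction and
    # move g times along the direction the prompt suggests. Larger g listens
    # to the caption more; too large and it overshoots, as seen at 14-15.
    return uncond_pred + g * (cond_pred - uncond_pred)

# A negative prompt just swaps the empty prompt for, say, "blue": we now
# move away from the negative prompt's prediction instead of away from
# "no prompt at all".
neg = rng.standard_normal(8)     # prediction given the prompt "blue"
result = guided(neg, cond, 7.5)
```

With g = 1 this returns the conditional prediction unchanged, and with g = 0 it ignores the prompt entirely, which is why the first row of the grid barely listens to the caption.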
Here's something else you can play with: you don't have to just pass in text — you can actually pass in images. For this you'll need a different pipeline: an image-to-image pipeline. With the image-to-image pipeline, you can grab a rather sketchy-looking sketch and pass this img2img (image-to-image) pipeline an initial image to start with. What this is going to do is: rather than starting the diffusion process with random noise, it's basically going to start it with a noisy version of this drawing. Then it's going to try to create something that matches the caption and also follows this starting point. As a result, you get things that look quite a lot better than the original drawing, but you can see the composition is the same. So using this approach you can construct things that match the particular kind of composition you're looking for, which I think is quite a nifty approach. This parameter, strength, says to what degree you want to create something that really looks like the initial image, versus to what degree you want the model to be able to try out different things. Now here's where things get interesting — and this is the kind of stuff you're not going to be able to do at the moment with just the basic GUIs, but you can if you really know what you're doing. What we could do now is take these output images and say: oh, this one's nice — sorry, this one — let's make this the initial image, and now say "an oil painting of this by Van Gogh", and pass in the same thing here with a strength of 1. And actually, that pretty much worked, which I think is absolutely fascinating — something I haven't seen before, which Pedro put together this week by combining simple Python code. So you can play with that. Something else you can do — this example actually came from the folks at Lambda Labs — and we 
won't be going into it in detail right now, because it's basically exactly like what we've done a thousand times in fastai: you can take the models in that pipeline and pass in your own images and your own captions. So what happened here — I think this was Justin, if I remember correctly; yes, Justin at Lambda — is he created a really cool dataset by grabbing a Pokemon dataset with almost a thousand images. Then — and this is really neat — he used an image captioning model to automatically generate captions for each of those images, and fine-tuned the Stable Diffusion model using those image-and-caption pairs. Here's an example of one of the captions and one of the images. He then took that fine-tuned model and passed it prompts like "girl with a pearl earring" and "cute Obama creature", and got back these super nifty images that reflect the fine-tuning dataset he used, while also responding to the prompts. Here's another example of something you can do. Fine-tuning can take quite a bit of data and quite a bit of time, but there are some special kinds of fine-tuning. One is called textual inversion, where we fine-tune just a single embedding. For example, we can create a new embedding where we're trying to make things that look like these example pictures. We give this concept a name — oh, I hate these things, go away; never mind, here we are — we're going to call it "watercolor portrait", so that's the embedding name we're going to use. We can then add that token to the text model and train the embedding for it so that it matches the example pictures we've seen. This is going to be much faster, because we're just training a single token on, in this case, just four pictures. And so when we do that, we can then 
say, for example, "woman reading in the style of..." followed by the token we just trained, and as you see, we get back a kind of novel image, which I think is pretty interesting. Another example, very similar to textual inversion, is something called DreamBooth, which, as mentioned here, takes an existing token — but one that isn't used much, like, say, "sks"; almost nothing uses "sks" — and fine-tunes a model to bring that token, as it says here, close to the images we provide. So what Pedro did here was grab some pictures of me and use "painting of sks" — in this case he's fine-tuned this token to mean Jeremy Howard — "in the style of Paul Signac", and there they are. The example I showed earlier of the dwarf Jeremy Howard — that service, Strimmer, is actually using this DreamBooth approach. So here's how you can try that yourself. Okay, so that is part one of this lesson: how to get started playing around with Stable Diffusion. In part two, we're going to talk about what's actually going on here from a machine learning point of view. So we'll come back in about seven minutes to talk about that. See you in about seven minutes. Okay, welcome back, folks. I just thought I'd share with you one more example of textual inversion training. This is my daughter's teddy, Tiny, who as you can see is grossly misnamed. Pedro and I tried to create a textual inversion version of Tiny, and I was trying to get Tiny riding a horse. It's interesting: when I tried to do that — this top row here is actually Pedro's example when he ran it, showing the steps as it was training, trying to use the caption "tiny riding a horse" — it never actually ended up generating Tiny riding a horse. Instead, it ended up generating a horse that looks a little bit like Tiny. Then we tried to get Tiny sitting on a pink rug, and after a while it actually did make some progress there, though it doesn't quite look like Tiny. One thing 
Pedro did differently from me: he started with the embedding for "person", while mine actually started with the embedding for "teddy", and his worked a bit better. But as you see, there are problems, and we'll understand where those problems come from as we talk about how this is trained in the rest of this lesson. Okay. I'm going to be relying here on some understanding of the basic idea of how machine learning models are trained, so if you start getting a bit lost at any point, you might want to go back to part one, and then come back to this once you're un-lost. Now, the way Stable Diffusion is normally explained is focused very much on a particular mathematical derivation. We've been developing a totally new way of thinking about Stable Diffusion, and I'm going to be teaching you that. It's mathematically equivalent to the approach you'll see in other places, but what you'll realise and discover is that it's actually conceptually much simpler. Also, later in this course, we'll be showing you some really innovative directions this can take you when you think of it in this new way. All of which is to say: when you listen to this and then go and look at some blog post and it looks like I'm saying something different, just keep in mind that I'm not saying something different — I'm expressing it in a different way, but it's equally mathematically valid. What I'm going to do is start by imagining that we were trying to generate something much simpler: handwritten digits. So it's like the Stable Diffusion for handwritten digits. We're going to start by assuming there's some API, some web service or whatever, out there — who knows how it was made — that does something pretty nifty, which is that you can take an image of a handwritten digit and pass it over to this web API, to this REST 
endpoint or whatever. It's just a black box as far as we're concerned, and it's going to spit out the probability that the thing you passed in is a handwritten digit. So let's say this image is called x1; the probability that x1 is a handwritten digit, it might say, is 0.98. And so then you pass something else into this magic API endpoint, which looks like this. You pass that in, and that looks a little bit like an 8, I guess, but it might not be; pass it into this API and you see what happens. This is x2, and it says the probability that x2 is a digit is 0.4. Okay, now we pass in our image x3 into our magic API, and it returns the probability that x3 is a handwritten digit: pretty small. So why is this interesting? Well, it turns out that if you have a function (let's not call it an API, let's call it f, some function, but it's behind some API REST endpoint or whatever), if you have this function, we can actually use it to generate handwritten digits. So that's something pretty magical, and we're going to see how on earth you would do that. If you have this function which can take an image and tell you the probability that it is a handwritten digit, how could you use it to generate new images? Well, imagine you wanted to turn this mess into something that did look like a digit. Here's something you could do. Let's say it's a 28 by 28 image, which is, what is it, 784 pixels. And we could pick one of these pixels and say: what if I make this pixel a little bit darker? And then we could pass that image through f and see what happens to the probability that it's a handwritten digit. For a specific example, handwritten digits don't normally have any dark pixels in the very bottom corners, so if we took this pixel here and said, what would happen if we made it a little bit lighter, and then we passed that exact image through f, the probability would probably go up a tiny bit, for example.
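To make the pixel-nudging idea concrete, here is a toy sketch. The `f` below is a hypothetical stand-in for the magic black-box endpoint (the real one is unknown to us); it just rewards ink near the centre and penalises ink in the corners, so we can watch the "probability" move when we nudge one pixel.

```python
import numpy as np

# Hypothetical stand-in for the magic "is this a digit?" endpoint.
# The real f is a black box; this one only exists so we can watch
# the probability move when we nudge a single pixel.
def f(x):
    score = x[10:18, 10:18].sum() - x[:3, :3].sum() - x[-3:, -3:].sum()
    return 1 / (1 + np.exp(-score))        # squash the score into a probability

x3 = np.random.rand(28, 28) * 0.1          # a noisy 28x28 "mess"

p_before = f(x3)
x3_nudged = x3.copy()
x3_nudged[14, 14] += 0.5                   # darken one centre pixel a little
p_after = f(x3_nudged)
# darkening a centre pixel made the image "more digit-like": p_after > p_before
```

Doing this one pixel at a time, for all 784 pixels, is exactly the procedure the lesson describes next.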
So now we've got an image which is slightly more like a handwritten digit than before. Also, digits generally contain straight lines, so this pixel here probably makes sense to be darker; if we made a slightly darker version of this pixel and sent it through f, that would also increase the probability a little bit. And so we could do that for every single pixel of the 28 by 28, one at a time, finding out which ones, if we make them a little bit lighter, make it more like a handwritten digit, and which ones, if we make them a little bit darker, make it more like a handwritten digit. What we've just done is calculate the gradient of the probability that x3 is a handwritten digit with respect to the pixels of x3. Now notice that I didn't write d p(x3) / d x3, which you might be familiar with from high school, and the reason is that we've calculated this for every single pixel, and when you do it for lots of different inputs, you have to turn the d into this symbol, called "del" or "nabla", which means there are lots of values here. So this thing contains lots of values: how much does the probability that x3 is a digit increase as we increase this pixel value, and this pixel value, and this pixel value? For a 28 by 28 input there are 784 pixels, which means that this thing has 784 values. Okay, I totally messed up the notation there. I did think about going back and re-recording it, but then I thought, maybe instead, as penance for my failure to get the notation right, I should record a little section describing the notation in more detail, both for myself, so I don't make the mistake again, and for the rest of you, so you understand exactly what's going on. I think it's actually pretty worthwhile, because this notation does come up a lot, and I've been regularly butchering it in talks and notes for years now, so it's about time I got it right. I should mention I have absolutely no excuse for
butchering the notation like I have, on the basis that my friend Terence and I actually wrote a 30 or 40 page tutorial on matrix calculus, and in that paper he described everything I'm going to show you here. Having said that, you certainly don't need to read this rather lengthy tutorial; I'm going to explain the key stuff that I think is worth knowing, and then you'll understand the mistake that I made during the lesson. Maybe let's start with reminders of stuff that hopefully you did at high school. Let's create a 2D version here and maybe a 3D version as well. Say, for example, we've got a quadratic that looks something like this, an equation such as y = x squared, and we might endeavour to identify the slope at some exact moment, like here, the slope at this exact moment. This line is called the tangent, and the slope of the tangent is the derivative of the function. (Let's just try to make this look a bit more like a y than an x; maybe I write it like this.) So the derivative of a function we can write in a few ways, but one way is dy/dx, and there are rules we can use to calculate it analytically. For x squared, the rule is that it's 2x: you basically move the index out to the front to calculate the derivative. Another way of writing y = x squared is f(x) = x squared, and another way of writing the derivative is f prime of x, for example. So this is all stuff that hopefully you at least vaguely remember from high school. Now, functions are not necessarily of just one variable. Here we've got x and y, but there could be functions of two variables, x and y, giving z. So we could have functions which could be, for example, a 3D parabola with this kind of curvature. And you can still find the derivative.
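The single-variable notation just recapped can be written out compactly; this is the standard high-school and multivariable notation, restated:

```latex
% single-variable: equivalent ways to write the derivative of y = x^2
y = x^2, \qquad \frac{dy}{dx} = 2x, \qquad f(x) = x^2,\ f'(x) = 2x

% two variables: z = f(x, y); partial derivatives change one input at a time
z = f(x, y), \qquad
\frac{\partial z}{\partial x}\ \text{(holding } y \text{ fixed)}, \qquad
\frac{\partial z}{\partial y}\ \text{(holding } x \text{ fixed)}
```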
If you think about it, one way to get a derivative would be exactly what we did before: we say, as we change x, how does z change? That would be this slope again. But you could also ask: as we change y, how does z change? That would be like rotating this whole thing around by 90 degrees and then doing the same thing. It's a little bit trickier now, because we've got a function of x and y, and so we could calculate the derivative with respect to just one of these things, or both of them. What we do here is write this little partial symbol, and we can then say: this is how our output z changes as we change one thing at a time, in this case just x. There's another value, which is how it changes as we change just y, one thing at a time. Those are two separate numbers we could calculate at some particular point on this surface, and these things are called partial derivatives. So in our case, we've got a 28 by 28 pixel image, which might be, for example, the number 7 made of 28 by 28 pixels. The pixels would be something like this, and then down here. And in our case, we've got a situation where we've got some loss, and our loss is calculated as some function of both some weights in a neural network as well as some pixel values, such as the pixels in this number 7. And actually the way it would work is these would be shaded; this is more like what MNIST looks like, right? (My pixels, aren't they terrible.) So the loss would be calculated, more specifically, as the MSE, the mean squared error, of the actual answer, like which digit it should be, let's call that y, minus the predicted y, which is some neural network with some weights and our pixels. That would be delving in one layer more deeply, but none of these details really matter too much: what the loss function is, or the neural network, or whatever. It's just some function that's
calculating loss. So let's get rid of all that, and we can now say: what happens to the loss as we change x? Because we want to change x in a way that makes the loss go down. But there isn't just one x. (Oh, I wrote this as 7 by 7; it's actually meant to be 28 by 28, but let's just do a simple 7 by 7 version.) Here we've got 7 pixels by 7 pixels, a super low resolution 49-pixel image, so there are 49 different things we could change: we could make each of these pixels darker or lighter. So let's take, for example, a pixel; maybe we can write it as pixel(1,1). We could say: what happens to the loss as we change pixel(1,1)? So we can calculate a derivative, and it's going to be a partial derivative: how does the loss change as we change pixel(1,1)? That would be a very useful thing to know, because it would tell us whether we need to make pixel(1,1) a bit brighter or a bit darker in order to improve the loss. And we could also calculate that for pixel(1,2), pixel(1,3), and so forth, for all 49 pixels in this super low resolution digit. So that's the first thing we can calculate; the second thing could be the partial derivative of the loss with respect to pixel(1,2), the slope as we change pixel(1,2). I'm not going to write all of them, but there will be 49 of these for this 7 by 7 image. Rather than writing out all 49 of those, it's nice to write them all at once, and you can do that like so: you write upside-down-triangle, subscript x, loss. What that means is a vector of all of these derivatives, and this upside-down triangle is called either "del" or "nabla"; it's just a convenient notational shortcut to avoid writing all these out. The x here is telling you the thing that we're basically putting on the bottom, right: what it's with respect to, the direction that we're trying to go in.
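The shorthand being described is the gradient vector: one partial derivative per pixel, collected with the nabla. In standard notation (for the 7 by 7 example, so 49 entries):

```latex
\nabla_{x}\,\mathrm{loss} \;=\;
\left(
\frac{\partial\,\mathrm{loss}}{\partial\,\mathrm{pixel}_{1,1}},\;
\frac{\partial\,\mathrm{loss}}{\partial\,\mathrm{pixel}_{1,2}},\;
\ldots,\;
\frac{\partial\,\mathrm{loss}}{\partial\,\mathrm{pixel}_{7,7}}
\right)
\qquad \text{(49 entries; 784 for a } 28 \times 28 \text{ image)}
```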
So that's actually what I should have written in the notes during the lesson. I wrote something else, which was basically the equivalent of writing this, and that's not a thing at all. So in my notes, when you see me write this, I actually mean this. Why does my brain get confused and write this weird thing that doesn't even exist, as far as I know? Well, the reason is that this thing does exist if you turn the triangles upside down, and hence my brain always gets confused. If I turn the triangles upside down, these triangles are now totally different: these triangles now mean a small change. So this is a small change in loss divided by a small change in, for example, one particular pixel, and that's a totally valid thing to say. In fact, if you make that small change small enough, then you end up with the derivative; that's what the derivative is. The derivative is just our classic rise-over-run slope that we did in, what is that, grade 8 or grade 9, once we make our step in x small enough. And so then we can do that for changing just one variable in a multi-variable function, so for example, in this case, changing one pixel value in an image to see how that impacts our loss. We could do the same thing for the weights: we could change one weight. If you just change one thing at a time and calculate the derivative of the loss against that one thing, then we get these things called partials, and if you then do it for all the different things that you could change, such as every pixel, you get this whole gradient vector, which we use the upside-down triangle, the nabla, to represent. And then finally, the right-way-up triangle, the delta, simply refers to a small change, so this would be a small change in loss caused by changing pixel(1,1) by a small bit, and if you use a small enough bit, an infinitely small bit, we call that the derivative. Okay, so with all that said, the net result is: every time you see me write this thing in the
notes, please throw it away in your head and replace it with this, and that is the moral of the story. Okay, so thank you very much for bearing with me as I do my penance to actually get this notation correct this time, and I will endeavour not to make the same mistake again during this course, but no promises. Okay. So those 784 values tell us how we can change x3 to make it look more like a digit. And so what we can then do is change the pixels according to this gradient, and we can do something a lot like what we do when we train neural networks, except instead of changing the weights in a model, we're changing the inputs to the model. So we're going to take every pixel and we're going to modify it: subtract a little bit times its gradient. We'll multiply the gradient by some constant, let's call it c, and then we're going to subtract that, to get some new image. With the new image, it's probably going to get rid of some of these bits at the bottom right, and it's probably going to add a few more bits between some of these here, and we've now got something that looks slightly more like a handwritten digit than before. And this is the basic idea. We can now do that again: we can take this, run it through f, and so we've now got something, let's call it x3 prime, the new version. The probability that x3 prime is a handwritten digit is quite a bit higher; I'd say it's probably like 0.2 maybe. And we can now do the same thing: we can say, for every pixel, if I increase its value a little bit or decrease its value a little bit, how does it change the probability that this new x3 prime prime is a digit? We'll get a new gradient here, 784 values, and we can use that to change every pixel to make it look a little bit more like a handwritten digit. So as you can see, if we have this magic function, we can use it to turn any arbitrary noisy input into something that looks like a
valid input, something that gets a high probability from that function, by using this derivative. A key thing to remember here is we're saying: as I change the input pixels, how does the probability that this is a digit change? And that tells me which pixels to make darker and which pixels to make lighter. Now, those of you who remember your high school calculus may recall that when you do this by changing one pixel at a time to calculate a derivative, it's called the finite differencing method of calculating derivatives, and it's very slow, because we have to call this function 784 times, once for every single pixel. But we don't have to use finite differencing. Assuming the folks running this magic API endpoint used Python, we can just call f.backward() and then get x3.grad, and that will tell us the same thing in one go, by using the analytic derivatives. We're not going to worry exactly what this .backward() does; we'll write everything from scratch, including our own calculus bits, later, but for now, just like we did in part one of the course, we're going to assume these things exist. So maybe the nice folks that provide this endpoint could actually provide a new endpoint that calls .backward() for us and gives us .grad, and then we don't really have to use f at all: we can instead just directly call this endpoint that gives us the gradient directly. We'll multiply it by this small constant c, we'll subtract it from the pixels, and we'll do that a few times, making the input get a larger and larger probability that it's actually a digit. So we don't particularly need the thing that calculates these probabilities at all; we only need the thing that tells us which pixels we should change to increase the probability. Okay, so that's great. The problem is, nobody's provided this for us, so we're going to have to write it.
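A hedged sketch of the point just made: if f is a differentiable PyTorch function, the 784-call finite-differencing loop collapses into one `backward()` call, and the "make it more digit-like" loop becomes a few lines. The `f` here is the same hypothetical centre-rewarding scorer idea, not the real endpoint, and `c` is the lesson's small constant.

```python
import torch

# Hypothetical differentiable stand-in for the "is this a digit?" scorer.
def f(x):
    score = x[10:18, 10:18].sum() - x[:3, :3].sum() - x[-3:, -3:].sum()
    return torch.sigmoid(score)

x3 = (torch.rand(28, 28) * 0.1).requires_grad_(True)
p_start = f(x3).item()

c = 0.5                                # the small constant from the lesson
for _ in range(10):
    p = f(x3)
    p.backward()                       # all 784 partials in one go, no finite differencing
    with torch.no_grad():
        x3 += c * x3.grad              # nudge every pixel to raise the probability
    x3.grad.zero_()

p_end = f(x3).item()
# note: the lesson frames this as subtracting c times a gradient; ascending
# the probability is the same idea with the sign of the objective flipped
```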
So how are we going to do that? Well, no problem: generally speaking in this course, when there's some magic black box that we want to exist and it doesn't exist, we create a neural net and we train it. So we want to train a neural net that tells us which pixels to change to make an image look more like a handwritten digit. Okay, here's how we can do that. We could create some training data and use that training data to get the information we want. We could pass in something that looks a lot like a handwritten digit, something that looks a bit like a handwritten digit, something that doesn't look very much like a handwritten digit, and something that doesn't really look like a handwritten digit at all. Now, you'll notice it was very easy for me to create these: I took real handwritten digits and then I just chucked random noise on top of them. It's a little bit awkward for us to come up with an exact score saying how much each of these is like a handwritten digit; it seems a bit arbitrary. So let's not do that. Let's use something which is kind of like the opposite, and instead say: why don't we predict how much noise was added? Because this number 7 is actually equal to this number 7 plus this noise, and this number 3 is actually equal to this number 3 plus this noise, and this number 6 is actually equal to this number 6 plus this noise, and that one's got a lot. And of course the very first one is equal to this number nine plus this noise. So why don't we generate this data, and then, rather than trying to come up with some arbitrary number of how much like a digit it is, let's say the amount of noise tells us how much like a digit it is. Something with no noise is very much like a digit, and something with lots of noise isn't much like a digit at all. So let's create a neural net. Who cares what the architecture is, right?
It's just a neural net of some kind. And this is critical to your understanding of this course at this point: we're going to go beyond the idea of worrying all the time about architectures and details. We're going to get to all those details, but the important thing for using this stuff well is to think about neural nets as being something that has some inputs, some outputs (oopsie-daisy), and some loss function which takes those two, and then the derivative is used to update the weights. Those are really the four things we care about. Now, the inputs to our model are these noisy digits; the outputs of our model are a measure of how much noise there is. So maybe we could say these are all basically normally distributed random variables with a mean of zero: in this case with a variance of zero, in this case with a variance of like 0.1, and this one's normally distributed random pixels, I guess, with a mean of zero and a variance of like 0.3, and this one's super noisy. So there's the mean for each one and the variance for each one. So why don't we use the variance as the output, to predict how much noise there is? Or, better still, why don't we predict the actual noise itself? So that's what we'll use: now we're not just predicting how much noise, we'll predict the actual noise. That's our outputs. And if we do that, our loss is going to be very simple. We took the input, passed it through our neural net, and tried to predict what the noise was, so the prediction of the noise is n-hat and the actual noise is n, and so we can do something we've done a thousand times: we can subtract one from the other, square it, sum all that up, and divide by the count, and this here is the mean squared error, which we use all the time.
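The training setup just described can be sketched in a few lines. This is a minimal sketch, assuming 28 by 28 single-channel digits scaled to [0, 1] and a `model` that is any image-to-image network (the whole point above is that the architecture doesn't matter yet); the per-image noise `amount` is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def noisy_batch(clean):
    # clean: (batch, 1, 28, 28) digit images
    noise = torch.randn_like(clean)               # n: the thing we'll try to predict
    amount = torch.rand(clean.size(0), 1, 1, 1)   # a different noise level per image
    return clean + amount * noise, noise

def noise_pred_loss(model, clean):
    noisy, noise = noisy_batch(clean)
    n_hat = model(noisy)                          # the predicted noise
    return F.mse_loss(n_hat, noise)               # the mean squared error from the lesson
```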
So with the mean squared error, we've now got inputs, which are noisy digits, and outputs, which are noise, and this neural network is trying to predict that noise. So we're basically jumping straight to the step that we had here. Remember, this is what we really wanted: we wanted some ability to know how much we have to change a pixel by to make it more digit-like. Well, to turn this number seven into this number seven (that's our goal), we have to remove all of that noise. So if we can predict the noise, then we've got exactly what we want: we can take the model's prediction, multiply it by a constant, and subtract it from our input, and if you subtract this noise from this input, you get this handwritten digit. So we're doing exactly what we wanted. Well, that seems easy enough; we already know from part one how to do this. So we just have any old neural network, some kind of convnet or something, that takes as input these numbers where we've randomly added different amounts of noise, lots of noise to some, not much noise to others. It predicts what the noise was that we added; we take the loss between the predicted output and the actual noise (the mean squared error), and we use that to update the weights. And so if we train this for a while, then if we pass this into our model, it will return that, and we're done. We now have something that can generate images. How? Because now we can take this trained neural network (I'm going to copy it down here) and pass it something very, very noisy: pure noise. We pass it to the neural net, and it's going to spit out information saying which part of that it thinks is noise, and it's going to leave behind the bits that look the most like a digit, just like we did back here.
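The train-then-update cycle just described is the standard loop from part one. A minimal self-contained sketch, with a tiny convnet standing in for whatever architecture you like and random tensors standing in for a real MNIST dataloader (both are illustrative stand-ins, not the real training setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# any old convnet that maps an image to a same-shaped noise prediction
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):                        # stand-in for epochs over real digits
    clean = torch.rand(16, 1, 28, 28)          # fake "digits" for this sketch
    noise = torch.randn_like(clean)
    amount = torch.rand(16, 1, 1, 1)
    noisy = clean + amount * noise
    loss = F.mse_loss(model(noisy), noise)     # predict the noise we just added
    opt.zero_grad()
    loss.backward()
    opt.step()
```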
So it might say: you know what, if you left behind just that bit, that bit, that bit and that bit, it's going to look a little bit more like a digit, and then maybe you could increase the values of that bit, that bit and that bit, and everything else is noise, so we subtract those noise bits, times some constant, and we're now going to have something that looks more like a digit, which is what we hoped for. And then we can just do it again, and you can see now why we are doing this multiple times. (Somebody on the chat is saying they don't see me drawing? Oh, you can see. Thanks Jimmy. I don't know, Michelangelo, what's happening for you. And to answer your earlier question about how I'm drawing: I'm using a graphics tablet, which I'm not very expert at, because on Windows you can just draw directly on the screen, which is why this is particularly messy.) All right, in practice, at the moment (this might change by the time you've watched this), we use a particular type of neural net for this, something that was developed for medical imaging called the U-Net. If you've done previous versions of the course, you will have seen this, and don't worry, in this course we'll see exactly how a U-Net works and we'll build them ourselves from scratch. This is the first component of stable diffusion: the U-Net.
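Once such a noise-predictor is trained, the "do it again" generation loop above is just: start from pure noise, subtract a little of the predicted noise, repeat. A minimal sketch, assuming `model` is a trained network mapping a noisy image to predicted noise; the fixed step constant `c` and step count are illustrative, much cruder than the real sampler:

```python
import torch

@torch.no_grad()
def generate(model, steps=50, c=0.1):
    x = torch.randn(1, 1, 28, 28)       # start from pure noise
    for _ in range(steps):
        pred_noise = model(x)           # which parts of x look like noise?
        x = x - c * pred_noise          # remove a little of it, then repeat
    return x
```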
Okay, so there are going to be a few pieces, and the details of why they're called these things don't matter too much just yet; just take my word for it, these are their names. The thing you do need to know for each piece is what its input is and what its output is. So, the input to a U-Net, or what does it do: the input to the U-Net is a somewhat noisy image, and when I say somewhat, it could be not noisy at all or it could be all noise. That's the input. The output is the noise, such that if we subtract the output from the input, we end up with the un-noisy image, or at least an approximation of it. So that's the U-Net. Now here's our problem: we have (oh, what do I keep forgetting, I should write that down) 28 times 28, 784 pixels in these things, and that's quite a lot. And it gets worse, because in practice we don't want to draw handwritten digits; the thing we'd be passing in here is beautiful high-definition photos, or images of great paintings. At the moment, the thing we tend to use for that is a 512 by 512 by 3 channel RGB image: 512 by 512 pixels, by 3 channels, red, green, blue. So that is 512 times 512 times 3, which is 786,432: we've got 786,432 pixel values in here. And this is, I don't know, some beautiful picture: this is my amazing portrait, Van Gogh style, in a dainty little hat, there we go. So this is the beautiful painting, or an image of it. That's a lot of pixels, and training this model, where we put noisy versions of millions of these beautiful images through, is going to take an awfully long time. If you're Google, with a huge cloud of TPUs or something, maybe that's okay, but for the rest of us, we would like to do this as efficiently as possible. How could we do this more efficiently? Well, when you think about it, in this beautiful picture I drew, storing the exact value of every single pixel is probably not the most efficient way to store it. What if instead we said, let's say this
is like green rushes or something; it might say, over here is green and everything underneath it is pretty much the same, or maybe I'm wearing a blue top in this beautiful portrait, and it could say, all the pixels in here are blue. You don't really have to do every one individually; there are faster, more concise ways of storing what an image is. We know this is true because, for example, a JPEG picture is far fewer bytes than the number you would get if you multiplied its height by its width by its channels. So we know that it's possible to compress pictures. Let me show you a really interesting way to compress pictures. Let's take this image and put it through a convolutional layer of stride 2. If we put it through a stride 2 convolutional layer with six channels, we would get back a 256 by 256 (gosh, that was a terrible attempt at drawing a square, wasn't it) by 6: we've doubled the number of channels to six. And then let's put it through another stride 2 convolution (and remember, we're going to see exactly how to do all these things and build them all from scratch, so don't worry if you're not sure what a stride 2 convolution exactly is) to get 128 by 128, and again let's double the number of channels, and then let's do it again, another stride 2 convolution. We're just building a neural network here. So now we're down to 64 by 64 by 24, and now let's put that through a few ResNet blocks to squish down the number of channels as much as we can, so it'll now be down to, let's say, 64 by 64 by 4. So here's a neural network, and the number of values in this version is now 64 times 64 times 4: 16,384. So there are 16,384 values here. We've compressed from 786,432 to 16,384, which is a 48 times decrease. Now, that's no use if we've lost our image. So can we get the image back again? Sure, why not.
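The stack of stride-2 convolutions just described can be sketched directly. The channel counts (3 to 6 to 12 to 24, squished down to 4) follow the lesson; the kernel sizes, activations, and the single final squeeze layer (standing in for the ResNet blocks) are illustrative choices, not the real VAE architecture.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 6, 3, stride=2, padding=1),    # 512x512x3  -> 256x256x6
    nn.SiLU(),
    nn.Conv2d(6, 12, 3, stride=2, padding=1),   # 256x256x6  -> 128x128x12
    nn.SiLU(),
    nn.Conv2d(12, 24, 3, stride=2, padding=1),  # 128x128x12 -> 64x64x24
    nn.SiLU(),
    nn.Conv2d(24, 4, 3, padding=1),             # squish channels: 64x64x24 -> 64x64x4
)

img = torch.randn(1, 3, 512, 512)               # 786,432 values in
z = encoder(img)                                # 16,384 values out: 48x smaller
```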
What if we now create a kind of inverse convolution which does the exact opposite? So, let's put it over here: we take our 64 by 64 by 4 image, put it through an inverse convolution back to 128 by 128 by 12, put it through another inverse convolution (these are all basically just neural network layers) to 256 by 256 by 6, and then finally all the way back up to 512 by 512 by 3. We could put this whole thing inside one neural net: here's our single neural network. And what we could do is start feeding it images; each one goes all the way through this neural network, and out of the other end comes back, well, initially it's random, so initially what comes out of this is random noise, 512 by 512. So now we need a loss function. The loss function we can create could be to say: let's take this output and this input, compare them, and do an MSE, a mean squared error, directly on those two pieces. So what would that do if we trained this model? This model is going to try to put an image through and make it so that what comes out the other end is the exact same thing that went in, because if it does that successfully, then the mean squared error will be zero. I see some people in the chat saying that this is a U-Net. This is not a U-Net, okay? We'll get to that later; there are no cross connections. It's just a bunch of convolutions that decrease in size followed by a bunch of convolutions that increase in size. And so we're going to try to train this model to spit out exactly what it received, and that seems really boring: what's the point of a model that only learns to give you back exactly what came in? Well, this is actually extremely interesting. This kind of model is called an autoencoder: it's something that gives you back what you gave it.
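The inverse half can be sketched the same way, with transposed convolutions walking the shapes back up (64x64x4 to 128x128x12 to 256x256x6 to 512x512x3, as in the lesson); again this is an illustrative sketch, not the real VAE, and the MSE at the end is the whole autoencoder's training loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.Sequential(
    nn.ConvTranspose2d(4, 12, 4, stride=2, padding=1),   # 64x64x4    -> 128x128x12
    nn.SiLU(),
    nn.ConvTranspose2d(12, 6, 4, stride=2, padding=1),   # 128x128x12 -> 256x256x6
    nn.SiLU(),
    nn.ConvTranspose2d(6, 3, 4, stride=2, padding=1),    # 256x256x6  -> 512x512x3
)

z = torch.randn(1, 4, 64, 64)          # a latent (an encoder's output would go here)
recon = decoder(z)                     # back up to full image size

img = torch.randn(1, 3, 512, 512)      # the original that went into the encoder
loss = F.mse_loss(recon, img)          # train the whole autoencoder to drive this to zero
```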
And the reason an autoencoder is interesting is because we can split it in half. Let's grab just this bit, let's cut it up, and then we'll get a second half (they're not quite halves, but you know what I mean), which is just this bit. And let's say I take this image and put it through just this first half, this green half, which is called the encoder. I can take the thing that comes out of it and save it, and the thing I'm going to save is 16,384 bytes. I started with something that was 48 times bigger than that, 786,432 bytes, and I've turned it into 16,384 bytes. I could now attach that to an email, say, or whatever, and I've got something that's 48 times smaller than my original picture. So what happens for the person who receives these 16,384 bytes? Well, as long as they have a copy of the decoder on their computer, they can feed those bytes into the decoder and get back the original image. So what we've just done is create a compression algorithm. That's pretty amazing, isn't it? And in fact these compression algorithms work extremely, extremely well. Notice that we didn't train this on just this one image; we've trained it on, say, millions and millions of images. So you and I both need to have a copy of these two neural nets, but now we can share thousands of pictures with each other by sending just the 16,384 byte versions. We've created a very powerful compression algorithm. And so maybe you can see where this is going: if this thing here contains all of the interesting and useful information of the image in 16,384 bytes, why on earth would we train our U-Net with 786,432 pixels of information? And the answer is: we wouldn't, that would be stupid. Instead, we're going to do this entire thing using our encoded version of each picture. So if we want to train this U-Net on 10 million pictures, we put all 10 million pictures through the autoencoder's encoder, so
we've now got 10 million of these smaller things, and then we feed them into the U-Net, training hundreds or thousands of times, to train our U-Net. And so what will that U-Net now do? Something slightly different to what we described: it does not any more take a somewhat noisy image; instead, it takes a somewhat noisy one of these. It probably helps to give this thing a name, and the name we give it is latents: these are called the latents. So instead, the input is somewhat noisy latents, and the output is still the noise. And so we can now subtract the noise from the somewhat noisy latents, and that gives us the actual latents. We can then take the output of the U-Net and pass it into our autoencoder's decoder, because that's something which takes latents and turns them into a picture. So the input to this is a small latents tensor, and the output is a large image. This whole thing isn't simply going to be called an autoencoder: it's going to have the name the VAE, and we'll learn why later; those details aren't too important, but let's put its correct name here: the VAE's decoder. You're only going to need the encoder of the VAE if you're training a U-Net; if you just want to do inference, like we did today, you're only going to need the decoder of the VAE. So this whole thing of latents is entirely optional; the thing we described before works fine. But generally speaking, we would rather not use more compute than necessary, so unless you're trying to sell the world a room full of TPUs, you would probably rather everybody was doing stuff on the thing that's 48 times smaller. So the VAE is optional, but it saves us a whole lot of time and a whole lot of money, so that's good. Okay, what's next? Well, there's something else, which is that in the first half of today's lesson we weren't just saying "produce me an image"; we were asking for an image of something specific.
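Putting the pieces together, the latent workflow just described looks like this in outline. Here `vae_encoder`, `unet`, and `vae_decoder` are placeholders for the three trained networks, and the fixed-step denoising loop is the same crude sketch as before, not the real sampler.

```python
import torch
import torch.nn.functional as F

def train_step(vae_encoder, unet, images, opt):
    # work on the 48x-smaller latents instead of raw 512x512x3 pixels
    with torch.no_grad():
        latents = vae_encoder(images)              # (B, 4, 64, 64)
    noise = torch.randn_like(latents)
    amount = torch.rand(latents.size(0), 1, 1, 1)
    noisy_latents = latents + amount * noise
    loss = F.mse_loss(unet(noisy_latents), noise)  # U-Net still just predicts noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

@torch.no_grad()
def generate(unet, vae_decoder, steps=50, c=0.1):
    z = torch.randn(1, 4, 64, 64)                  # pure-noise latents
    for _ in range(steps):
        z = z - c * unet(z)                        # denoise entirely in latent space
    return vae_decoder(z)                          # only the decoder is needed at inference
```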
A tiny teddy bear riding a horse, say. So how does that bit work? The way that bit works is actually, on the whole, pretty straightforward. Let's think about how we could do exactly that for our MNIST example. How could we set this up so that, rather than just feeding in noise and getting back some digit, it gives us a particular digit? What if we wanted to pass in the literal number three, plus some noise, and have it attempt to generate a handwritten three for us? How would we do that? Well, what we could do is, way back here at the input to this model, in addition to passing in the noisy input, let's also pass in a one-hot-encoded version of what digit it is. So we're now passing two things into this model. Previously this neural net took as input just the pixels, but now it's going to take in the pixels plus which digit it is, as a one-hot-encoded vector. So it's still going to learn to predict what the noise is, but it's going to have some extra information: it's going to know what the original image was. We would expect this model to be better at predicting noise than the previous one, because we're giving it more information: this was a three, this was a six, this was a seven. So this neural net is going to learn to estimate noise better by taking advantage of the fact that it knows what the actual input was. And why is that useful? The reason it's useful is that now, when we feed in the number three (the actual digit three as a one-hot-encoded vector) plus noise, after this has been trained, our model is going to say the noise is everything that doesn't represent the number three, because that's what it's learned to do. So that's a pretty straightforward way to give it, "guidance" is the word we use, about what it is we're actually trying to remove the noise from, and we can then use that guidance to guide it as to what image we're trying to create. And that's the basic idea.
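The "pass in the digit too" idea can be sketched in a few lines of plain Python. Everything here is illustrative: the 28x28 MNIST size is assumed, and the noisy pixels are just placeholder numbers:

```python
# Sketch of the conditioned input described above: the noisy pixels plus a
# one-hot-encoded digit, concatenated into one input vector for the model.
# The 28x28 size is MNIST's; the pixel values are just placeholders.
def one_hot(digit, n_classes=10):
    return [1.0 if i == digit else 0.0 for i in range(n_classes)]

noisy_pixels = [0.5] * (28 * 28)          # stand-in for a noisy handwritten "3"
model_input = noisy_pixels + one_hot(3)   # what the conditioned model would see
print(len(model_input))                   # 794: 784 pixels + 10 one-hot entries
```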
Now, the problem is, if we want to create a picture of a cute teddy bear, we've got a problem. It was easy enough to pass the literal digit eight into our neural net, because we can just create a one-hot-encoded vector in which position number eight is a one and everything else is a zero. But how do we do that for "a cute teddy"? We can't. We can't enumerate every possible sentence that could be uttered in the whole world and create a one-hot-encoded version of each one, because that would take a vector that is too long, to say the least. So we have to do something else to turn this into an embedding, something other than grabbing a one-hot-encoded version of it. So what do we do? What we're going to do is try to create a model that can take a sentence like "a cute teddy" and return a vector of numbers that in some way represents what cute teddies look like. And the way we're going to do that is we're first going to surf the internet and download images. So here are four examples of images that I found on the internet, and each of these images had an image tag next to it, and if people are being good, they also added an alt tag, to help with accessibility and maybe for SEO purposes. So the alt tag for this one probably said something like "a graceful swan", and the alt tag for this might have been "a scene from Hitchcock's The Birds", and the alt tag for this might have been "Jeremy Howard", and the alt tag for this might have been "fast.ai's logo". And we could do that for millions and millions and millions of images that we find on the internet. So what we can now do with these is create two models: one model which is a text encoder, and one model which is an image encoder. Again, these are neural nets; we don't care what their architectures are. We know they're just black boxes which contain weights, which means they need inputs and
outputs and a loss function, and then they'll do something. Once we've defined the inputs, outputs, and loss function, the neural nets will do something. So here's a really interesting idea: what if we take this image, and we also take the text "a graceful swan", and we feed these into their respective models? Initially, of course, the models have random weights, which means they're going to spit out random features, a vector of random rubbish, because we haven't trained them yet. And we can do the same thing with the Hitchcock scene: we pass the image of the scene from Hitchcock in, and we pass in the words "a scene from Hitchcock", and they'll give us two other vectors. And now we can do something really interesting: we can line these up. Here are all of our images, and then we have our texts: we've got "graceful swan", we've got "Hitchcock", we've got "Jeremy Howard", and we've got "fast.ai logo". Now, ideally, when we pass the graceful swan through our image model, what we'd like is for it to create a set of embeddings that are a good match for the text "graceful swan". When we pass the scene from Hitchcock through our image model, we'd like it to return embeddings that are similar to the embeddings for the text "a scene from Hitchcock". And ditto for the picture of Jeremy Howard versus the name "Jeremy Howard", and ditto for the fast.ai logo image and the words "fast.ai logo". In other words, for this particular combination here, we'd like this one's features and this one's features to be similar. So how do we tell if two vectors are similar? Well, what we can do is simply multiply them together element-wise and add the results up, and this thing is called the dot product. So we could take the features from the image model for this one and the features from the text model for
the words "graceful swan", and take their dot product, and we want that number to be nice and big. Likewise, the scene-from-Hitchcock image's features should be very similar to the features for the text "a scene from Hitchcock", so we want their dot product to be nice and big too, and ditto for everything on this diagonal. On the other hand, a graceful swan picture should not have embeddings similar to the text "a scene from Hitchcock", so that dot product should be nice and small, and ditto for everything else off the diagonal. And so perhaps you can see where this is going: if we add up all of these diagonal entries, and then subtract all of these off-diagonal ones, we have a loss function. For this loss to be good, the weights of our text encoder and image encoder will need to spit out features that are very similar for the pairs that go together, and not similar for the things that are not paired. And if we can do that, then we're going to end up with a text encoder where things like "a graceful swan", "some beautiful swan", and "such a lovely swan" all give very similar embeddings, because they all represent very similar pictures. So what we've now done is successfully create two models that, together, put text and images into the same space: we've got a multimodal set of models, which is exactly what we wanted. So now we can take our "a cute teddy", feed it in here, get out some features, and that is what we'll use instead of those one-hot-encoded vectors when we train our U-Net. And then we can do exactly the same thing with guidance: we can pass in the text encoder's feature vector for "a cute teddy", and the U-Net will turn the noise into something similar to things it's previously seen that are cute teddies. So the pair of models used here is called CLIP, and this thing where we want the matched dot products to be bigger and the mismatched ones to be smaller is called a contrastive loss.
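Here's a tiny plain-Python sketch of that contrastive idea, with made-up two-dimensional feature vectors. (Real CLIP uses a temperature and a symmetric cross-entropy rather than this raw sum, but the "diagonal big, off-diagonal small" structure is the same.)

```python
# A simplified sketch of the contrastive loss: dot products of paired
# image/text features (the diagonal) should be big, everything else small.
# These tiny 2-D feature vectors are made up purely for illustration.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

img_feats = [[1.0, 0.0], [0.0, 1.0]]      # image encoder outputs, one per image
txt_feats = [[0.9, 0.1], [0.2, 0.8]]      # text encoder outputs, paired by index
sims = [[dot(i, t) for t in txt_feats] for i in img_feats]
diagonal = sum(sims[k][k] for k in range(len(sims)))   # matched pairs
off_diag = sum(sum(row) for row in sims) - diagonal    # mismatched pairs
loss = off_diag - diagonal                # minimising this pulls pairs together
print(sims, loss)
```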
And now you know where the "CL" in CLIP comes from. So here we have a CLIP text encoder: its input is some text, and its output is an embedding, just some features, where pieces of text with similar meanings give us similar embeddings. We're nearly done, so let's clear a bit more space. We've got the U-Net, which can denoise somewhat noisy latents (including pure noise) into un-noisy latents; we've got a decoder that can take latents and create an image; and we've got a text encoder which allows us to train a U-Net that's guided by captions. So the last thing we need to cover is how exactly we do this inference process. Once we've got something that gives us the gradients we want (by the way, these gradients are often called the score function, just in case you come across that term; that's all it's referring to), how exactly do we go about the process? Unfortunately, the language used around this is weird and confusing, so ideally you'll learn to look past that. In particular, the language you'll see a lot talks about "time steps", and you'll notice that during our training process we never used any concept of time steps. This is basically a hangover from the particular way the math was formulated in the first papers; there are lots of other ways to formulate it, and during the course, on the whole, we'll avoid the term "time steps". But let's see what time steps are, even though they've got nothing to do with time in real life. Consider the fact that we used varying levels of noise: some things were very noisy, some were hardly noisy at all, some had no noise, and some (which I haven't drawn here) would have been pure noise. You could create a kind of noising schedule where, along here, you put, say, the numbers from one to a thousand, and we'll call this t, and maybe we
randomly pick a number from one to a thousand, and then look it up on this noise schedule, which will be some monotonically decreasing function. Let's say we happened to randomly pick a four: we'd look up here to find where that is, look over here, and this would return to us some sigma, which is the amount of noise to use if you happened to get a four. So if you happen to get a one, you're going to get a whole lot of noise, and if you happen to get a thousand, you're going to get hardly any noise. Remember, when we were training, we were going to pick a random amount of noise for every image, so this is one way to do that: pick a random number from one to a thousand, look it up on this function, and that tells us how much noise to use. This t is what people refer to as the time step. Nowadays you don't really have to do it that way, and a lot of people are starting to get rid of this idea altogether; some people will instead simply say how much noise there was. Normally we would think of using sigma for the standard deviation of a Gaussian, or normal distribution, but actually much more common here is the Greek letter beta. So if you see something talking about beta, it's just saying, for that particular image, what standard deviation of noise was used when it was being trained (slightly hand-wavy, but close enough). And so, each time you're going to create a mini-batch to pass into your model: you randomly pick an image from your training set; you randomly pick either an amount of noise directly or, in some models, a t that you then look up to get an amount of noise; you apply that amount of noise to each image; and then you pass that mini-batch into your model to train it. That trains the weights in your model so it learns to predict noise. And then you come to inference time; inference is when you're generating a picture from pure noise.
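The pick-a-t-then-look-up-sigma step might look like the sketch below. The linear schedule here is invented purely for illustration (real schedules have particular shapes); the only property that matters for the sketch is that it's monotonically decreasing, so t=1 means lots of noise and t=1000 means almost none:

```python
import random

# Sketch of the training-time noise lookup described above. The linear
# formula for sigma is invented for illustration; all that matters is that
# it's monotonically decreasing: t=1 -> lots of noise, t=1000 -> almost none.
def sigma_for_t(t, t_max=1000, sigma_max=10.0):
    return sigma_max * (1 - (t - 1) / (t_max - 1))

t = random.randint(1, 1000)      # pick a random "time step" for this image
sigma = sigma_for_t(t)           # how much noise to use for this image
# noisy = image + sigma * noise  # then noise the image with that sigma
print(t, sigma)
```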
At inference time, your model is basically starting here, with as much noise as possible, and you want it to remove that noise. But what it does in practice, as we saw in our notebook, is create some hideous and rather random-looking thing. Let's remind ourselves what that looked like: this is what it created when we tried to do it in one step. So remember what we then do: we say, okay, what's the prediction of the noise? And then we multiply that prediction of the noise by some constant; it's kind of like a learning rate, except we're not updating weights now, we're updating pixels, and we subtract the result from the pixels. The model didn't actually predict the image; what it actually did was predict the noise, so that we could subtract that from the noisy image to give us the denoised image. And we don't actually subtract all of it: we multiply it by a constant, and we get a somewhat less noisy image. The reason we don't jump all the way to the best image we can find is that things that look like this never appeared in our training set, and since they never appeared in our training set, our model has no idea what to do with them. Our model only knows how to deal with things that look like somewhat noisy latents, and that's why we subtract just a bit of the noise, so that we still have a somewhat noisy latent. This process then repeats a bunch of times, and questions like "what do we use for this constant c?" and "how do we go from the prediction of the noise to the thing we subtract?" are exactly the kinds of things you decide in the actual sampler; the sampler deals with both how we add the noise and how we subtract it. Now, there are a few things that might be jumping into your head at this point, if you're anything like me. One is that, gosh, this looks an awful lot like deep learning optimizers. In a deep learning optimizer, this constant is called the learning rate.
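That repeated subtract-a-bit-of-the-predicted-noise loop can be sketched like this. The `predict_noise` function is a made-up stand-in for the trained model, just so the loop runs end to end:

```python
# Sketch of the sampling loop: predict the noise, subtract only a fraction
# of it (the constant c), and repeat. `predict_noise` is a made-up stand-in
# for the trained model, just so the loop runs end to end.
def predict_noise(pixels):
    return [p * 0.1 for p in pixels]      # pretend prediction: 10% of each pixel

pixels = [4.0, -2.0, 7.5]                 # "pure noise" starting point
c = 0.5                                   # step-size constant, like a learning rate
for step in range(10):
    noise = predict_noise(pixels)
    pixels = [p - c * n for p, n in zip(pixels, noise)]
print(pixels)                             # values shrink each step, but never jump to zero
```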
And with learning rates, we have some neat tricks, where we say, for example: if you change the same parameters by a similar amount over multiple steps, maybe you should increase the amount you change them by. This concept is something we call momentum, and we'll be doing all of this from scratch during the course, don't worry. In fact, we've got even better ways of doing it, where we say: what happens as the variance changes? Maybe we can look at that as well, and that gives us something called Adam. These are types of optimizer. So you might be wondering: could we use these kinds of tricks here? And the answer, based on our very early research, is yes, yes we can. The whole world that Stable Diffusion and all these diffusion-based models came from is a very different world of math: the world of differential equations. And there are a whole lot of very parallel concepts in the world of differential equations, which is really all about taking these little steps, little steps, little steps, and trying to figure out how to take bigger steps. So differential equation solvers use a lot of the same kinds of ideas, if you squint, as optimizers. One thing that differential equation solvers do which is kind of interesting, though, is that they tend to take t as an input. And in fact (I've actually lied to you), pretty much all diffusion models don't just take the input pixels and the digit or the caption or the prompt; they also take t. The idea is that the model will be better at removing the noise if you tell it how much noise there is, and remember, t is related to how much noise there is. I very strongly suspect that this premise is incorrect, because if you think about it, for a complicated, fancy neural net, figuring out how noisy something is is very, very straightforward. So I very much doubt we actually need to pass in t, and as soon as you stop doing that, things stop looking like differential equations and start looking more like optimizers.
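For reference, here's what the momentum trick looks like in plain Python for a single parameter: when the gradient keeps pointing the same way, the accumulated "velocity" grows, so the steps get bigger:

```python
# Plain-Python sketch of momentum for a single parameter: the velocity
# accumulates past gradients, so repeated steps in the same direction grow.
def momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    velocity = beta * velocity + grad     # remember the direction of past steps
    param = param - lr * velocity
    return param, velocity

p, v = 0.0, 0.0
for _ in range(5):                        # the same gradient every step...
    p, v = momentum_step(p, grad=1.0, velocity=v)
print(p, v)                               # ...so the velocity grows well past 1.0
```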
And so actually Johno has started playing with this and experimenting a bit, and the early results suggest that, yes, when we rethink the whole thing as being about learning rates and optimizers, maybe it actually works a bit better. In fact, there are all kinds of things we can do once we stop thinking of these as differential equations and stop worrying so much about the math, about Gaussians and whatever; we can really switch things around. For example, we decided, for no particularly obvious reason, to use MSE. The truth is, in statistics and machine learning, almost every time you see somebody use MSE, it's because the math worked out more easily that way, not because it's necessarily the better thing to do. Now, MSE does fall out quite nicely as a good thing to do under some particular premises, so it's not totally arbitrary. But what if we instead used more sophisticated loss functions, where we actually asked: after we subtract the outputs, how good is this, really? Does it look like a digit? Does it have qualities similar to a digit? We'll learn about this stuff; there are things called, for example, perceptual losses. Another question: do we really need to do this thing where we put noise back in at all? Could we instead use the prediction directly? These are all things that suddenly become possible when we start thinking of this as an optimization problem rather than a differential equation solving problem. So for those of you who are interested in doing novel research, this is some of the kind of stuff we're starting to research at the moment, and the early results are extremely positive, both in terms of how quickly we can do things and what kind of outputs we seem to be getting. Okay, I think that's probably a good place to stop. What we're going to do in the next lesson is finish our journey into
this notebook, to see some of the code behind the scenes of what's in a pipeline. So we'll look inside the pipeline and see exactly what's going on, a bit more in terms of the code. And then we're going to do a huge rewind, right back to the foundations, and build up under some very strict ground rules. Our ground rules will be: we're only allowed to use pure Python, the Python standard library, and nothing else, and we'll build up from there until we've recreated all of this, and possibly explored some new research directions along the way. So that's our goal. Strap in, and see you all next time.