Hi everybody. Welcome to lesson 11. This is the third lesson in part 2, depending on how you count things: there's been a lesson A and a lesson B, so it's kind of the fifth lesson in part 2. I don't know what it is, so we'll just stick to calling it lesson 11 and avoid getting too confused. I'm already confused. My goodness, I've got so much stuff to show you. I'm only going to show you a tiny fraction of the cool stuff that's been happening on the forum this week, but it's been amazing. I'm going to start by sharing this beautiful video from John Robinson, and I should say I've never seen anything like this before. As you can see, it's very stable and it's really showing this beautiful movement between seasons. What I did on the forum was say to folks, hey, you should try interpolating between prompts, which is what John did. And I also said you should try using the last image of the previous prompt interpolation as the initial image for the next prompt. And anyway, here it is; it came out beautifully. John was the first to get that working, so I was very excited about that. The second one I wanted to show you is this really amazing work from Seb Derhey, Sebastian, who did something that I've been thinking about as well, and I'm really thrilled that he also thought about this. He noticed that this update we do, unconditional embeddings plus guidance times (text embeddings minus unconditional embeddings), has a bit of a problem, which is that it gets big. To show you what I mean by "it gets big", imagine that we've got a couple of vectors on this chart here. We've got the original unconditional piece here, U. So let's say this is U. And then we add to that some amount of T minus U. So if T is huge, say, and we've got U again, then the difference between those is the vector which goes here. Now, you can see that if there's a big difference between T and U, then the eventual update which actually happens (oopsie daisy, I thought that was going to be an arrow, let's try that again), the eventual update which happens, is far bigger than the original update, and so it jumps too far. So this idea is basically to say: let's make it so that the update is no longer than the original unconditioned update would have been. We're going to be talking more about norms later, but basically we scale it by the ratio of the norms. And what happens is we start with this astronaut and we move to this astronaut. It's a subtle change, but you can see there's a lot more (before, after) texture in the background. And on the Earth there's a lot more detail: before, after. You see that? Even little things like the bridle reins, which were pretty flimsy before, now look quite proper. So it's made quite a big difference just to get this scaling correct. There are a couple of other things that Sebastian tried, which I'll explain in a moment, but you can see how some of them actually resulted in changing the image. And this one's actually important, because the poor horse used to be missing a leg, and now it's not missing a leg. So that's good. And here's the detailed one with its extra leg. So how did he do this?
Well, what he did was start with this: the unconditioned prediction plus the guidance times the difference between the conditional and unconditioned. Then, as we discussed, the next version we saw is to basically just take that prediction and scale it according to the difference in the lengths; the norms are basically the lengths of the vectors. So this is the second one, the one I did in lesson nine, and you'll see it's gone from here: when we go from 1A to 1B, look at this, this boot's gone from nothing to having texture, this thing (whatever the hell it is) has suddenly got texture, and look, we've now got proper stars in the sky. It's made a really big difference. Then the second change is not just to rescale the whole prediction, but to rescale the update. And when we rescale the update it, not surprisingly, changes the image entirely, because we're now changing the direction it goes in. So, I don't know, is this better than this? Maybe, maybe not. But I think so, particularly because this was the difference that added the correct fourth leg to the horse before. And then we can do both: we can rescale the difference and then rescale the result, and we get the best of both worlds. As you can see, big difference: we get a nice background, this weird thing on his back has actually become an arm, that's not what a foot looks like and that is what a foot looks like. So these little details make a big difference, as you can see. So that's one really cool, or two really cool, new things. New things tend to have wrinkles, though. Wrinkle number one is that after I shared Sebastian's approach on Twitter, Ben Paul, who's at Google Brain I think, if I remember correctly, pointed out that this already exists. He thinks it's the same as what's shown in this paper, which is a diffusion model for text-to-speech. I haven't read the paper yet to check whether it's got all the different options or whether it's checked them all out like this. So maybe this is reinventing something that already existed, but putting it into a new field, which will be interesting. Anyway, hopefully folks on the forum can help figure out whether this paper is actually showing the same thing or not. The other interesting thing was that John Robinson got back in touch on the forum and said, oh, actually that tree video doesn't do what we think it does at all. There's a bug in his code, and despite the bug it accidentally worked really well. So now we're in this interesting position of trying to figure out how he created such a beautiful video by mistake: reverse engineering exactly what the bug did, and then figuring out how to do that more intentionally. And this is great, right? It's really good to have a lot of people working on something, and the bugs often tell us about new ideas. So that's very interesting. So watch this space; we'll find out what John actually did and how come it worked so well. And then there's something I just saw, like two hours ago, on the forum, which I'd never thought of before, though I'd thought of something a little bit similar. Rekha Prashanth said (so you can see all the students are really bouncing ideas off each other), it's interesting, we're all doing different things with the guidance scale.
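To make Sebastian's variants concrete before we look at Rekha's idea, here's roughly what they might look like in code. This is a minimal sketch of the idea as described above, not Sebastian's actual implementation; the names pred_uncond, pred_text and guidance_scale are placeholders, and exactly which norms get compared against which is my own guess.

```python
def guided_pred(pred_uncond, pred_text, guidance_scale=7.5,
                rescale_update=False, rescale_result=False):
    # Inputs are PyTorch tensors (the noise predictions from the U-Net).
    # Standard classifier-free guidance: u + g * (t - u)
    diff = pred_text - pred_uncond
    if rescale_update:
        # Variant: keep the update no bigger than the unconditional prediction,
        # by scaling the difference by the ratio of the norms
        diff = diff * pred_uncond.norm() / diff.norm()
    pred = pred_uncond + guidance_scale * diff
    if rescale_result:
        # Variant: scale the combined prediction back to the unconditional norm
        pred = pred * pred_uncond.norm() / pred.norm()
    return pred
```

You'd call something like this wherever your sampling loop currently computes u + g*(t - u); setting both flags corresponds to the "do both" version described above.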
What if we take the guidance scale and, rather than keeping it at 7.5 all the time, we reduce it? This is a little bit similar to something I suggested to John a few weeks ago. He was doing some stuff with modifying gradients based on additional loss functions, and I said to him, maybe you should just use them occasionally at the start, because I think the key thing is that once the model roughly knows what image it's trying to draw, even if it's noisy, you can let it do its thing. And that's exactly what's happening here. Rekha's idea is to say, let's decrease the guidance scale so that at the end it's basically zero, and so once it's going in the right direction we let it do its thing (there's a rough sketch of this scheduling idea coming up in a moment). So this little doggy is with the normal 7.5 guidance scale. Now have a look, for example, at its eye here: it's pretty decent but uninteresting, pretty flat. And if I go to the next one, as you can see, now actually look at the eye, that's a proper eye, where before it was totally glassy black. Or look at all this fur: very textured now, where previously it was very out of focus. So this is again a new technique. I love this; you folks are trying things out, and some things are working and some things aren't, and that's all good. I kind of feel like you're going to have to slow down, because I'm having trouble keeping up with you all, but apart from that, this is great. Good work. I also wanted to mention, on a different theme, to check out Alex's notes on the lesson, because I thought he's done a fantastic job of showing how to study a lesson. What Alex did, for example, was make a list in his notes of all the different steps we did as we started the from-the-foundations work: what each thing is, which library it comes from, links to the documentation. And I know that Alex's background is actually history, not computer science, so for somebody moving into a different field like this, this is a great idea, particularly to be able to look at, okay, what are all the things that I'm going to have to learn and read about. Then he did something which we always recommend, which is to try the lesson on a new dataset. He very sensibly picked the Fashion MNIST dataset, which is something we'll be using a lot in this course, because it's a lot like MNIST and it's just different enough to be interesting. And he described in his notes how he went about doing that. Then something else I thought was interesting in his notes, at the very end, was that he just jotted down my tips. It's very easy when I throw a tip out there to think, oh, that's interesting, that's good to know, and then it can disappear. So here's a good way to make sure you don't forget about all the little tricks. I think I've put those notes in the forum wiki, so you can check them out if you'd like to learn from them as well. So I think this is a great role model. Good job, Alex. Okay, so during the week, Jono taught us about a new paper that had just come out called DiffEdit. He told us he thought this was an interesting paper, and it came out during the week, and I thought it might be good practice for us to try reading this paper together. So let's do that. So here's the paper, DiffEdit.
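And before we dive into the paper, here's roughly what that decreasing-guidance-scale idea might look like in code. This is a minimal sketch; the linear schedule and the names are my own assumptions, not necessarily what Rekha did.

```python
def guidance_at_step(i, num_steps, start=7.5, end=0.0):
    # Linearly decay the guidance scale from `start` down to `end`,
    # so that by the final steps the model is barely guided at all.
    return start + (end - start) * i / max(num_steps - 1, 1)

# Inside the sampling loop, something like:
#   g = guidance_at_step(i, num_steps)
#   pred = u + g * (t - u)
```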
And you'll find that probably the majority of papers you come across in deep learning will take you to arXiv. arXiv is a preprint server, so these are papers that have not been peer reviewed. I would say in our field we don't generally, or I certainly don't, care about that at all, because we have code, we can try it, we can see whether it works or not. Most papers are very transparent about here's what we did and how we did it, and you can replicate it. And it gets a huge amount of peer review on Twitter: if there's a problem, generally within 24 hours somebody has pointed it out. So we use arXiv a lot, and if you wait until something's been peer reviewed you'll be way out of date, because this field is moving so quickly. So here it is on arXiv, and we could read it by clicking on the PDF button. I don't do that. Instead I click on this little button up here, which is the save-to-Zotero button. I figured I'd show you my preferred workflow; you don't have to do the same thing, there are different workflows, but here's one that I find works very well. Zotero is a piece of free software that you can download for Mac, Windows and Linux, and you install a Chrome connector. Oh, Tanishk's saying the button's covered. All right, so in my Chrome menu bar (sorry, not taskbar) I have a button that I can click that says save to Zotero, and when I click it, I'll show you what happens. After I've downloaded this, the paper will automatically appear here in this software, which is Zotero. And so here it is: DiffEdit. You can see it's got the abstract, the authors, where it came from. Later on, if I want to check some detail, I can go back and see the URL, click on it, and it pops up. In this case what I'm going to do is double-click on it, and that brings up the paper. Now, the reason I like to read my papers in Zotero is that I can annotate them, edit them, tag them, put them in folders and so forth, and also add them to my reading list directly from my web browser. As you can see, I've started this fast diffusion folder, which is actually a group library that I share with the other folks working on this fast diffusion project we're all doing together, so we can all see the same paper library. Maribu on the YouTube chat is asking, is this better than Mendeley? Yeah, I used to use Mendeley and it's kind of gone downhill; I think Zotero is far, far better, but they're both very similar. Okay, so we double-click on it, it opens up, and here is the paper. Reading a paper is always extremely intimidating, and you just have to do it anyway. You have to realize that your goal is not to understand every word. Your goal is to understand the basic idea well enough that, for example, when you look at the code (hopefully it comes with code, most things do), you'll be able to see how the code matches up to it, and that you could try writing your own code to implement parts of it yourself. Over on the left, you can open up the sidebar. I generally open up the table of contents and get a bit of a sense of, okay, there are some experimental results, there are some theoretical results, introduction, related work, okay, it tells us about this new DiffEdit thing, some experiments. Okay, so that's a pretty standard structure that you would see in papers.
So I would always start with the abstract. Okay, so what's it saying this does? Generally there's going to be a background sentence or two about how interesting the field is; it's just saying image generation is great, which is fine. And then they're going to tell us what they're going to do, which is to create something called DiffEdit. So what is it for? It's going to use text-conditioned diffusion models. We know what those are now, that's what we've been using: we type in some text and get back an image that matches the text. But this is going to be different: it's for the task of semantic image editing. Okay, we don't know what that is yet, so let's put that aside and make sure we understand it later. The goal is to edit an image based on a text query. Oh, okay. So we're going to edit an image based on text. How on earth would you do that? They tell us right away what this is. Semantic image editing is an extension of image generation with an additional constraint, which is that the generated image should be as similar as possible to the given input. And generally, as they've done here, there's going to be a picture that shows us what's going on. In this picture you can see an example: here's an input image, and originally it was attached to a caption, "a bowl of fruits". We want to change this into a bowl of pears, so we type "a bowl of pears" and it generates, oh, a bowl of pears. Or we could change it from a bowl of fruits to a basket of fruits, and oh, it's become a basket of fruits. Okay, so I think I get the idea: what it's saying is that we can edit an image by typing what we want that image to represent. This actually looks a lot like the paper we looked at last week, so that's cool. The abstract says that currently (so I guess there are current ways of doing this) they require you to provide a mask. That means you have to basically draw the area you're replacing, which sounds really annoying. But our main contribution, so what this paper does, is to automatically generate the mask: you simply type in the new query and get the new image. So that actually sounds really impressive. Now, if you read the abstract and you think, I don't care about doing that, then you can skip the paper, or look at the results, and if the results don't look impressive, just skip the paper. So that's your first point where you can be like, okay, we're done. But in this case this sounds great and the results look amazing, so I think we should keep going. Okay, it achieves state-of-the-art performance, of course. Fine, whatever. Okay. So the introduction to a paper is going to try to give you a sense of what they're trying to do, and this first paragraph here is just repeating what we've already read in the abstract and what we see in Figure 1. It's saying that we can take a text query like "a basket of fruits"; see the examples. All right, fine, we'll skip through there. The key thing about academic papers is that they are full of citations. You should not expect to read all of them, because if you do, then each of those citations is full of citations, and they're full of citations, and before you know it you've read the entire academic literature, which has taken you 5,000 years.
So for now, let's just recognize that it says text-conditional image generation is undergoing a revolution, and here are some examples. Fine, we actually already know that. DALL-E is cool, latent diffusion (that's what we've been using) is cool, and Imagen, apparently that's cool. All right, so we kind of know that. Generally there's this "the area we're working on is important" section; in this case we already agree it's important, so we can skip through it pretty quickly. Vast amounts of data are used: yes, we know. Diffusion models are interesting: yes, we know that. They denoise starting from Gaussian noise: we know that. So you can see there's a lot of stuff that, once you're in the field, you can skip over pretty quickly. You can guide it using CLIP guidance: yeah, that's what we've been doing, we know about that. Oh wait, this is new: or by inpainting, by copy-pasting pixel values outside a mask. All right, so there's a new technique that we haven't done, but I think it makes a lot of intuitive sense: during the diffusion process, if there are some pixels you don't want to change, such as all the ones that aren't orange here, you can just paste them back in from the original after each stage of the diffusion. All right, that makes perfect sense. If I want to know more about that I could always look at this paper, but I don't think I do for now. Okay, and again it's just repeating something they've already told us, that these approaches require us to provide a mask, so that's a bit of a problem. And then, this is interesting: it also says that when you mask out an area, that's a problem, because if you're trying to, for example, change a dog into a cat, you want to keep the animal's color and pose. So this is a new technique which is not deleting a section and replacing it with something else, but is actually going to take advantage of knowledge about what that thing looked like. So that's two cool new things. Hopefully at this point we know what they're trying to achieve. If you don't know what they're trying to achieve when you're reading a paper, the paper won't make any sense, so again, that's a point where you should stop. Maybe this is not the right time to be reading this paper; maybe you need to read some of the references; maybe you need to look at more of the examples. You can always skip straight to the experiments, and I often do. In this case I don't need to, because they've put enough examples on the very first page for me to see what it's doing. So yeah, don't always read it from top to bottom. Okay. So they've got some examples of conditioning a diffusion model on an input without a mask. For example, you can use a noised version of the input as a starting point. Hey, we've done that too. So as you can see, we've already covered a lot of the techniques that they're referring to here. Something we haven't done, but which makes a lot of sense, is that we can use the distance to the input image as a loss function. Okay, that makes sense to me, and there are some references here. All right. So we're going to create this new thing called DiffEdit, it's going to be amazing, wait till you check it out. Okay, fine. So that's the introduction; hopefully you found that useful for understanding what we're trying to do. The next section is generally called related work, as it is here, and that's going to tell us about other approaches.
So if you're doing a deep dive, this is a good thing to study carefully. I don't think we're going to do a deep dive right now, so I think we can happily skip over it. We can do a quick glance: oh, image editing includes colorization, retouching, style transfer. Okay, cool, lots of interesting topics, definitely getting more excited about this idea of image editing, and there are some different techniques. You can use CLIP guidance. Okay, but that can be computationally expensive. We can use diffusion for image editing. Okay, fine. We can use CLIP to help us. So there's a lot of repetition in these papers as well, which is nice because we can skip over it pretty quickly. More about the high computational costs. Okay, so they're saying this is going to be not so computationally expensive; that sounds hopeful. And often the very end of the related work is the most interesting, as it is here, where they talk about how somebody else, concurrent to them (working at exactly the same time), has looked at a different approach. Okay, so I'm not sure we learned too much from the related work, but if you were trying to do the very, very best possible thing, you could study the related work and get the best ideas from each. Okay, now: background. This is where it starts to look scary, I think we can all agree, and this is often the scariest bit, the background. This is basically saying, mathematically, here's how the problem we're trying to solve is set up. And so we're going to start by looking at denoising diffusion probabilistic models, DDPM. Now, if you've watched lesson 9B with Waseem and Tanishk, then you've already seen some of the math of DDPM. And the important thing to recognise is that basically no one in the world is going to look at these paragraphs of text and these equations and go, oh, I get it, that's what DDPM is. That's not how it works. To understand DDPM, you would have to read and study the original paper, and then the papers it's based on, and talk to lots of people and watch videos and go to classes just like this one, and after a while you'll understand DDPM. And then you'll be able to look at this section and say, oh, okay, I see, they're just talking about this thing I'm already familiar with. So this is meant to be a reminder of something that you already know. It's not something you should expect to learn from scratch. So let me take you through these equations somewhat briefly, because Waseem and Tanishk have kind of done them already, and because pretty much every diffusion paper is going to have these equations. Oh, and I'm just going to read something that Jono's put in the chat. He says it's worth remembering the background is often written last and tries to look smart for the reviewers, which is correct, so feel free to read it last too. Yeah, absolutely. I think the main reason to read it is to find out what the different letters mean, what the different symbols mean, because they'll probably refer to them later. But in this case, I want to actually use this as a way to learn how to read math. So let's start with this very first equation. How on earth do you even read this? The first thing I'll say is that this is not an E. It's a weird-looking E, and the reason it's a weird-looking E is because it's a Greek letter.
And so something I always recommend to students is that you learn the Greek alphabet, because it's much easier to be able to actually read this to yourself. Here's another one: if you don't know that's called theta, I guess you have to read it as "circle with a line through it", and it's just going to get confusing trying to read an equation you can't actually say out loud. So what I suggest is that you learn the Greek alphabet. Let me find the right place; it's very easy to look it up on Wikipedia. Here's the Greek alphabet, and if we go down here you'll see they've all got names, and we can try to find our one, the curvy E. Here it is: epsilon. And circle with a line through it: theta. All right, so practice and you will get used to recognizing these. So you've got epsilon, theta; this one is just a weird curly L, which is used for the loss function. Okay, so how do we find out what this symbol means and what this symbol means? There are a few ways to do it. One way, which is kind of cool, is to use a program called Mathpix. Here we are: Mathpix. What it does is, you basically select anything on your screen and it will turn it into LaTeX. So that's one way: you select it on the screen, it turns it into LaTeX, and the reason it's good to turn it into LaTeX is that LaTeX is written as actual text that you can search for on Google. So that's technique number one. Technique number two is that you can download the other formats of the paper; arXiv has a "download source" option, and if we download the source, then we'll be able to actually open up that LaTeX and have a look at it. So we'll wait for that to download, and while it's happening, let's keep moving along here. In this case we've got these two bars. Can we find out what that means? We could try a few things. We could try searching for "two bars math notation". Oh, here we are, this looks hopeful: "what does this mean in mathematics", and here's a glossary of mathematical symbols, and here's the meaning of this in math. So that looks hopeful. Okay, so it definitely doesn't look like this one; it's not between two sets of letters. Ah, but it is around something. That looks hopeful. So it looks like we've found it: it's a vector norm. Okay, so then you can start looking these things up. We can search for "norm", or maybe "vector norm", and once you can actually find the term, then we kind of know what to look for. Okay, so in our case we've got this surrounding all this stuff, and then there are twos here and here. What's going on there? If we scroll through, oh, this is pretty close actually. Okay, so two bars can mean a matrix norm, otherwise a single bar is for a vector norm; that's just this glossary in particular, so it looks like we don't have to worry too much about whether it's one or two bars. Oh, and here's the definition. That's handy. So we've got the two one. All right, so it's equal to, oh, root sum of squares. So that's good to know: this norm thing means a root sum of squares. But then we've got a two up here. Well, that just means squared. Ah, so this is a root sum of squares, squared. Well, the square of a square root is just the thing itself. Ah, so actually this whole thing is just the sum of squares. It's a bit of a weird way to write it, in a sense; we could perfectly well have just written it as, you know, the sum of whatever-it-is squared. Fine. But there we go.
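So, written out, the notation we've just pieced together reads as follows. This is the standard convention as I understand it, not copied from the paper:

```latex
\|v\|_2 = \sqrt{\sum_i v_i^2}
\qquad\Rightarrow\qquad
\|v\|_2^2 = \sum_i v_i^2
```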
Okay, and then what about this thing here, the weird E thing? How would you find out what the weird E thing is? Okay, so our LaTeX has finally finished downloading, and if we open it up we can find there's a .tex file in here. Here we are: main.tex. So we'll open it. It's not the most amazingly smooth process, but what we can do is say, okay, it's just after it says "minimizing the denoising objective". So let's search for "minimizing the d". Oh, here it is, minimizing the denoising objective. So here's the LaTeX; let's get both on the screen at the same time. Okay, so here it is: \mathcal{L} equals \mathbb{E}, x naught, t, epsilon, and then here's that vertical-bar thing: epsilon minus epsilon theta of x t, and then the bar thing with the 2 and the 2. All right, so the new thing we've got is \mathbb{E}. Okay, so finally we've got something we can search for: "mathbb E". Ah, fantastic, what does \mathbb{E} mean? That's the expected value operator. Aha, fantastic. All right, so it takes a bit of fussing around, but once you've got either Mathpix working, or actually another thing you could try, because Mathpix is ridiculously expensive in my opinion, is a free version called pix2tex, which is actually a Python thing. You could even have fun playing with it, because the whole thing is just a PyTorch Python script. It even describes how they've used a transformers model, and you can train it yourself in Colab and so forth. But basically, as you can see, you can snip something and convert it to LaTeX, which is pretty awesome. So you could use this instead of paying the Mathpix guys. Anyway, we are on the right track now, I think. So, expected value, and then we can start reading about what expected value is, and you might actually remember it, because we did a bit of it in high school; at least in Australia we did. The expected value of something is saying: what's the likely value of that thing? So, for example, let's maybe jump over here. Let's say you toss a coin, which could be heads or it could be tails, and you want to know how often it's heads. Maybe we call heads one and tails zero, so you toss it and you get one, zero, zero, one, one, zero, one, zero, one, and so forth, and then you can calculate the mean of that. So that's x, and you can calculate x-bar, the mean, which would be the sum of all that divided by the count of all that. So counting the ones it'd be one, two, three, four, five, so 5, divided by one, two, three, four, five, six, seven, eight, nine tosses. Okay, so that would be the mean. But the expected value is like: what do you expect to happen? And we can calculate that by adding up, for all of the possibilities, for each possibility x, how likely x is and what score you get if you get x. So in this example of heads and tails, our two possibilities are that we either get heads or we get tails. For the version where x is heads, the probability is 0.5 and the score is one. And then what about tails? For tails the probability is 0.5 and the score is zero. And so overall the expected value is 0.5 times one, plus zero, which is 0.5. So the expected score if we're tossing a coin is 0.5, if getting heads is a win. Let me give you another example. Let's say we're rolling a die, and we want to know what the expected score is when we roll a die.
So again, we could roll it a bunch of times and see what happens, and we could sum all that up, like before, and divide it by the count, and that will tell us the mean for this particular sample. But what's the expected value, more generally? Well, again, it's the sum over all the possibilities of the probability of each possibility times its score. The possibilities for rolling a die are that you can get a 1, a 2, a 3, a 4, a 5 or a 6; the probability of each one is a sixth; and the score that you get is the face itself. And so then you can multiply all these together and sum them up, which will be one sixth plus two sixths plus three sixths plus four sixths plus five sixths plus six sixths, and that gives you the expected value of that particular thing, which is rolling a die. So that's what expected value means. All right, so that's a really important concept that's going to come up a lot as we read papers. And in particular, this subscript is telling us what all the things are that we're averaging over, what the expectation is over. And there's a whole lot of letters here; you're not expected to just know what they are. In fact, in every paper they could mean totally different things, so you have to look immediately underneath, where they'll be defined. So x0 is an image; it's an input image. Epsilon is the noise, and the noise has a mean of zero and a standard deviation of I, which, if you watched lesson 9B, you'll know is like a standard deviation of one when you're dealing with multiple normal variables. Okay, and then this is kind of confusing: epsilon just on its own is a normally distributed random variable, it's just grabbing random numbers, but epsilon theta is a noise estimator, which means it's a function. You can tell it's a function partly because it's got these parentheses and stuff right next to it, and presumably, like most functions in these papers, it's a neural network. Okay, so we're finally at a point where this actually makes perfect sense. We've got the noise, we've got the prediction of that noise, we subtract one from the other, we square it, and we take the expected value. In other words, this is mean squared error. So that's a lot of fiddling around to find out that this whole thing here is mean squared error. The loss function is the mean squared error, and unfortunately I don't think the paper ever says that; it says "minimizing the denoising objective", L-bloody-bloody-bloody, but anyway, we got there eventually. Fine. As well as learning about x naught, we also learn here about x t, and x t is the original un-noised image times some number, plus some noise times one minus that number. And hopefully you'll recognize this from lesson 9B: this is the thing where we reduce the value of each pixel and we add noise to each pixel. So that's that. All right, so I'm not going to keep going through it, but you can basically get the idea: once you know what you're looking for, the equations do actually make sense. But all this is doing, remember, is background: this is telling you what already exists, this is telling you what a DDPM is, and then it tells you what a DDIM is. DDIM: just think of it as a more recent version of DDPM. It makes some very minor changes to the way things are set up, which allow us to go faster.
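Putting the pieces we've just decoded together, the equations read roughly as follows. This is my reconstruction from the definitions above, so the exact subscripts and noise-schedule coefficients may differ slightly from the paper's notation:

```latex
\mathbb{E}[X] = \sum_x p(x)\,x
\qquad \text{(expected value: probability of each outcome times its score)}

\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\!\left[\,\big\|\epsilon - \epsilon_\theta(x_t)\big\|_2^2\,\right]
\qquad \text{(i.e. the mean squared error between the true and predicted noise)}

x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon
\qquad \text{(scale the image down a bit, add a bit of noise)}
```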
Okay, so the thing is, once we keep reading, what you'll find is that none of this background actually matters, but I thought we'd go through it just to get a sense of what's in a paper. For our purposes it's enough to know that DDPM and DDIM are kind of the foundational papers on which today's diffusion models are based. Okay, so the encoding process, which encodes an image into a latent variable, is basically adding noise; this is called DDIM encoding, and the thing that goes from the input image to the noised image they're going to call capital E with a subscript r, where r is the encoding ratio, roughly how much noise we're adding. If you use small steps, then decoding that, going backwards, gives you back the original image. Okay, so that's the stuff we've learned about; that's what diffusion models are. All right, so this looks like a very useful picture, so let's take a look and see what it says. What is DiffEdit? DiffEdit has three steps. Step one: we add noise to the input image. That sounds pretty normal. Here's our input image x0, and we add noise to it, fine. And then we denoise it. Okay, fine. Ah, but we denoise it twice. One time we denoise it using the reference text R, "horse", or this special symbol here, which means nothing at all, so either unconditional or "horse". So we do it once using the word "horse": we take this, estimate the noise, and then we can remove that noise on the assumption that it's a horse. Then we do it again, but the second time, when we calculate the noise, we pass in our query Q, which is "zebra". Wow, those are going to be very different noise estimates. The noise for "horse" is just going to be literally these Gaussian dots, because it is a horse. But if the claim is, no no, this is actually a zebra, then all of these pixels here are all wrong, they're all the wrong color, so the noise that's calculated if we say "zebra" is our query is going to be totally different from the noise if we say "horse" is our query. And so then we just take one minus the other, and here it is, here: we derive a mask based on the difference in the denoising results, and then you take that and binarize it, so basically turn it into ones and zeros. So that's actually the key idea, and it's a really cool idea: once you have a diffusion model that's trained, you can do inference on it where you tell it the truth about what the thing is, and then you can do it again but lie about what the thing is. And in your lying version, it's going to say, okay, all the stuff that doesn't match "zebra" must be noise. So the difference between the noise prediction when you say "hey, it's a zebra" versus the noise prediction when you say "hey, it's a horse" will be all the pixels where it says, no, these pixels are not zebra; the rest of it is fine, there's nothing particularly about the background that wouldn't work with a zebra. Okay, so that's step one. So then step two is that we take the horse and we add noise to it. Okay, that's this x_r thing that we learned about before. And then step three: we do DDIM decoding conditioned on the text query, using the mask to replace the background with pixel values. Ah, so this is like the idea that we heard about before: during inference, as you do diffusion from this fuzzy horse, what happens is that we do a step of diffusion inference, and then all these black pixels we replace with the noised version of the original. And we do that multiple times.
And that means that the original pixels in this black area won't get changed, and that's why you can see in this picture here and this picture here that the background's all the same and the only thing that's changed is that the horse has been turned into a zebra. So this paragraph describes it, and then you can see here it gives you a lot more detail, and the detail often has all kinds of little tips about things they tried and things they found, which is pretty cool. I won't read through all of that, because it says the same as what I've just said. One of the interesting little things they note here is that this binarized mask, this difference between the R decoding and the Q decoding, tends to be a bit bigger than the actual area where the horse is, which you can kind of see with these legs, for example. And their point is that that's actually a good thing, because often you want to slightly change some of the details around the object, so this is actually fine. All right, so we have a description of what the thing is, lots of details there, and then here's the bit that I totally skip, the bit called theoretical analysis. This is the stuff that people generally just add to try to get their papers past review; you have to have fancy math. And they're basically proving, as it says here, "insight into why this component yields better editing results than other approaches". I'm not sure we particularly care, because it makes perfect sense what they're doing, it's intuitive, and we can see it works; I don't feel like I need it proven to me. So I skip over that. So then they'll show us their experiments, telling us what datasets they ran the experiments on, and they have metrics with names like LPIPS and CSFID. You'll come across FID a lot, and this is just a version of that, where they're basically trying to score how good their generated images are. We don't normally care about that either. They care, because they need to be able to say "you should publish our paper because it has a higher number than the other people who have worked on this area". In our case, we can just say, you know, it looks good, I like it. So there's an excellent question in the chat from michelage, which is: does this only work on things that are relatively similar? And I think this is a great point; this is where understanding the method helps you know what its limitations are going to be, and that's exactly right. If you can't come up with a mask for the change you want, this isn't going to work very well, on the whole, because the pixels in the masked-out areas are going to be copied. So, for example, if you wanted to change it from "a bowl of fruits" to "a bowl of fruits with a bokeh background", or "a purple-tinged photo of a bowl of fruits", if you want the whole color to change, that's not going to work, because you're not masking off an area. So by understanding the detail here, michelage has correctly recognized a limitation, or, put another way, what this is for: it's for the things where you can say "just change this bit and leave everything else the same". All right, so there are lots of experiments. For some things you care about the experiments a lot, if it's something like classification, say, but for generation, the main thing you probably want to look at is the actual results. And often, I guess because most people read these electronically, you have to zoom in a lot on the results to be able to see whether they're really good.
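Before we look at those results more closely: if you want to have a go at step one yourself, here's a rough sketch of how I'd start. This is my own reconstruction, not the authors' code; estimate_noise stands in for however you call the text-conditioned U-Net in your lesson 9 notebook, and the normalisation and threshold details are guesses.

```python
def diffedit_mask(noisy_latents, ref_prompt, query_prompt, estimate_noise,
                  threshold=0.5):
    # Step 1 of DiffEdit (sketch): the mask is wherever the noise estimates
    # for the reference prompt ("horse") and the query prompt ("zebra")
    # disagree the most. The paper averages this over several noised versions
    # of the image; this sketch uses just one. Inputs are PyTorch tensors.
    eps_ref = estimate_noise(noisy_latents, ref_prompt)      # "this is a horse"
    eps_query = estimate_noise(noisy_latents, query_prompt)  # "no, it's a zebra"
    diff = (eps_ref - eps_query).abs().mean(dim=1, keepdim=True)  # per-pixel disagreement
    diff = diff / diff.max()                                  # normalise to [0, 1]
    return (diff > threshold).float()                         # binarise into a mask
```

Step three is then the inpainting trick from earlier: after each denoising step, something like latents = mask * latents + (1 - mask) * noised_original, so everything outside the mask stays pinned to the original image.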
So here's the input image; they want to turn this into an English foxhound. And here's the thing they're comparing themselves to, SDEdit, which has changed the composition quite a lot, whereas their version hasn't changed it at all; it's only changed the dog. And ditto here with the semi-trailer truck: SDEdit has totally changed it, DiffEdit hasn't. So you can kind of get a sense of the authors showing off what they're good at here: this is what this technique is effective at doing, changing animals and vehicles and so forth, and it does a very good job of it. All right, so then there's going to be a conclusion at the end, which I find almost never adds anything on top of what we've already read, and as you can see it's very short anyway. Now, quite often the appendices are really interesting, so don't skip over them. Often you'll find more examples of pictures; they might show you some examples of pictures that didn't work very well, stuff like that. So it's often well worth looking at the appendices; often some of the most interesting examples are there. And that's it. All right, so that is, I guess, our first full-on paper walkthrough. And it's important to remember this is not a carefully chosen paper that we picked specifically because you can handle it; this is just the most interesting paper that came out this week. So it gives you a sense of what it's really like. And for those of you who are ready to try something that's going to stretch you: see if you can implement any of this paper. There are three steps. The first step is kind of the most interesting one, which is to automatically generate a mask, and the information you have and the code that's in the lesson 9 notebook actually contain everything you need to do it. So maybe give it a go: see if you can mask out the area of a horse that does not look like a zebra. And that's actually useful in itself, because it allows you to create segmentation masks automatically, so that's pretty cool. And then, if you get that working, you can go and try to do step two, and if you get that working, you can try to do step three. This only came out this week, so I haven't really seen examples of easy-to-use interfaces to it. So here's an example of a paper where you could be the first person to create a cool interface to it. So there's a fun little project. And even if you're watching this a long time after it was released and everybody's been doing this for years, it's still good homework, I think, to practice on if you can. All right, I think now's a good time to have a 10-minute break, so I'll see you all back here in 10 minutes. Okay, welcome back. One thing during the break that Diego reminded us about, which I normally describe and totally forgot about this time, is Detexify, which is another really great way to find symbols you don't know about. So let's try it for that expectation symbol. With Detexify, you draw the thing. It doesn't always work fantastically well, but sometimes it works very nicely. Yeah, in this case, not quite. What about the double-line thing? It's good to know all the techniques, I guess. You'd think it could do this one; I guess part of the problem is there are so many options. Okay, in this case it wasn't particularly helpful, and normally it's more helpful than that. I mean, if we use a simple one like epsilon, I think it should be fine.
There's a lot of room to improve this app, actually, if anybody's interested in a project; I think you could make it more successful. Okay, there you go: sigma, sum, that's cool. Anyway, it's another useful thing to know about; just Google for Detexify. Okay, so let's move on with our from-the-foundations work now. We were working on trying to at least get the start of a forward pass of a linear model, or a simple multi-layer perceptron, for MNIST going, and we had successfully created a basic tensor and got some random numbers going. So what we now need to do is be able to multiply these things together: matrix multiplication. So, matrix multiplication, to remind you. We're doing MNIST, and I think we're going to use a subset. Yeah, okay, so we're going to create a matrix called m1 which is just the first five digits. So m1 will be the first five digits: five rows, and then dot dot dot, and then 784 columns, 784 because it's 28 by 28 pixels and we've flattened it out. So this is our first matrix in our matrix multiplication, and then we're going to multiply it by some weights. The weights are going to be 784 by 10 random numbers: for every one of these 784 pixels, each one is going to have a weight. So, 784 down here, 784 by 10. This first column, for example, is going to tell us all the weights we need in order to figure out whether something is a zero, and the second column will have all the weights for deciding the probability that something is a one, and so forth, assuming we're just doing a linear model. And so then we're going to multiply these two matrices together. When we multiply matrices together, we take row one of matrix one and column one of matrix two, and we take each pair in turn: we take this one and this one and multiply them together, then we take this one and this one and multiply them together, and we do that for every element-wise pair, and then we add them all up, and that gives us the value for the very first cell that goes in here. That's what matrix multiplication is. Okay, so let's go ahead and create our random numbers for the weights, since we're allowed to use random number generators now, and for the bias we'll just use a bunch of zeros to start with. The bias is just what we're going to add to each one. And so for our matrix multiplication we're going to be doing a little mini-batch here: five rows, as we discussed, so five images, flattened out, multiplied by this weights matrix. So here are the shapes: m1 is 5 by 784, as we saw, and m2 is 784 by 10. Okay, keep those in mind. So here's a handy thing: m1.shape contains two numbers, and I want to pull them out. I'm going to actually think of these as a and b rather than m1 and m2, so this is the number of rows in a and the number of columns in a. If I say ar, ac equals m1.shape, that will put 5 in ar and 784 in ac. You'll probably notice I do this a lot, this destructuring; we talked about it last week too. We can do the same for m2.shape, putting that into b-rows and b-columns, and now if I write out ar, ac, br, bc, you can again see the same things, the sizes. So that's a good way to give us the stuff we have to loop through. So now, here's our result.
Our resultant tensor: well, we're multiplying together all of these 784 things and adding them up, so the resultant tensor is going to be 5 by 10, and each thing in here is the result of multiplying and adding 784 pairs. So the result here is going to start with zeros, and it's going to contain ar rows (five rows) and bc columns (ten columns): (5, 10). Okay, so now we have to fill that in. And to do a matrix multiplication, we first have to go through each row one at a time (and here we have that, go through each row one at a time), then go through each column one at a time, and then go through each pair in that row and column one at a time. So there's going to be a loop in a loop in a loop. Here we loop over each row, here we loop over each column, and then here we loop over each column of a, which is going to be the same as the number of rows of b; you can see here ac is 784 and br is 784, they're the same, so it wouldn't matter whether we said ac or br. So then, for that row and that column, we add onto the result the product of (i, k) in the first matrix and (k, j) in the second matrix. So k is going up through those 784, going across the row of a while it goes down the column of b. So here is the world's most naive, slow, uninteresting matrix multiplication, and if we run it, okay, it's done something. We have apparently, hopefully, successfully multiplied the matrices m1 and m2. I find it a little hard to read this, because, since punch cards used to be 80 columns wide, we still assume screens are 80 columns wide, so everything defaults to 80 wide, which is ridiculous. But you can easily change it: if you use set_printoptions you can choose your own line width. We know that's 5 by 10, we did it before, so if we change the line width, okay, that's much easier to read. Now we can see here the five rows, and here are the ten columns, for that matrix multiplication. I tend to always put this at the top of my notebooks, and you can do the same thing for NumPy as well. So what I like to do (this is really important) is, when I'm working on code, particularly numeric code, I like to do it all step by step in Jupyter, and then, once I've got it working, I copy all the cells that implemented it and paste them, select them all and hit shift-M to merge them, get rid of anything that prints out stuff I don't need, put a header on the top, give it a function name, and then select the whole lot and hit Ctrl (or Cmd) right square bracket, and I've turned it into a function. But I still keep the stuff above it, so I can see all the step-by-step stuff for learning about it later. And so that's what I've done here to create this function, and this function does exactly the same things we just did. We can see how long it takes to run by using %time, and it took about half a second, which, gosh, that's a long time to generate such a small matrix. This is just to do five MNIST digits, so that's not going to be great; we're going to have to speed that up. I'm actually quite surprised at how slow that is, because if you look at it, we've got a loop within a loop within a loop, and it's only doing 39,200 of these.
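For reference, the triple-loop version being described is essentially the following. This is a minimal sketch using the same names as above; the real notebook pulls m1 from the MNIST data, so the random stand-in data here is just for illustration.

```python
import torch

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br                      # columns of a must match rows of b
    t = torch.zeros(ar, bc)              # the result: ar rows by bc columns
    for i in range(ar):                  # for each row of a
        for j in range(bc):              # ... and each column of b
            for k in range(ac):          # ... walk along the 784 pairs
                t[i, j] += a[i, k] * b[k, j]
    return t

m1 = torch.randn(5, 784)    # five flattened 28x28 digits (stand-in data)
m2 = torch.randn(784, 10)   # weights: one column of 784 weights per class
t1 = matmul(m1, m2)         # 5*10*784 = 39,200 inner steps, all in pure Python
```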
So Python, yeah, when you're just doing plain Python, it is slow. So we can't do that; that's why we can't just write Python. But there is something that kind of lets us write Python: we could instead use Numba. Numba is a system that takes Python and compiles it, basically, into machine code. And it's amazingly easy to do: you basically take a function and write @njit on top, and what it's going to do is, the first time you call this function, it's going to compile it down to machine code, and it will run much more quickly. So what I've done here is take the innermost loop, just looping through and adding up all these products (start at zero, go through and add up all of those, just for two vectors) and return it. This is called a dot product in linear algebra, so we'll call it dot. Now, Numba only works with NumPy, it doesn't work with PyTorch, so we're just going to use arrays instead of tensors for a moment. Now have a look at this: if I try to do a dot product of (1, 2, 3) and (2, 3, 4), it took a fifth of a second, which sounds terrible, but the reason it took a fifth of a second is that that's actually how long it took to compile it and run it. Now that it's compiled, the second time it just has to call it, and it's now 21 microseconds, so that's actually very fast. So with Numba we can basically make Python run at C speed. So now, the important thing to recognize is that if I replace this innermost loop in Python with a call to dot, which is running as machine code, then we now have two loops running in Python, not three. So, our 448 milliseconds from before. Well, first of all let's make sure: if I run that matmul, it should be close to t1; t1 is what we got before, remember. So when I'm refactoring, or performance improving, or whatever, I always like to put every step in the notebook and then test. This test_close comes from fastcore.test, and it just checks that two things are very similar; they might not be exactly the same because of little floating-point differences, which is fine. Okay, so our matmul is working correctly, or at least it's doing the same thing it did before. So if we now run it, it's taking 268 microseconds, versus 448 milliseconds, so it's about two thousand times faster, just by changing the one innermost loop. Really, all we've done is add @njit to make it two thousand times faster. So Numba is well worth knowing about; it can make your Python code very, very fast.
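The Numba version being described looks roughly like this: the same matmul sketch as before, but with the inner loop swapped out for a compiled dot product. It's a sketch, using NumPy arrays rather than tensors since Numba wants NumPy, and your timings will of course differ from the ones quoted in the lesson.

```python
import numpy as np
from numba import njit

@njit
def dot(a, b):
    # The innermost loop, compiled to machine code the first time it's called
    res = 0.0
    for i in range(len(a)):
        res += a[i] * b[i]
    return res

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    t = np.zeros((ar, bc))
    for i in range(ar):
        for j in range(bc):
            t[i, j] = dot(a[i, :], b[:, j])  # only the two outer loops stay in Python
    return t

dot(np.array([1., 2, 3]), np.array([2., 3, 4]))  # first call compiles; later calls are fast
```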
Okay, let's keep making it faster. We're going to use stuff that goes back to APL, and a lot of people say that learning APL taught them more about programming than anything else, so it's probably worth considering learning APL. Let's just look at these various things. a is 10, 6, minus 4. And remember, in APL we don't say equals; equals actually means equality, funnily enough. To say "set to" we use this arrow. So this is a list: 10, 6, minus 4. And then b is 2, 8, 7. And we're going to add them up: a plus b. So what's going on here? It's really important that you can think of a symbol like a as representing a tensor or an array. APL calls them arrays, PyTorch calls them tensors, NumPy calls them arrays; they're the same thing. So this is a single thing that contains a bunch of numbers, this is a single thing that contains a bunch of numbers, and this is an operation that applies to arrays or tensors. And what it does is it works what's called element-wise: it takes each pair, ten and two, and adds them together, takes each pair, six and eight, and adds them together. This is element-wise addition. And Fred's asking in the chat, how do you type in these symbols? If you just mouse over any of them, it will show you how to write it, and the one you want is the one at the very bottom, where it says "prefix". The prefix is the backtick character. So here it's saying prefix-hyphen gives us times: if I type backtick hyphen, there we go. So a backtick-dash b is a times b, for example. So yeah, they all have shortcut keys, which you learn pretty quickly, I find, and there's a fairly consistent system for those shortcut keys too. All right, so we can do the same thing in PyTorch. It's a little bit more verbose in PyTorch, which is one reason I often like to do my mathematical fiddling around in APL: I can often do it with less boilerplate, which means I can spend more time thinking, I can see everything on the screen at once, and I don't have to spend as much time trying to ignore the tensor round-brackets square-bracket dot comma blah blah. It's all cognitive load which I'd rather avoid. But anyway, it does the same thing: I can say a plus b and it works exactly like APL. So here's an interesting example: I can go a less than b, dot float, dot mean. So let's try that one over here: a less than b. This is a really important idea, which I think was invented by Ken Iverson, the APL guy, which is that true and false are represented by zero and one. And because they're represented by zero and one, we can do things to them: we can add them up and subtract them and so forth. It's a really important idea. So in this case I want to take the mean of them, and I'm going to tell you something amazing, which is that in APL there is no function called mean. Why not? Because we can write the mean function (that's four letters, m-e-a-n) from scratch with four characters. I'll show you. Here is the whole mean function: we're going to create a function called mean, and the mean is equal to the sum of a list divided by the count of a list. So this here is sum divided by count, and I have now defined a new function called mean, which calculates the mean. Mean of a-less-than-b: there we go. And in practice, I'm not sure people would even bother defining a function called mean, because it's just as easy to actually write its implementation in APL. In NumPy or whatever, in Python, it's going to take a lot more than four letters to implement mean. So anyway, it's a math notation, and being a math notation we can do a lot with very little, which I find helpful, because I can see everything going on at once. Anyhow, okay, so that's how we do the same thing in PyTorch, and again you can see that the less-than, in both cases, operates element-wise. So a less than b is saying: is 10 less than 2, is 6 less than 8, is minus 4 less than 7, and gives us back each of those trues and falses as zeros and ones. And according to the emoji in our YouTube chat, Siva's head just exploded, as it should. This is why APL is life-changing. Okay, let's now go up to higher ranks. So this here is a rank-1 tensor. A rank-1 tensor means it's a list of things; it's a vector. Whereas a rank-2 tensor is like a list of lists (they all have to be the same-length lists), or a rectangular bunch of numbers, and in math we call it a matrix.
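Just to collect the PyTorch versions of those little experiments in one place before we build the matrix, here's a minimal sketch using the same example numbers:

```python
import torch

a = torch.tensor([10., 6, -4])
b = torch.tensor([2., 8, 7])

a + b                    # element-wise addition: tensor([12., 14.,  3.])
a < b                    # element-wise comparison: tensor([False, True, True])
(a < b).float().mean()   # fraction of elements where a < b: tensor(0.6667)
```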
Okay, let's now go up to higher ranks. So this here is a rank-1 tensor. A rank-1 tensor means it's a list of things, it's a vector, whereas a rank-2 tensor is like a list of lists (they all have to be the same length), or it's like a rectangular bunch of numbers, and in math we call it a matrix. So this is how we can create a tensor containing 1 2 3 4 5 6 7 8 9. And you can see, often what I like to do is print out the thing I just created right after I created it. There are two ways to do it: you can put an enter and then write m, and that's going to do that, or if you want to put it all on the same line, that works too, you just use a semicolon. Neither one's better than the other; they're just different. So we could do the same thing in APL, and of course in APL it's going to be much easier. We're going to define a matrix called m, which is going to be a 3 by 3 tensor containing the numbers from one to nine, and there we go, that's done it in APL: a 3 by 3 tensor containing the numbers from one to nine. A lot of these ideas from APL, you'll find, have made their way into other programming languages. For example, if you use Go, you might recognize this: this is the iota character, and Go uses the word iota, so they spell it out, in a somewhat similar way. A lot of these ideas from APL have found their way into math notation and into other languages; it's been around since the late 50s. Okay, so here's a bit of fun. We're going to learn about a new thing that looks kind of crazy, called the Frobenius norm, and we'll use that from time to time as we're doing generative modeling. Here's the definition of a Frobenius norm: you go over all of the rows and columns of a matrix, take each element and square it, add them all up, and then take the square root. And so to implement that in PyTorch is as simple as going (m*m).sum().sqrt(). So this looks like a pretty complicated thing when you look at it at first; it looks like a lot of squiggly business, or if you saw this thing here you might be like, what on earth is that? Well, now you know: it's just square, sum, square root. So again, we could do the same thing in APL. In APL we're going to create something called sf. Now, it's interesting: APL does this a little bit differently. .sum() by default in PyTorch sums over everything, and if you want to sum over just one dimension you have to pass in a dimension keyword, for very good reasons. APL is the opposite: it just sums across rows or just down columns, so actually we have to say, sum up the flattened-out version of the matrix, and to say flattened-out you use comma. So here it is: sum up the flattened-out version of the matrix. Okay, so that's our sf. Oh, sorry, and the matrix is meant to be m times m; there you go. So there's the same thing: sum up the flattened-out m-times-m matrix. Another interesting thing about APL is that it's always read right to left; there's no such thing as operator precedence, which makes life a lot easier. Okay, and then we take the square root of that. There isn't a square root function, so we have to do "to the power of 0.5", and there we go, same thing. All right, you get the idea. Yes, a very interesting question here from Maribou: are the bars for norm or absolute value? And I like Siva's answer, which is that the norm is the same as the absolute value for a scalar. So in this case you can think of it as absolute value, and it's kind of not needed because it's being squared anyway, but yes, in this case, well, in every case for a scalar, the norm is the absolute value, which is kind of a cute discovery when you realize it. So thank you for pointing that out, Siva.
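And here's a tiny sketch of that Frobenius norm calculation in PyTorch, assuming the same 3 by 3 matrix m; the comparison against the built-in norm at the end is just my own sanity check, not something from the notebook.

```python
# A minimal sketch of the Frobenius norm: square, sum, square root.
import torch

m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

frob = (m * m).sum().sqrt()
# Same result as PyTorch's built-in matrix norm (Frobenius by default).
assert torch.isclose(frob, m.norm())
```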
All right, so this is just fiddling around a little bit to get a sense of how these things work. So, really importantly, you can index into a matrix, and you say rows first and then columns, and if you say colon it means all the columns. So if I say row 2, here it is: row 2 (this starts at zero; APL starts at one), all the columns. That's going to be 7 8 9. And you can see I often use comma to print out multiple things, and I don't have to say print; in Jupyter it's kind of assumed. So this is just a quick way of printing out the second row, and then here, every row, column 2. So here is every row of column 2, and here you can see 3 6 9. One thing very useful to recognize is that, for tensors of higher rank than one, such as a matrix, any trailing colons are optional. So you see this here, m[2]: that's the same as m[2, :]. That's really important to remember. Okay, so m[2], you can see the result is the same, so that means row 2, every column. So now, with all that in place, we've got quite an easy way: we don't need Numba any more, and we can get rid of that innermost loop. We're going to get rid of this loop, because it's just multiplying together each element of a row of a with the corresponding element of a column of b, and so we can just use an element-wise operation for that. So here is the i-th row of a, and here is the j-th column of b, and those are both, as we've seen, just vectors, and therefore we can do an element-wise multiplication of them and then sum them up, and that's the same as a dot product. So that's handy, and again we'll do test_close; okay, it's the same, great. And again, you'll see we kind of did all of our experimenting first, to make sure we understood how it all worked, and then put it together. And then if we time it: 661 microseconds. Okay, interesting, it's actually slower than the Numba version, which really shows you how good Numba is, but it's certainly a hell of a lot better than our 450 milliseconds, and we're using something that's a lot more general now. This is exactly the same as dot, as we've discussed, so we could just use torch.dot, and if we run that, okay, a little faster, but interestingly this is still slower than the Numba version, which is quite amazing actually. All right, so that one was not exactly a speed-up, but it's kind of a bit more general, which is nice.
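Putting that together, here's roughly what that version of matmul looks like, with the innermost loop replaced by an element-wise multiply and a sum; it's a sketch, so the argument names are just illustrative.

```python
# A rough sketch of the matmul variant described above.
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # Dot product of row i of a with column j of b, done element-wise.
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c
```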
Now we're going to get into something really fun, which is broadcasting. Broadcasting is about: what if you have arrays with different shapes? So what's a shape? The shape is the number of rows, or the number of rows and columns, or, what would you say, the number of faces and rows and columns, and so forth. So, for example, the shape of m is 3 by 3. So what happens if you multiply or add or do operations to tensors of different shapes? Well, there's one very simple case, which is, if you've got a rank-1 tensor, a vector, then you can use any operation with a scalar, and it broadcasts that scalar across the tensor. So a > 0 is exactly the same as saying a > tensor([0, 0, 0]); it's basically copying that across three times. Now it's not literally making a copy in memory, but it's acting as if we had said that, and this is the most simple version of broadcasting: it's broadcasting the zero across the 10 and the 6 and the negative 4. And APL does exactly the same thing: a < 5 gives 0 0 1, so, same idea. So we can do plus with a scalar, and we can do exactly the same thing with higher than rank one, so 2 times a matrix is just going to broadcast the 2 across all the rows and all the columns. Okay, now it gets interesting. So broadcasting dates back to APL, but a really interesting idea is that we can broadcast not just scalars: we can broadcast vectors across matrices, or broadcast any kind of lower-rank tensor across higher-rank tensors, or even broadcast together two tensors of the same rank but different shapes, in a really powerful way. And as I was exploring this, I was trying to find out where the hell this comes from (I love doing this kind of computer archaeology), and it actually turns out, from this email message in 1995, that the idea comes from a language that I'd never heard of called Yorick, which still apparently exists. Here's Yorick, and Yorick talks about broadcasting and conformability. So what happened is that this very obscure language had this very powerful idea, and NumPy has happily stolen the idea from Yorick, and it allows us to broadcast together tensors that don't appear to match. So let me give an example. Here's a tensor called c; that's a vector, a rank-1 tensor, 10 20 30. And here's a tensor called m, which is a matrix; we've seen this one before. One of them is shape (3, 3), the other is shape (3,), and yet we can add them together. Now what happened when we added them together? Well, 10 20 30 got added to 1 2 3, and then 10 20 30 got added to 4 5 6, and then 10 20 30 got added to 7 8 9. And hopefully you can see this looks quite familiar: instead of broadcasting a scalar over a higher-rank tensor, this is broadcasting a vector across every row of a matrix. And it works both ways, so we can say c + m and it gives us exactly the same thing. So let me explain what's actually happening here. The trick is to know about this somewhat obscure method called expand_as, and what expand_as does is this: it creates a new thing called t, which contains exactly the same thing as c, but expanded, kind of copied over, so it has the same shape as m. So here's what t looks like. Now t contains exactly the same thing as c does, but it's got three copies of it, and you can see we can definitely add t to m, because they match shapes, right? So we can say m + t; we know we can do m + t because we've already learned that you can do element-wise operations on two things that have matching shapes. Now, by the way, this thing t didn't actually create three copies. Check this out: if we call t.storage(), it tells us what's actually in memory, and it actually just contains the numbers 10 20 30. But it does a really clever trick: it has a stride of zero across the rows, and a size of (3, 3), and so what that means is that it acts as if it's a 3 by 3 matrix, but each time it goes to the next row it actually stays exactly where it is. And this idea of strides is the trick which NumPy and PyTorch and so forth use for all kinds of things, where you can create very efficient ways to do things like expanding, or jumping over things, or switching between columns and rows, stuff like that. Anyway, the important thing to recognize here is that we didn't actually make a copy; this is totally efficient, it's all going to be run in C code, very fast. So remember this: expand_as is critical; this is the thing that will teach you to understand how broadcasting works.
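Here's a little sketch of that expand_as and stride behavior, assuming the same c and m as above, so you can poke at it yourself.

```python
# A small sketch of the expand_as / stride behaviour described above.
import torch

c = torch.tensor([10., 20, 30])
m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

t = c.expand_as(m)   # looks like three rows of 10 20 30
t.shape              # torch.Size([3, 3])
t.stride()           # (0, 1): moving to the next row doesn't move in memory
m + t                # same result as m + c, which broadcasts for you
```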
And broadcasting is really important for implementing deep learning algorithms, or any kind of linear algebra, on any Python system, because the NumPy rules are used exactly the same in JAX, in TensorFlow, in PyTorch, and so forth. Now I'll show you a little trick which is going to be very important in a moment. If we take c, which remember is a vector containing 10 20 30, and we say .unsqueeze(0), then it changes the shape from (3,) to (1, 3). So it changes it from a vector of length three to a matrix of one row by three columns. This will turn out to be very important in a moment, and you can see how it's printed: it's printed out with two square brackets. Now, I never use unsqueeze, because I much prefer doing something more flexible, which is: if you index into an axis with the special value None, also known as np.newaxis, it does exactly the same thing; it inserts a new axis there. So here we'll get exactly the same thing: one row by all the columns, three columns. So this is exactly the same as saying unsqueeze: this inserts a new unit axis, a single row, in this dimension, and this does the same thing, so these are the same. And we could do the same thing and say unsqueeze(1), which means we're going to insert the new axis in position one, so that means we now have three rows and one column; see the shape here, it's inserting a unit axis in position one, three rows and one column. And we can do exactly the same thing here: give us every row, and a new unit axis in position one; same thing. Okay, so those two are exactly the same. So this is how we create a matrix with one row, and this is how we create a matrix with one column: c[None, :] versus c[:, None], or unsqueeze. And we don't have to say c[None, :], because, do you remember, trailing colons are optional, so therefore just c[None] is also going to give you a one-row matrix. And there's a little trick here: if you say ..., that means all of the dimensions, and so c[..., None] will always insert a unit axis at the end, regardless of what rank a tensor is. So yeah, None and np.newaxis mean exactly the same thing; np.newaxis is actually a synonym for None, if you've ever used that. I always use None, because, why not, it's short and simple. So here's something interesting. If we go c[:, None], and let's check out what c[:, None] looked like, c[:, None] is a column, and if we say expand_as(m), which is 3 by 3, then it's going to take that 10 20 30 column and replicate it, so you get three columns of 10 20 30. So remember, when you say matrix plus c[:, None], it's basically going to do this .expand_as for you. So if I want to add this matrix here to m, I don't need to say .expand_as; I just write m + c[:, None]. And this is exactly the same as doing m + c, but now, rather than adding the vector to each row, it's adding the vector to each column. So that's a really simple thing that we now get kind of for free, thanks to this really nifty approach that came from Yorick. So here you can see: m + c[None, :] is adding 10 20 30 to each row, and m + c[:, None] is adding 10 20 30 to each column.
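And here's a quick sketch of those None-indexing tricks, again assuming the same c and m; the comments just restate what was said above.

```python
# A quick sketch of the None-indexing tricks described above.
import torch

c = torch.tensor([10., 20, 30])
m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

c[None, :].shape   # torch.Size([1, 3])  one-row matrix, same as c.unsqueeze(0)
c[:, None].shape   # torch.Size([3, 1])  one-column matrix, same as c.unsqueeze(1)

m + c[None, :]     # adds 10 20 30 to each row
m + c[:, None]     # adds 10 20 30 to each column
```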
All right, so that's the basic hand-wavy version, so let's look at what the rules are and how it works. Okay, so c[None, :] is 1 by 3, and c[:, None] is 3 by 1. What happens if we multiply c[None, :] by c[:, None]? Well, think about it, because thinking is very helpful. What's going on here? What it's going to have to do is take this 10 20 30 column vector, or 3 by 1 matrix, and make it work across each of these rows, so it expands it so you get three columns of 10 20 30, just like we saw. And then it's going to do the same thing for c[None, :], so that's going to become three rows of 10 20 30. So we end up with three rows of 10 20 30 times three columns of 10 20 30, which gives us our answer, and so this is doing an outer product. So it's very nifty that you can actually do an outer product without any special functions or anything, just using broadcasting. And it's not just outer products: you can do outer boolean operations, and this kind of stuff comes up all the time. Now remember, you don't need the comma colon, so get rid of it. So here it's just showing us all the places where it's greater than; it's kind of an outer boolean, if you want to call it that. So this is super nifty, and you can do all kinds of tricks with this, because it runs very, very fast; this is going to be accelerated in C. So here are the rules. When you operate on two arrays or tensors, NumPy and PyTorch will compare their shapes. Remember, this is a shape; you can tell it's a shape because we said .shape. It goes from right to left, so that's the trailing dimensions, and it checks whether dimensions are compatible. They're compatible if they're equal; so, for example, if we say m * m, then those two shapes are compatible, because in each case it's just going to be three, so they're equal. So if the shape in that dimension is equal, they're compatible, or if one of them is one; and if one of them is one, then that dimension is broadcast to make it the same size as the other. That's why the outer product worked: we had a 1 by 3 times a 3 by 1, and so this one got copied three times to make it this long, and this one got copied three times to make it this long. Okay, so those are the rules. And the arrays don't have to have the same number of dimensions. So here's an example that comes up all the time: let's say you've got a 256 by 256 by 3 array or tensor of RGB values, so you've got an image, in other words a color image, and you want to normalize it, so you want to scale each color channel in the image by a different value. So this is how we normalize colors: one way is that you could multiply (or divide, or whatever) the image by a one-dimensional array with three values. So you've got a 1-D array, so that's just (3,), and then the image is 256 by 256 by 3, and we go right to left and we check, are they the same? And we say yes, they are. And then we keep going left, and if a dimension is missing, we act as if it's one; and we keep going, and if it's missing, we act as if it's one. So this is going to be the same as doing 1 by 1 by 3, and so these three elements will be broadcast over all 256 by 256 pixels. So this is a super fast and convenient and nice way of normalizing image data with a single expression, and this is exactly how we do it in the fastai library, in fact.
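As a concrete sketch of that normalization example, here's what it might look like with a random stand-in image and made-up per-channel values; the names img and scale are just illustrative.

```python
# A hedged sketch of normalising an image with broadcasting, assuming a
# 256x256x3 RGB tensor and made-up per-channel scaling values.
import torch

img = torch.rand(256, 256, 3)            # H x W x C image
scale = torch.tensor([0.5, 1.0, 2.0])    # one value per colour channel

# Shapes compared right to left: (256, 256, 3) vs (3,) -> treated as (1, 1, 3),
# so the three values are broadcast over every pixel.
normed = img * scale
normed.shape                             # torch.Size([256, 256, 3])
```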
So we can use this to dramatically speed up our matrix multiplication. Let's just grab a single digit, just for simplicity. I really like doing this in Jupyter notebooks, and if you build Jupyter notebooks to explain stuff that you've learnt in this course, or ways that you can apply it, consider doing this for your readers, but add a lot more prose. I haven't added prose here because I want to use my voice; in our book that we published, for example, it's all written in notebooks and there's a lot more prose, obviously. But really, I like to show every example all along the way, as simply as possible. So let's just grab a single digit. Here's the first digit, and its shape is a 784-long vector. Okay, and remember that our weight matrix is 784 by 10. So if we say digit[:, None].shape, then that is a 784 by 1 matrix, a single column. So there's our matrix, and if we then take that 784 by 1 and expand_as(m2), it's going to be the same shape as our weight matrix. So it's copied our image data for that digit across all of the 10 vectors representing the 10 linear projections we're doing for our linear model. And that means that we can take digit[:, None], so 784 by 1, and multiply it by the weights, and that's going to get us back 784 by 10. And what it's doing, remember, is basically looping through each of these 10 784-long vectors, and for each one of them it's multiplying it by this digit. So that's exactly what we want to do in our matrix multiplication. Most recently, we had this dot product where we were actually looping over j, which was the columns of b. We don't have to do that any more, because we can do it all at once by doing exactly what we just did. So we can take the i-th row, all the columns, add an axis to the end, and then, just like we did here, multiply it by b and sum it down the rows, and that is, again, exactly the same thing; that is another matrix multiplication, doing it using broadcasting. Now this is tricky to get your head around, so if you haven't done this kind of broadcasting before, it's a really good time to pause the video and look carefully at each of these four cells and understand: what did I do there, why did I do it, what am I showing you? And then experiment. And remember that we started with m1[0], right, just like we have here with a[i]. So that's why we've got a[i, :, None], because this digit is actually m1[0], so this is like m1[0][:, None]. So this line is doing exactly the same thing as this here, plus a sum. So let's check this matmul is the same as it used to be, that it's still working, and the speed of it. Okay, not bad: 137 microseconds. So we've now gone from about 500 milliseconds to about 0.1 milliseconds. Funnily enough, now I think about it, my MacBook Air is an M2, whereas this Mac mini is an M1, so it's a little bit slower; my Air was a bit faster than 0.1 milliseconds. So overall we've got about a 5,000 times speed improvement, which is pretty exciting. And since it's so fast now, there's no need to use a mini-batch any more. If you remember, we used a mini-batch of, where is it, of five images, but now we can actually use the whole dataset, it's so fast. So now we can do the whole dataset; there it is, we've now got 50,000 by 10, which is what we want, and it's taking us only 656 milliseconds now to do the whole dataset.
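To recap, here's a rough sketch of that broadcast matmul; m1 and m2 here are random stand-ins with the same shapes as a few digits and the 784 by 10 weight matrix, just so the code is runnable on its own.

```python
# A rough sketch of the broadcast matmul described above; m1 and m2 are
# random stand-ins, not the notebook's actual data.
import torch

m1 = torch.rand(5, 784)     # a few stand-in "digits"
m2 = torch.rand(784, 10)    # stand-in weight matrix

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # a[i, :, None] is 784x1; broadcasting it against b (784x10) and
        # summing down the rows gives row i of the result, all at once.
        c[i] = (a[i, :, None] * b).sum(dim=0)
    return c

assert torch.allclose(matmul(m1, m2), m1 @ m2, atol=1e-4)
```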
So this is actually getting to a point now where we could start to create and train some simple models in a reasonable amount of time, so that's good news. All right, I think that's probably a good time to take a break. We don't have too much more of this to go, but I don't want to keep you guys up too late. So hopefully you learnt something interesting about broadcasting today. I cannot overemphasize how widely useful this is in all deep learning and machine learning code; it comes up all the time. It's basically our number one most critical foundational operation. So yeah, take your time practicing it, and also good luck with your diffusion homework from the first half of the lesson. Thanks for joining, and I'll see you next time.