Hi everybody, nice to see you all here. Can you guys all hear me okay? I don't have too much logistics stuff to mention other than that, well, we'll see what happens. I have a feeling this course is going to go a lot longer than I expected. So just putting that out there to warn you right now. It could be more of a marathon than we originally thought, which may require having some breaks in the middle or something. Anyway, we've got a lot of stuff to cover, and I don't want to hurry; I want to do it all carefully and properly. So I decided rather than hurrying, we'll just do what it takes. Alright, so I think we are ready to get into it. Never-ending course, exactly, Sam. Never-ending story. Hi everybody, welcome to lesson 11. This is the third lesson in part two, depending on how you count things. There's been a lesson 9A and a lesson 9B, so it's kind of the fifth lesson in part two. I don't know what it is. So we'll just stick to calling it lesson 11 and avoid getting too confused. I'm already confused. My goodness, I've got so much stuff to show you. I'm only going to show you a tiny fraction of the cool stuff that's been happening on the forum this week, but it's been amazing. I'm going to start by sharing this beautiful video from John Robinson, I should say. I've never seen anything like this before. As you can see, it's very stable and it's really showing this beautiful movement between seasons. So what I did on the forum was I said to folks, hey, you should try interpolating between prompts, which is what John did. And I also said you should try using the last image of the previous prompt interpolation as the initial image for the next prompt. And anyway, here it is, came out beautifully. John was the first to get that working, so I was very excited about that. And the second one I wanted to show you is this really amazing work from Sebastian Derhy, who did something that I'd been thinking about as well.
I'm really thrilled that he also thought about this. He noticed that this update we do, unconditional embeddings plus guidance times (text embeddings minus unconditional embeddings), has a bit of a problem, which is that it gets big. To show you what I mean by "it gets big", imagine that we've got a couple of vectors, right? Oops, that's not a vector. Let's try that again. We've got a couple of vectors on this chart here. Okay. So we've got the original unconditional piece here, so we've got U. Let's say this is U. Okay. And then we add to that some amount of T minus U. So if we've got, like, T, let's say it's huge, right, and we've got U again, then the difference between those is the vector which goes here, right? Now you can see here that if there's a big difference between T and U, then the eventual update which actually happens is (oopsie-daisy, I thought that was going to be an arrow, let's try that again) far bigger than the original update. And so it jumps too far. So this idea is basically to say, well, let's make it so that the update is no longer than the original unconditioned update would have been. We're going to be talking more about norms later, but basically we scale it by the ratio of the norms. And what happens is we start with this astronaut and we move to this astronaut. It's a subtle change, but you can see there's a lot more (before, after, before, after) texture in the background. And on the Earth there's a lot more detail, before, after. You see that? And even little things like the bridle and reins, which before were pretty flimsy, now look quite proper. So it's made quite a big difference just to get this scaling correct.
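To make the idea concrete, here's a minimal sketch in PyTorch. The names are made up for illustration (`u` standing for the unconditional noise prediction, `t` for the text-conditioned one); a real pipeline applies this per image inside the sampling loop, but the scaling itself is just this:

```python
import torch

def cfg(u, t, g=7.5):
    # standard classifier-free guidance update: unconditional prediction
    # plus g times (text-conditioned minus unconditional)
    return u + g * (t - u)

def cfg_rescaled(u, t, g=7.5):
    # Sebastian's fix as described above: after the guided update, scale
    # the result back so its norm matches the unconditional prediction's
    pred = cfg(u, t, g)
    return pred * (torch.linalg.norm(u) / torch.linalg.norm(pred))
```

Note that the rescaling keeps the direction of the guided prediction and only fixes its length, which is why the image changes subtly rather than completely.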
So another example. There are a couple of other things that Sebastian tried, which I'll explain in a moment, but you can see how some of them actually resulted in changing the image. And this one's actually important, because the poor horse used to be missing a leg, and now it's not missing a leg, so that's good. And here's the detailed one with its extra leg. So how did he do this? Well, he started with this: the unconditioned prompt plus the guidance times the difference between the conditioned and unconditioned. And then, as we discussed, the next version we saw is to basically just take that prediction and scale it according to the difference in the lengths; the norms are basically the lengths of the vectors. And so this is the second one, like I did in lesson nine. You'll see it's gone from here, so when we go from 1a to 1b, you can see, look at this, this boot's gone from nothing to having texture. I don't know whatever the hell this thing is; suddenly it's got texture. And look, we've now got proper stars in the sky. It's made a really big difference. And then the second change is not just to rescale the whole prediction, but to rescale the update. And when we rescale the update, it actually, not surprisingly, changes the image entirely, because we're now changing the direction it goes. And so, I don't know, is this better than this? I mean, maybe, maybe not. But you know, I think so, particularly because this was the difference that added the correct fourth leg to the horse before. And then we can do both: we can rescale the difference and then rescale the result, and then we get the best of both worlds. As you can see, big difference. We get a nice background. This weird thing on his back's actually become an arm. That's not what a foot looks like; that is what a foot looks like. So these little details make a big difference, as you can see.
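Putting the variants together, a sketch like this shows the difference between rescaling the update and rescaling the result. The names are mine, and my reading of "rescale the update" is to give the difference the same norm as the unconditional prediction; Sebastian's actual code may differ in the details:

```python
import torch

def guided(u, t, g=7.5, rescale_update=False, rescale_result=False):
    # u: unconditional noise prediction, t: text-conditioned prediction
    diff = t - u
    if rescale_update:
        # rescale the update itself: this changes the *direction* of the
        # final prediction, which is why the image can change entirely
        diff = diff * (torch.linalg.norm(u) / torch.linalg.norm(diff))
    pred = u + g * diff
    if rescale_result:
        # rescale the whole prediction back to the unconditional norm
        pred = pred * (torch.linalg.norm(u) / torch.linalg.norm(pred))
    return pred  # both flags on gives the "best of both worlds" version
```

With both flags off this reduces to standard classifier-free guidance.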
So these are one, or really two, really cool new things. New things tend to have wrinkles, though. Wrinkle number one is that after I shared Sebastian's approach on Twitter, Ben Poole, who's at Google Brain I think, if I remember correctly, pointed out that this already exists. He thinks it's the same as what's shown in this paper, which is a diffusion model for text-to-speech. I haven't read the paper yet to check whether it's got all the different options or whether it's checked them all out like this. So maybe this is reinventing something that already existed and bringing it into a new field, which would still be interesting. Anyway, hopefully folks on the forum can help figure out whether this paper's actually showing the same thing or not. And then the other interesting thing was that John Robinson got back in touch on the forum and said, oh, actually, that tree video doesn't do what we think it does at all. There's a bug in his code, and despite the bug it accidentally worked really well. So now we're in this interesting position of trying to figure out, like, how did he create such a beautiful video by mistake? So: reverse engineering exactly what the bug did, and then figuring out how to do that more intentionally. And this is great, right? It's really good to have a lot of people working on something, and the bugs often, yeah, they tell us about new ideas. So that's very interesting. So watch this space: we'll find out what John actually did and how come it worked so well. And then there's something that I just saw like two hours ago on the forum, which I'd never thought of before, though I'd thought of something a little bit similar. Rekil Prashanth said, well, what if we took this... So as you can see, all the students are really bouncing ideas off each other. He's like, oh, it's interesting, we're doing different things with the guidance scale.
What if we take the guidance scale and, rather than keeping it at 7.5 all the time, we reduce it? And this is a little bit similar to something I suggested to John a few weeks ago, where he was doing some stuff with modifying gradients based on additional loss functions, and I said maybe you should just use them occasionally at the start. Because I think the key thing is, once the model kind of knows roughly what image it's trying to draw, even if it's noisy, you know, you can let it do its thing. And this is exactly what's happening here. Rekil's idea is to say, well, let's decrease the guidance scale, so that by the end it's basically zero. And so once it's going in the right direction, we let it do its thing. So this little doggy is with the normal 7.5 guidance scale. Now have a look, for example, at its eye here. It's pretty uninteresting, pretty flat. And if I go to the next one, as you can see, actually look at the eye. That's a proper eye now; before it was totally glassy black. Or, like, look at all this fur: very textured, where previously it was very out of focus. So this is again a new technique. So I love this. You folks are trying things out, and some things are working and some things are not working, and that's all good. I kind of feel like you're going to have to slow down, because I'm having trouble keeping up with you all. But apart from that, this is great. Good work. I also wanted to mention, on a different theme, to check out Alex's notes on the lesson, because I thought he's done a fantastic job of showing how to study a lesson. What Alex did, for example, was make a list in his notes of all the different steps we did as we worked through the foundations: what library each thing comes from, links to the documentation. And I know that Alex's background actually is history, not computer science.
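Going back to Rekil's decreasing-guidance idea for a moment, in code it might look something like this. The linear decay to zero is my assumption about the schedule shape (his actual schedule may be different), and the loop is just a skeleton:

```python
import numpy as np

def guidance_schedule(n_steps, g_start=7.5, g_end=0.0):
    # guidance scale per sampling step: the usual 7.5 at the start,
    # decaying to roughly zero by the end so the model "does its thing"
    return np.linspace(g_start, g_end, n_steps)

# inside the sampling loop, instead of a constant g:
# for i in range(n_steps):
#     g = guidance_schedule(n_steps)[i]
#     pred = u + g * (t - u)
```

The intuition from above is that heavy guidance matters early, while the model is deciding roughly what to draw, and mostly gets in the way later.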
For somebody like Alex moving into a different field, this is a great idea, particularly to be able to look at what are all the things that I'm going to have to learn and read about. And then he did something which we always recommend, which is to try the lesson on a new dataset. And he very sensibly picked the Fashion MNIST dataset, which is something we'll be using a lot in this course, because it's a lot like MNIST and it's just different enough to be interesting. And he described in his notes how he went about doing that. And then something else I thought was interesting, at the very end of his notes, was that he jotted down my tips. It's very easy when I drop a tip like that to think, oh, that's interesting, that's good to know, and then it can disappear. So here's a good way to make sure you don't forget about all the little tricks. And I think I've put those notes in the forum wiki, so you can check them out if you'd like to learn from them as well. So I think this is a great role model. Good job, Alex. Okay, so during the week Jono told us about a new paper that had just come out called DiffEdit, and he thought it was an interesting paper, and I thought it might be good practice for us to try reading this paper together. So let's do that. So here's the paper, DiffEdit, and you'll find that probably the majority of papers that you come across in deep learning will take you to arXiv. arXiv is a preprint server, so these are papers that have not been peer reviewed. I would say in our field we don't generally, or I certainly don't generally, care about that at all, because we have code, we can try it, we can see whether it works or not. Most papers are very transparent about here's what we did and how we did it, and you can replicate it, and it gets a huge amount of peer review on Twitter, so if there's a problem, generally within 24 hours somebody has pointed it out.
So we use arXiv a lot, and if you wait until a paper's been peer reviewed, you'll be way out of date, because this field is moving so quickly. So here it is on arXiv, and we can read it by clicking on the PDF button. Actually, I don't do that; instead I click on this little button up here, which is the "save to Zotero" button. I figured I'd show you my preferred workflow. You don't have to do the same thing, there are different workflows, but here's one that I find works very well. Zotero is a piece of free software that you can download for Mac, Windows and Linux, and you install a Chrome connector. Oh, Tanishq is saying the button is covered. Alright, so in my taskbar... sorry, not taskbar, Chrome menu bar, I have a button that I can click that says save to Zotero, and when I click it, I'll show you what happens. After I've downloaded this, the paper will automatically appear here in this software, which is Zotero. And so here it is, DiffEdit, and you can see it's got the abstract, the authors, where it came from, and so on. So later on, if I want to check some detail, I can go back and see the URL; I can click on it and it pops up. In this case what I'm going to do is double-click on it, and that brings up the paper. Now, the reason I like to read my papers in Zotero is that I can annotate them, edit them, tag them, put them in folders and so forth, and also add them to my reading list directly from my web browser. As you can see, I started this fast diffusion folder, which is actually a group library which I share with the other folks working on this fast diffusion project that we're all doing together, so we can all see the same paper library. Maribu on the YouTube chat is asking, is this better than Mendeley? Yeah, I used to use Mendeley and it's kind of gone downhill. I think Zotero is far, far better, but they're both very similar. Okay, so when you double-click on it, it opens up, and here is the paper. Reading a paper is always extremely intimidating, and
so you just have to do it anyway. You have to realize that your goal is not to understand every word. Your goal is to understand the basic idea well enough that, for example, when you look at the code (and most papers these days do have code), you'll be able to see how the code matches up to the paper, and so that you can try writing your own code to implement parts of it yourself. So over on the left you can open up the sidebar here. I generally open up the table of contents and get a bit of a sense of, okay, there are some experimental results, some theoretical results, an introduction, related work, something about this new DiffEdit thing, some experiments. So that's a pretty standard structure that you would see in papers. I would always start with the abstract. Okay, so what's it saying this does? Generally there's going to be a background sentence or two about how interesting this field is; it's just saying, well, image generation is cool, which is fine. And then they're going to tell us what they're going to do, which is that they're going to create something called DiffEdit. And so this is... what is it for?
It's going to use text-conditioned diffusion models. We know what those are now; that's what we've been using, where we type in some text and get back an image that matches the text. But this is going to be different: it's for the task of semantic image editing. We don't know what that is yet, so let's put that aside and make sure we understand it later. The goal is to edit an image based on a text query. Oh, okay, so we're going to edit an image based on text. How on earth would you do that? Ah, they're going to tell us right away what this is. Semantic image editing is an extension of image generation with an additional constraint, which is that the generated image should be as similar as possible to the given input. And generally, as they've done here, there's going to be a picture that shows us what's going on. In this picture you can see an example: here's an input image, and originally it was attached to a caption, "a bowl of fruits". Okay. We want to change this into a bowl of pears, so we type "a bowl of pears" and it generates, oh, a bowl of pears. Or we could change it from a bowl of fruits to a basket of fruits, and oh, it's become a basket of fruits. Okay, so I think I get the idea, right? What it's saying is that we can edit an image by typing what we want that image to represent. This actually looks a lot like the paper that we looked at last week, so that's cool. So the abstract says that current ways of doing this require you to provide a mask. That means you have to basically draw the area you're replacing, which sounds really annoying. But the main contribution of this paper is that they automatically generate the mask, so you simply type in the new query and get the new image. That sounds actually really impressive. So if you read the abstract and you think, I don't care about doing that, then you can skip the paper, you know. Or look at the results, and if the results don't look impressive, then
just skip the paper. So that's your first decision point, where we can be like, okay, we're done. In this case this sounds great and the results look amazing, so I think we should keep going. Okay, it achieves state-of-the-art performance, of course, fine. Okay, so the introduction to a paper is going to try to give you a sense of what they're trying to do, and this first paragraph here is just repeating what we've already read in the abstract and what we see in figure one. It's saying that we can take a text query like "a basket of fruits"; see the examples. Alright, fine, we'll skip through that. The key thing about academic papers is that they are full of citations. You should not expect to read all of them, because if you do, then each of those citations is full of citations, and those are full of citations, so before you know it you've read the entire academic literature, which has taken you 5,000 years. So for now let's just recognize that it says text-conditional image generation is undergoing a revolution, here are some examples. Fine, we actually already know that. Okay, that's called DALL-E; latent diffusion, that's what we've been using; that's called Imagen apparently. So we kind of know that. Generally there's this, like, okay, we already agree it's important, so we can skip through it pretty quickly. Vast amounts of data are used, yes, we know. Okay, diffusion models are interesting, yes, we know that. They denoise starting from Gaussian noise, we know that. So you can see, there's a lot of stuff that, once you're in the field, you can skip over pretty quickly. You can guide it using CLIP guidance, yeah, that's what we've been doing, we know about that. Oh wait, this is new: or by inpainting, by copy-pasting pixel values outside a mask. Alright, so there's a technique that we haven't done, but I think it makes a lot of intuitive sense: during that diffusion process, if there are some pixels you don't want to change, such as all the ones that aren't orange here,
you can just paste them in from the original after each stage of the diffusion. Alright, that makes perfect sense. If I want to know more about that I could always look at this paper, but I don't think I do for now. Okay, and again it's just repeating something they've already told us: that these approaches require us to provide a mask, so that's a bit of a problem. And then, you know, this is interesting: it also says that when you mask out an area, that's a problem, because if you're trying to, for example, change a dog into a cat, you want to keep the animal's colour and pose. So this is a new technique which is not deleting a section and replacing it with something else; it's actually going to take advantage of knowledge about what that thing looked like. So that's two cool new things. Hopefully at this point we know what they're trying to achieve. If you don't know what they're trying to achieve when you're reading a paper, the paper won't make any sense, so again, that's a point where you should stop. Maybe this is not the right time to be reading this paper; maybe you need to read some of the references; maybe you need to look more at the examples. You can always skip straight to the experiments. I often skip straight to the experiments; in this case I don't need to, because they've put enough experiments on the very first page for me to see what it's doing. So yeah, don't always read it from top to bottom. Okay, so they've got some examples of conditioning a diffusion model on an input without a mask. For example, you can use a noised version of the input as a starting point. Hey, we've done that too. So as you can see, we've already covered a lot of the techniques that they're referring to here. Something we haven't done, but that makes a lot of sense, is that we can use the distance to the input image as a loss function. Okay, that makes sense to me, and there are some references here. Alright, so we're going to create this new thing called DiffEdit, it's going to be amazing, wait till you check
it out, okay, fine. Okay, so that's the introduction. Hopefully you found that useful for understanding what we're trying to do. The next section is generally called related work, as it is here, and that's going to tell us about other approaches. So if you're doing a deep dive, this is a good thing to study carefully. I don't think we're going to do a deep dive right now, so I think we can happily skip over it. We can do a quick glance, like: oh, image editing includes colorization, retouching, style transfer, lots of interesting topics. I'm definitely getting more excited about this idea of image editing. And there are some different techniques: you can use CLIP guidance, okay, that can be computationally expensive; we can use diffusion for image editing, okay, fine; we can use CLIP to help us. So there's a lot of repetition in these papers as well, which is nice, because we can skip over it pretty quickly. More about the high computational costs. Okay, so they're saying this is going to be not so computationally expensive; that sounds hopeful. And often the very end of the related work is the most interesting, as it is here, where they've talked about how somebody else has done something concurrent to theirs: somebody else working at exactly the same time, looking at a somewhat different approach. Okay, so I'm not sure we learned too much from the related work, but if you were trying to do the very, very best possible thing, you could study the related work and get the best ideas from each. Okay, now: background. So this is where it starts to look scary, I think we can all agree, and this is often the scariest bit, the background. This is basically saying, mathematically, here's how the problem that we're trying to solve is set up. And so we're going to start by looking at denoising diffusion probabilistic models, DDPM. Now, if you've watched lesson 9B with Wasim and Tanishq, then you've already seen some of the math of DDPM, and the important thing to recognize is that basically no one in the world, pretty much, is
going to look at these paragraphs of text and these equations and go, oh, I get it, that's what DDPM is. That's not how it works, right? To understand DDPM you would have to read and study the original paper, and then you would have to read and study the papers it's based on, and talk to lots of people, and watch videos, and go to classes just like this one, and after a while you'll understand DDPM. And then you'll be able to look at this section and say, oh, okay, I see, they're just talking about this thing I'm already familiar with. So this is meant to be a reminder of something that you already know; it's not something you should expect to learn from scratch. So let me take you through these equations somewhat briefly, because Wasim and Tanishq have kind of done them already, and because every diffusion paper is pretty much going to have these equations. Okay, I'm just going to read something that Jono's pointed out in the chat. He says it's worth remembering the background is often written last and tries to look smart for the reviewers, which is correct, so feel free to read it last too. Yeah, absolutely. I think the main reason to read it is to find out what the different letters mean, what the different symbols mean; they'll probably be referred to later. But in this case I want to actually take this as a way to learn how to read math. So let's start with this very first equation, which, how on earth do you even read this? The first thing I'll say is that this is not an E; it's a weird-looking E, and the reason it's a weird-looking E is that it's a Greek letter. What I recommend to students is that you learn the Greek alphabet, because it's much easier to be able to actually read this to yourself. Here's another one, right? If you don't know that's called theta, I guess you have to read it as, like, "circle with line through it"; it's just going to get confusing trying to read an equation where you can't actually say it out loud. So what I suggest is that you learn the Greek alphabet, and let me
find the right place. It's very easy to look it up just on Wikipedia. Here's the Greek alphabet, and if we go down here you'll see they've all got names, and we can try and find our one, the curvy E. Okay, here it is: epsilon. And circle with a line through it: theta. So practice and you will get used to recognizing these. So you've got epsilon, theta, and this is just a weird curly L, which is used for the loss function. Okay, so how do we find out what this symbol means? Well, there are a few ways to do it. One way, which is kind of cool, is we can use a program called Mathpix. Here we are, Mathpix. What it does is you basically select anything on your screen and it will turn it into LaTeX. So that's one way: you can select it on the screen and it turns it into LaTeX, and the reason it's good to turn it into LaTeX is that LaTeX is written as actual text that you can search for on Google. So that's technique number one. Technique number two is that you can download the other formats of the paper; that page will have a "download source" option, and if we say download source, then we'll be able to actually open up that LaTeX and have a look at it. So we'll wait for that to download. While that's happening, let's keep moving along. In this case we've got these two bars, so can we find out what that means? We could try a few things. We could try searching for "two bars maybe math notation". Oh, here we are, that looks hopeful: "what does this mean in mathematics", and here there's a glossary of mathematical symbols, and here there's a meaning of this in math, so that looks hopeful. Okay, so it definitely doesn't look like this; it's not between two sets of letters. Ah, but it is around something. That looks hopeful. So it looks like we found it: it's a vector norm. So then you can start looking these things up. We can search for "norm", or maybe "vector norm", and once you can actually find the term, then we know what to look for. Okay, so in our case
we've got the bars surrounding all this stuff, and then there are twos here and here. What's going on here? Alright, if we scroll through... oh, this is pretty close actually. Okay, so two bars can be a matrix norm; otherwise a single bar is for a vector norm, and in particular it looks like we don't have to worry too much about whether it's one or two bars. Oh, and here's the definition. Oh, that's handy. So we've got the two down here. Alright, so it's equal to... ah, the root of the sum of squares. So that's good to know: this norm thing means the root of the sum of squares. But then we've got a two up here as well, and that just means squared. Ah, so this is the root of the sum of squares, squared, and the square of the square root is just the thing itself. So actually this whole thing is just the sum of squares. It's a bit of a weird way to write it, in a sense; we could perfectly well have just written it as the sum of whatever-it-is squared. Fine, but there we go. Okay, and then what about this thing here, the weird E thing? How would you find out what the weird E thing is? Oh my goodness, still downloading. That's crazy, 20k per second. That's strange, I wonder why that's taking so long. Alright, maybe if we search for it... copy... no, it's just searching for a plain E, rotten thing. "fancy e", maybe, "fancy e math symbol", "weird e letter"... oh, it's finished, great. Okay, so our LaTeX has finally finished downloading, and if we open it up we can find there's a .tex file in here. Here we are, main.tex, so we'll open it. It's not the most amazingly smooth process, but what we can do is say, okay, it's just after it says "minimizing the denoising objective". So let's search for "minimizing the"... oh, here it is, minimizing the denoising objective. So here's the LaTeX; let's get them both on the screen at the same time. Okay, so here it is: \mathcal{L} = \mathbb{E}, then x_0, t, epsilon, okay, and here's that vertical bar thing, \| epsilon minus epsilon_theta of x_t, and then
the bar thing, two, two. Alright, so the thing that's new is \mathbb{E}. Okay, so finally we've got something we can search for: "mathbb E". Ah, fantastic: what does \mathbb{E} mean? It's the expected value operator. Aha, fantastic. Alright, so it takes a bit of fussing around, but you want to get Mathpix working, or, actually, another thing you could try, because Mathpix is ridiculously expensive in my opinion, is a free alternative called pix2tex, which is actually a Python thing. You could even have fun playing with it, because the whole thing is just a PyTorch Python script, and it even describes how it uses a transformer model, and you can train it yourself in Colab and so forth. But basically, as you can see, you can snip and convert to LaTeX, which is pretty awesome, so you could use this instead of paying the Mathpix guys. Anyway, we're on the right track now, I think. So, expected value. Then we can start reading about what expected value is, and you might actually remember that, because we did a bit of it in high school, at least in Australia we did. It's basically like... let's maybe jump over here. The expected value of something is saying: what's the likely value of that thing? So for example, let's say you toss a coin, which could be heads or tails, and you want to know how often it's heads. Maybe we'll call heads 1 and tails 0. So you toss it and you get 1, 0, 0, 1, 1, 0, 1, 0, 1, okay, and so forth. And then you can calculate the mean of that, right? So if that's x, you can calculate x bar, the mean, which would be the sum of all that divided by the count of all that. So that's one, two, three, four, five ones, divided by one, two, three, four, five, six, seven, eight, nine tosses. Okay, so that would be the mean. But the expected value is like, well, what do you expect to happen? And we can calculate that by adding up, over all of the possibilities (for each possibility, which we'll call x): how likely is x, and
what score do you get if you get x? So in this example of heads and tails, our two possibilities are that we either get heads or we get tails. For the version where x is heads, the probability is 0.5 and the score is 1. And then what about tails? For tails, the probability is 0.5 and the score is 0. And so overall the expected value is 0.5 times 1 plus 0.5 times 0, which is 0.5. So our expected score if we're tossing a coin is 0.5, if getting heads is a win. Let me give you another example. Let's say that we're rolling a die, and we want to know what the expected score is if we roll a die. So again, we could roll it a bunch of times and see what happens, okay, and we could sum all that up, just like before, and divide it by the count, and that would tell us the mean for this particular sample. But what's the expected value more generally? Again, it's the sum, over all the possibilities, of the probability of each possibility times its score. The possibilities for rolling a die are that you can get a 1, a 2, a 3, a 4, a 5 or a 6. The probability of each one is a sixth, and the score that you get is just the number itself. So we multiply all these together and sum them up, which will be 1/6 plus 2/6 plus 3/6 plus 4/6 plus 5/6 plus 6/6, and that gives you the expected value of rolling a die, which comes to 21/6, or 3.5. So that's what expected value means. Alright, so that's a really important concept that's going to come up a lot as we read papers. And in particular, this subscript is telling us what are all the things that we're averaging over, what the expectation is over. And so there's a whole lot of letters here. You're not expected to just know what they are; in each paper they could mean totally different things, so you have to look immediately underneath, where they'll be defined. So x_0 is an image, it's an input image, and epsilon is the noise,
and the noise has a mean of 0 and a standard deviation of I, which, if you watched lesson 9B, you'll know is like a standard deviation of 1 when you're dealing with multiple normal variables. Okay, this is kind of confusing: epsilon just on its own is a normally distributed random variable, so it's just grabbing random numbers, but epsilon theta is a noise estimator. That means it's a function; you can tell it's a function because it's got these parentheses right next to it. And presumably most functions in these papers are neural networks. Okay, so we're finally at a point where this is actually going to make perfect sense: we've got the noise, we've got the prediction of that noise, we subtract one from the other, we square it, and we take the expected value. In other words, this is mean squared error. So wow, that's a lot of fiddling around to find out that this whole thing here means mean squared error. The loss function is the mean squared error, and unfortunately I don't think the paper ever says that; it says "minimizing the denoising objective L-something-something". But anyway, we got there eventually. Fine. As well as learning about x naught, we also learn here about x t: x t is the original un-noised image times some number, plus some noise times one minus that number. Hopefully you'll recognise this from lesson 9B; this is the thing where we reduce the value of each pixel and we add noise to each pixel. So that's that. Alright, I'm not going to keep going through it, but you basically get the idea: once you know what you're looking for, the equations do actually make sense. But remember, all of this is background; this is telling you what already exists. This is telling you what a DDPM is, and then it tells you what a DDIM is. DDIM, just think of it as a more recent version of DDPM; it makes some very minor changes to the way things are set up, which allow us to go faster. Okay, so
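Written out, the loss just described looks like this. This is my reconstruction from the notation defined under the equation, using the DDPM paper's usual symbols for the "some number" weighting, so treat the exact weights as an assumption:

```latex
% Mean squared error between the true noise and the predicted noise:
L = \mathbb{E}_{x_0, \epsilon}\left[\, \lVert \epsilon - \epsilon_\theta(x_t) \rVert^2 \,\right]

% where the noised image mixes the original image and the noise:
% x_t = \sqrt{\bar\alpha_t}\, x_0 \;+\; \sqrt{1 - \bar\alpha_t}\, \epsilon
```

That is: scale each pixel down (the "times some number" part), add scaled noise, ask the network to predict the noise, and penalise the squared difference.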
the thing is, though, once we keep reading, what you'll find is that none of this background actually matters. But I thought we'd go through it just to get a sense of what's in a paper. So for the purpose of our background, it's enough to know that DDPM and DDIM are kind of the foundational papers on which diffusion models today are based. Okay. So the encoding process, which encodes an image onto a latent variable, and this is basically adding noise, is called DDIM encoding, and the thing that goes from the input image to the noised image is what we're going to call capital E sub r, where r is the encoding ratio: basically, how much noise are we adding. If you use small steps, then decoding that, so going backwards, gives you back the original image. Okay, so that's some of the stuff we've learned about; that's what diffusion models are. Alright, so this looks like a very useful picture, so let's take a look. What is DiffEdit? DiffEdit has three steps. Step one: we add noise to the input image. That sounds pretty normal; here's our input image x0, and we add noise to it, fine. And then we denoise it. But we denoise it twice. One time we denoise it using the reference text R, "horse", or this special symbol here, which means nothing at all; so either unconditional or "horse". Alright, so we do it once using the word "horse": we take this, estimate the noise, and then we can remove that noise on the assumption that it's a horse. Then we do it again, but the second time, when we calculate the noise, we pass in our query Q, which is "zebra". Wow, those are going to be very different noises. The noise for "horse" is just going to be literally these Gaussian pixels, these dots, because it is a horse. But if the claim is "no, no, this is actually a zebra", then all of these pixels here are all wrong; they're all the wrong colour. So the noise that's calculated if we say this is our query is going to be totally different to the
noise if we say this is our query. And so then we just take one minus the other, and here it is: we derive a mask based on the difference in the denoising results, and then we binarise it, so basically turn it into 1s and 0s. That's actually the key idea, and it's a really cool one: once you have a trained diffusion model, you can do inference on it where you tell it the truth about what the thing is, and then you can do it again but lie about what the thing is. In your lying version, it's going to say, "okay, all the stuff that doesn't match zebra must be noise". So the difference between the noise prediction when you say "hey, it's a zebra" versus the noise prediction when you say "hey, it's a horse" will be all the pixels where it says "no, these pixels are not zebra". The rest of it is fine; there's nothing particular about the background that wouldn't work with a zebra. Okay, so that's step one. Step two is we take the horse and we add noise to it; that's this x_r thing we learned about before. And then step three: we do decoding conditioned on the text query, using the mask to replace the background with pixel values. So this is the idea we heard about before, which is that at inference time, as you do diffusion from this fuzzy horse, we do a step of diffusion inference, and then all these black pixels we replace with the noised version of the original. We do that multiple times, which means that the original pixels in this black area won't get changed. And that's why you can see in this picture here and this picture here that the background's all the same, and the only thing that's changed is that the horse has been turned into a zebra. So this paragraph describes it, and then you can see here it gives you a lot more detail, and the detail often has all kinds of little tips about things they tried and things they found, which is pretty cool. I won't read through all that, because it says the
same as what I've already just said. One of the interesting little things they note here, actually, is that this binarized mask, this difference between the R decoding and the Q decoding, tends to be a bit bigger than the actual area where the horse is, which you can kind of see with these legs, for example. And their point is that that's actually a good thing, because often you want to slightly change some of the details around the object, so this is actually fine. Alright, so we have a description of what the thing is, lots of details there, and then here's the bit that I totally skip: the bit called "theoretical analysis". This is the stuff that people generally just add to try to get their papers past review; you have to have fancy math. So they're basically proving, you can see what it says here, "insight into why this component yields better editing results than other approaches". I'm not sure we particularly care, because what they're doing makes perfect sense; it's intuitive, and we can see it works. I don't feel like I need it proven to me, so I skip over that. Then they show us their experiments, to tell us what datasets they ran the experiments on, and then they have metrics with names like LPIPS and CSFID. You'll come across FID a lot; this is just a version of that. Basically they're trying to score how good their generated images are. We don't normally care about that either; they care because they need to be able to say "you should publish our paper because it has a higher number than the other people who have worked on this area". In our case, we can just say, you know, "it looks good, I like it". An excellent question in the chat from Mikalaj: will this only work on things that are relatively similar? I think this is a great point; this is where understanding the paper helps you know what its limitations are going to be. And that's exactly right: if you can't come up with a mask for the change you want, this isn't going to work very well, on
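The mask-generation step can be sketched with toy arrays. In the real version, `noise_ref` and `noise_query` would be the diffusion model's noise estimates conditioned on the reference text ("horse") and the query text ("zebra"); here they're just stand-in NumPy arrays, and the function name and threshold are my own choices, not the paper's:

```python
import numpy as np

def diffedit_mask(noise_ref, noise_query, threshold=0.5):
    # DiffEdit step 1 sketch: where the two noise predictions disagree
    # is where the object needs to change.
    diff = np.abs(noise_query - noise_ref).mean(axis=-1)  # average over channels
    diff = diff / diff.max()                              # normalise to [0, 1]
    return (diff > threshold).astype(np.uint8)            # binarise to 1s and 0s

# Toy example: the predictions agree everywhere except a 2x2 "object" region.
rng = np.random.default_rng(0)
n_ref = rng.normal(size=(4, 4, 3))
n_query = n_ref.copy()
n_query[1:3, 1:3] += 5.0   # big disagreement where the "horse" pixels are
mask = diffedit_mask(n_ref, n_query)
print(mask)   # 1s in the 2x2 centre block, 0s elsewhere
```

In practice you'd average the difference over several noise samples to stabilise the mask, which is one of the details discussed in the paper.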
the whole, because in the masked areas the pixels are going to be copied. So for example, if you wanted to change it from a bowl of fruit to a bowl of fruit with a bokeh background, or a purple-tinged photo of a bowl of fruit: if you want the whole colour to change, that's not going to work, because you're not masking off an area. So by understanding the detail here, Mikalaj has exactly recognised a limitation. What's this for? It's for things where you can say "just change this bit and leave everything else the same". Alright, so there are lots of experiments. For some things you care about the experiments a lot, if it's something like classification; for generation, the main thing you probably want to look at is the actual results. And often, I guess because most people read these electronically, you have to zoom in a lot to be able to see whether the results are really good. So here's the input image; they want to turn this into an English foxhound. Here's the thing they're comparing themselves to, SDEdit, and it's changed the composition quite a lot, whereas their version hasn't changed it at all; it's only changed the dog. And ditto here, "semi-trailer truck": SDEdit has totally changed it, and DiffEdit hasn't. So you can get a sense that the authors are showing off what they're good at here. This is what this technique is effective at: changing animals and vehicles and so forth, and it does a very good job of it. Alright, then there's going to be a conclusion at the end, which I find almost never adds anything on top of what we've already read, and as you can see, it's very short. Now, quite often the appendices are really interesting, so don't skip over them. Often you'll find more examples of pictures; they might show some examples of pictures that didn't work very well, stuff like that. It's often well worth looking at the appendices; often some of the most interesting examples are there. And
that's it. Alright, so that is, I guess, our first full-on paper walkthrough. It's important to remember this is not a carefully chosen paper that we picked specifically because you could handle it; this is the most interesting paper that came out this week, so it gives you a sense of what it's really like. And for those of you who are ready to try something that's going to stretch you: see if you can implement any of this paper. There are three steps. The first step is kind of the most interesting one, which is to automatically generate a mask, and the information and the code in the lesson 9 notebook actually contain everything you need to do it. So maybe give it a go; see if you can mask out a horse that does not look like a zebra. And that's actually useful of itself: it allows you to create segmentation masks automatically, which is pretty cool. Then if you get that working, you can go and try step 2, and if you get that working, you can try step 3. This only came out this week, so I haven't really seen examples of easy-to-use interfaces to it. So here's an example of a paper where you could be the first person to create a cool interface to it. There's a fun little project. And even if you're watching this a long time after it was released and everybody's been doing this for years, it's still good homework, I think, to practice if you can. Alright, I think now's a good time to have a 10-minute break, so I'll see you all back here in 10 minutes, and pop questions into the forum topic if you've got questions that haven't been answered yet. Okay, let's see what's on the forum. Shashank's got an interesting question: can DiffEdit be used for changing the background? So maybe you or somebody else can have a think about that and tell me, or put on the forum or in the YouTube chat, what you think based on what we heard today. Interesting conversation. I just had a thought during the break, which is that it's really unusual, really
unusual, to be doing a course at the cutting edge of a rapidly moving field which is extremely societally significant, and to literally be studying a paper which is at the leading edge of that field. If you can find some time to take advantage of this, you could maybe be the first person to write a really clear blog post describing this approach. You could use a notebook to actually show every step, show the examples. If one of you doesn't do it, maybe I'll do it myself, but it would be much better if one of you did it. You could say, "oh, here's what happens when we calculate the noise using this prompt, and here it is using that prompt, and let's look at some pixels", and you could really show it step by step, in a notebook, in a very code-first way. I think that'd be amazing. Or you could show an example with changing the background, or you could write a post about what it can't do by showing some failure cases, or you could come up with some slight variations to experiment with. Yeah, I think there are opportunities here; you could really make a name for yourself and get a lot of attention for your work during this course. And now's a great time, because we've just done three really interesting papers in the last two weeks, so there's real value in helping people digest them, try things with them, and explore them. So don't just put it on the forum; get it out there on blogs, and most importantly, make sure you talk about it on Twitter, because that's where all the machine learning people are. That would be a little suggestion, if you've got the time and interest. Okay, back to here. The other thing I'd mention is that Jono and I and the others talk sometimes about how we find good papers. Pretty much everybody follows this account; if you have a look, it's just paper after paper after paper, and somehow he finds consistently really interesting papers. So if you do nothing else, if you're looking for interesting papers,
follow this account. I don't know how he does it. Okay, welcome back. One thing during the break that Diego reminded us about, which I normally describe and totally forgot about this time, is Detexify, which is another really great way to find symbols you don't know. So let's try it for that expectation symbol. You go to Detexify and you draw the thing. It doesn't always work fantastically well, but sometimes it works very nicely. Yeah, in this case, not quite. What about the double-line thing? It's good to know all the techniques, I guess. Part of the problem is there are so many options that in this case it wasn't particularly helpful; normally it's more helpful than that. If we use a simple one, epsilon, it should be fine. There's a lot of room to improve this app, actually, if anybody's interested in a project; I think you could make it more successful. Okay, there you go: sigma, sum. That's cool. Anyway, it's another useful thing to know about; just google for Detexify. Okay, so let's move on from the foundations now. We were working on trying to at least get the start of a forward pass of a linear model, or a simple multi-layer perceptron, for MNIST going, and we had successfully created a basic tensor and got some random numbers going. So what we now need to do is multiply these things using matrix multiplication. So, matrix multiplication, to remind you: we're doing MNIST, and I think we're going to use a subset. Let's see. We're going to create a matrix called m1, which is just the first five digits. So m1 will be the first five digits: five rows, and then, 780... what is it again? 784 columns. 784 columns, because it's 28 by 28 pixels and we flattened it out. So this is our first matrix in our matrix multiplication, and then we're going to multiply that by some weights. The weights are going to be 784 by 10 random numbers: for every one of these 784 pixels, each one is going to have a weight,
so, 784 down here: 784 by 10. This first column, for example, is going to tell us all the weights needed to figure out if something's a zero, and the second column will have all the weights for deciding the probability that something's a one, and so forth, assuming we're just doing a linear model. And then we're going to multiply these two matrices together. When we multiply matrices together, we take row one of matrix one and column one of matrix two, and we take each element in turn: we take this one and this one and multiply them together, then this one and this one and multiply them together, and we do that for every element-wise pair, and then we add them all up. That gives us the value for the very first cell that goes in here. That's what matrix multiplication is. So let's go ahead and create our random numbers for the weights, since we're allowed to use random number generators now, and for the bias we'll just use a bunch of zeros to start with; the bias is just what we're going to add to each one. So for our matrix multiplication, we're going to be doing a little mini-batch here: five rows, as we discussed, so five images flattened out, multiplied by this weights matrix. Here are the shapes: m1 is 5 by 784, as we saw, and m2 is 784 by 10. Keep those in mind. So here's a handy thing: m1.shape contains two numbers, and I want to pull them out. I'm going to think of these as a and b rather than m1 and m2: the number of rows in a and the number of columns in a. If I say ar,ac = m1.shape, that puts 5 in ar and 784 in ac. You'll probably notice I do this a lot, this destructuring; we talked about it last week too. We can do the same for m2.shape, putting that into b-rows and b-columns. And so now if I write out ar, ac and br, bc, you can again see the same things as the sizes. So that's a good
way to give us the stuff we have to loop through. So here's our resultant tensor. We're multiplying together all of these 784 things and adding them up, so the resultant tensor is going to be 5 by 10, and each thing in here is the result of multiplying and adding 784 pairs. So the result is going to start with zeros, and it's going to contain ar rows (5 rows) and bc columns (10 columns): 5 by 10. Okay, so now we have to fill that in, and to do a matrix multiplication we have to go through each row one at a time, and then go through each column one at a time, and then go through each pair in that row and column one at a time. So there's going to be a loop in a loop in a loop. Here we loop over each row, here we loop over each column, and then here we loop over the columns of A, which is the same as the number of rows of B, as we can see here: ac is 784, br is 784, they're the same. So it wouldn't matter whether we said ac or br. Then for our result at that row and that column, we add on the product of element i,k in the first matrix and element k,j in the second matrix. So k is going up through those 784; it's going to go across this row while it goes down this column. So here is the world's most naive, slow, uninteresting matrix multiplication. And if we run it, okay, it's done something. We have, apparently, successfully multiplied the matrices m1 and m2. It's a little hard to read this, I find. Because punch cards used to be 80 columns wide, we still assume screens are 80 columns wide, and the printed output defaults to 80 wide, which is ridiculous, but you can easily change it. If you call set_printoptions, you can choose your own line width.
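The triple-loop matmul just described looks roughly like this. It's a from-scratch sketch in NumPy (the lesson works with PyTorch tensors, but the logic is identical):

```python
import numpy as np

def matmul(a, b):
    # Naive three-loop matrix multiplication, exactly as described:
    # loop over rows of a, columns of b, then each of the ac pairs.
    ar, ac = a.shape   # destructuring: rows and columns of a
    br, bc = b.shape
    assert ac == br    # inner dimensions must match
    res = np.zeros((ar, bc))
    for i in range(ar):          # each row of the result
        for j in range(bc):      # each column of the result
            for k in range(ac):  # each element-wise pair
                res[i, j] += a[i, k] * b[k, j]
    return res

m1 = np.random.rand(5, 784)   # 5 flattened 28x28 digits (stand-in data)
m2 = np.random.rand(784, 10)  # weights: one column per output class
print(matmul(m1, m2).shape)   # → (5, 10)
```

That innermost `+=` runs 5 × 10 × 784 = 39,200 times, which is why this is so slow in pure Python.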
You can see, well, we know it's 5 by 10; we did it before. So if we change the line width, okay, that's much easier to read now: we can see the five rows, and here are the 10 columns of that matrix multiplication. I tend to always put this at the top of my notebooks, and you can do the same thing for NumPy as well. So what I like to do, and this is really important: when I'm working on code, particularly numeric code, I like to do it all step by step in Jupyter. Then, once I've got it working, I copy all the cells that implemented it, paste them, select them all, and hit Shift-M to merge. I get rid of anything that prints out stuff I don't need, put a header on top, give it a function name, then select the whole lot and hit Ctrl (or Cmd) right square bracket, and I've turned it into a function. But I still keep the stuff above it, so I can see all the step-by-step stuff for learning about it later. And that's what I've done here to create this function, which does exactly the same things we just did. We can see how long it takes to run by using %time, and it took about half a second, which, gosh, that's a long time to generate such a small matrix; this is just to do five MNIST digits. So that's not going to be great; we're going to have to speed that up. I'm actually quite surprised at how slow that is, because if you look at the loop within a loop within a loop, it's only doing 39,200 of these innermost operations. So Python, when you're just doing plain Python, is slow. We can't do that; that's why we can't just write Python. But there is something that kind of lets us write Python: we could instead use Numba. Numba is a system that takes Python and turns it basically into machine code, and it's amazingly easy to do.
You can basically take a function and write @njit on top. What it does is, the first time you call the function, it compiles it down to machine code, and then it runs much more quickly. So what I've done here is I've taken the innermost loop, just looping through and adding up all these: start at zero, go through and add up all those products for two vectors, and return it. This is called a dot product in linear algebra, so we'll call it dot. Numba only works with NumPy, not PyTorch, so we're just going to use arrays instead of tensors for a moment. Now, have a look at this. If I do a dot product of 1,2,3 and 2,3,4, it took a fifth of a second, which sounds terrible. But the reason it took a fifth of a second is that that's actually how long it took to compile it and run it. Now that it's compiled, the second time it just has to call it, and it's 21 microseconds. That's actually very fast. So with Numba, we can basically make Python run at C speed. Now, the important thing to recognize is that if I replace this inner loop in Python with a call to dot, which is running in machine code, then we have two loops running in Python, not three. So, our 448 milliseconds... well, first of all, let's make sure: if I run that matmul, it should be close to my t1 (t1 is what we got before, remember?). When I'm refactoring or performance-improving or whatever, I always like to put every step in the notebook and then test. This test_close comes from fastcore.test, and it just checks that two things are very similar; they might not be exactly the same because of little floating-point differences, which is fine. So our matmul is working correctly, or at least it's doing the same thing it did before. And if we now run it, it's taking 268 microseconds versus 448 milliseconds, so it's about 2,000 times faster, just by changing the one innermost loop.
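Here's the same refactor sketched in plain Python/NumPy: `dot` is the innermost loop pulled out into its own function. In the lesson it gets Numba's `@njit` decorator (`from numba import njit`), which compiles it to machine code on first call; I've left it un-decorated here so the sketch runs without Numba installed:

```python
import numpy as np

def dot(a, b):
    # The innermost loop as its own function: sum of element-wise products.
    # With Numba, decorating this with @njit compiles it to machine code
    # the first time it's called.
    res = 0.0
    for i in range(len(a)):
        res += a[i] * b[i]
    return res

def matmul(a, b):
    # Two Python loops now, not three: the inner one is a dot() call.
    ar, ac = a.shape
    br, bc = b.shape
    res = np.zeros((ar, bc))
    for i in range(ar):
        for j in range(bc):
            res[i, j] = dot(a[i, :], b[:, j])
    return res

print(dot(np.array([1.0, 2, 3]), np.array([2.0, 3, 4])))  # 1*2 + 2*3 + 3*4 → 20.0
```

With `@njit` applied, that `dot` is where nearly all the speed-up comes from, since it's the part executed 39,200 times.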
So really, all we've done is added an njit decorator to make it 2,000 times faster. Numba is well worth knowing about; it can make your Python code very, very fast. Okay, let's keep making it faster. We're going to use stuff that goes back to APL, and a lot of people say that learning APL has taught them more about programming than anything else, so it's probably worth considering learning APL. Let's just look at these various things. We've got a is 10, 6, negative 4. Remember, in APL we don't say equals; equals actually means equals, funnily enough. To say "set to", we use this arrow. So this is a list of 10, 6, negative 4. And then b is 2, 8, 7. And we're going to add them up: a plus b. So what's going on here? It's really important that you can think of a symbol like a as representing a tensor or an array. APL calls them arrays, PyTorch calls them tensors, NumPy calls them arrays; they're the same thing. So this is a single thing that contains a bunch of numbers, this is a single thing that contains a bunch of numbers, and this is an operation that applies to arrays or tensors. What it does is work element-wise: it takes each pair, 10 and 2, and adds them together; each pair, 6 and 8, and adds them together. This is element-wise addition. And Fred's asking in the chat, how do you put in these symbols? If you just mouse over any of them, it will show you how to write it, and the one you want is the one at the very bottom, where it says "prefix". The prefix is the backtick character. So here it's saying prefix hyphen gives us times: if I type prefix hyphen, there we go. So a backtick-dash b is a times b, for example. They all have shortcut keys, which you learn pretty quickly, I find, and there's a fairly consistent system for those shortcut keys too. Alright, so we can do the same thing in PyTorch.
It's a little more verbose in PyTorch, which is one reason I often like to do my mathematical fiddling around in APL: I can often do it with less boilerplate, which means I can spend more time thinking, and I can see everything on the screen at once. I don't have to spend as much time trying to ignore the tensor, round bracket, square bracket, dot, comma, blah, blah, blah; it's all cognitive load which I'd rather avoid. But anyway, it does the same thing: I can say a plus b, and it works exactly like APL. So here's an interesting example: I can go (a < b).float().mean(). Let's try that one over here: a less than b. This is a really important idea, which I think was invented by Ken Iverson, the APL guy: true and false are represented by 0 and 1. And because they're represented by 0 and 1, we can do things with them: we can add them up, subtract them, and so forth. It's a really important idea. So in this case, I want to take the mean of them. And I'm going to tell you something amazing, which is that in APL there is no function called mean. Why not? Because we can write the mean function, which, that's four letters, m-e-a-n, we can write the mean function from scratch with four characters. I'll show you. Here is the whole mean function. We're going to create a function called mean, and the mean is equal to the sum of a list divided by the count of a list. So this here is sum, divided by, count. And so I have now defined a new function called mean, which calculates the mean: mean of a less than b. There we go. In practice, I'm not sure people would even bother defining a function called mean, because it's just as easy to write its implementation directly in APL. In NumPy or Python, it's going to take a lot more than four characters to implement mean. So anyway, it's a math notation.
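The same trues-and-falses-as-numbers trick in NumPy (the lesson's PyTorch version is `(a < b).float().mean()`; NumPy behaves identically, so this is a runnable stand-in):

```python
import numpy as np

a = np.array([10, 6, -4])
b = np.array([2, 8, 7])

# True/False are just 1/0 (Ken Iverson's idea), so the mean of a
# comparison is the fraction of positions where it holds.
print((a < b).mean())   # 2 of the 3 comparisons are True → 0.666...
```

That one expression, sum-of-ones divided by count, is exactly what the four-character APL `mean` is doing.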
And being a math notation, we can do a lot with a little, which I find helpful, because I can see everything going on at once. Anywho, okay, so that's how we do the same thing in PyTorch. And again, you can see that the less-than in both cases is operating element-wise: a less than b is saying 10 less than 2, 6 less than 8, negative 4 less than 7, and gives us back each of those trues and falses as 0s and 1s. And according to the emoji in our YouTube chat, someone's head just exploded, as it should; this is why APL is life-changing. Okay, let's now go up to higher ranks. This here is a rank-1 tensor. A rank-1 tensor means it's a list of things; it's a vector. Whereas a rank-2 tensor is like a list of lists (they all have to be lists of the same length), a rectangular bunch of numbers, and in math we call it a matrix. So this is how we can create a tensor containing 1, 2, 3, 4, 5, 6, 7, 8, 9. And you can see, often what I like to do is print out the thing I just created right after I create it. Two ways to do it: you can put in a newline and then write m, and that will do it, or if you want it all on the same line, that works too; you just use a semicolon. Neither one's better than the other; they're just different. We could do the same thing in APL, and of course in APL it's going to be much easier: we define a matrix called m, which is a 3-by-3 tensor containing the numbers from 1 to 9. Okay, and there we go; that's done it in APL. A lot of these ideas from APL, you'll find, have made their way into other programming languages. For example, if you use Go, you might recognize this: this is the iota character, and Go uses the word iota, so they spell it out in a somewhat similar way. A lot of these ideas from APL have found their way into math notation and other languages; it's been around since the late 50s.
OK, so here's a bit of fun. We're going to learn about a new thing that looks kind of crazy, called the Frobenius norm, and we'll use it from time to time as we're doing generative modeling. Here's the definition of the Frobenius norm: it's the sum, over all of the rows and columns of a matrix, of each element squared, and then we take the square root. And to implement that in PyTorch is as simple as going (m*m).sum().sqrt(). So this looks like a pretty complicated thing when you first look at it, a lot of squiggly business, or if you saw this thing here, you might think, what on earth is that? Well, now you know it's just square, sum, square root. So again, we can do the same thing in APL. In APL we're going to create something called sf. Now, interestingly, APL does this a little differently. .sum() in PyTorch by default sums over everything, and if you want to sum over just one dimension, you have to pass in a dimension keyword. For very good reasons, APL is the opposite: it just sums across rows, or down columns. So actually we have to say: sum up the flattened-out version of the matrix, and to say "flattened out", you use comma. So here's "sum up the flattened-out version of the matrix". That's our sf. Oh, sorry, and the matrix is meant to be m times m; there you go. So there's the same thing: sum up the flattened-out m-times-m matrix. Another interesting thing about APL is that it's always read right to left; there's no such thing as operator precedence, which makes life a lot easier. OK, and then we take the square root of that. There isn't a square root function, so we have to go "to the power of 0.5". And there we go, same thing. Alright, you get the idea. Yes, a very interesting question here from Marabu: are the bars for norm or absolute value?
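The square-sum-square-root computation, in NumPy terms (the lesson's PyTorch version is `(m*m).sum().sqrt()`; this is the equivalent runnable sketch):

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3).astype(float)  # the 3x3 matrix of 1..9

# Frobenius norm: square every element, sum them all, take the square root.
frob = np.sqrt((m * m).sum())
print(frob)                 # ≈ 16.88 (sqrt of 1+4+9+...+81 = sqrt(285))

# Sanity check: for a matrix, NumPy's norm defaults to Frobenius.
print(np.linalg.norm(m))
```

So the scary double-bar-with-subscript-F notation is three small operations chained together.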
And I like Siva's answer, which is that the norm is the same as the absolute value for a scalar. So in this case, you can think of it as absolute value, and it's kind of not needed, because it's being squared anyway. But yes, in this case, well, in every case for a scalar, the norm is the absolute value, which is kind of a cute discovery when you realize it. So thank you for pointing that out, Siva. Alright, so this is just fiddling around a little to get a sense of how these things work. Really importantly, you can index into a matrix, and you say rows first, then columns, and if you say colon, it means all the columns. So if I say row two... here it is: row two, all the columns. (Sorry, this is row two because it starts at 0; APL starts at 1.) All the columns, and that's going to be 7, 8, 9. You can see I often use comma to print out multiple things, and I don't have to say print in Jupyter; it's kind of assumed. So this is just a quick way of printing out the second row. And then here: every row, column two. So here is every row of column two, and you can see 3, 6, 9. One thing that's very useful to recognize is that for tensors of rank higher than one, such as a matrix, any trailing colons are optional. You see this here: m[2] is the same as m[2,:]. That's really important to remember. So m[2], you can see the result is the same; it means row two, every column. OK. So now, with all that in place, we've got quite an easy way; we don't need Numba anymore. We can get rid of that innermost loop, because it's just multiplying together all the corresponding elements of a row of A with all the corresponding elements of a column of B, and we can use an element-wise operation for that. So here is the i-th row of A, and here is the j-th column of B, and those are both, as we've seen, just vectors.
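The indexing rules just mentioned, shown in NumPy (PyTorch uses the same syntax):

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)  # rows: [1 2 3], [4 5 6], [7 8 9]

print(m[2, :])   # row 2 (0-based), all columns → [7 8 9]
print(m[:, 2])   # every row, column 2 → [3 6 9]

# Trailing colons are optional for rank > 1:
print((m[2] == m[2, :]).all())   # → True
```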
And therefore, we can do an element-wise multiplication of them and then sum them up. And that's the same as a dot product. So that's handy. And so again, we'll do test_close. OK, it's the same. Great. And again, you'll see we kind of did all of our experimenting first to make sure we understood how it all worked, and then put it together. And then if we time it: 661 microseconds. OK, it's interesting. It's actually slower than the Numba version, which really shows you how good Numba is. But it's certainly a hell of a lot better than our 450 milliseconds. And we're using something that's kind of a lot more general now. This is exactly the same as dot, as we've discussed. So we could just use torch dot. torch.dot, I suppose I should say. And if we run that: OK, a little faster. Interestingly, it's still slower than Numba, which is quite amazing, actually. All right, so that one was not exactly a speed-up, but it's kind of more general, which is nice. Now we're going to get into something really fun, which is broadcasting. And broadcasting is about what happens if you have arrays with different shapes. So what's a shape? The shape is the number of rows, or the number of rows and columns, or the number of, what would you say, faces, rows and columns, and so forth. So for example, the shape of m is 3 by 3. So what happens if you multiply or add or do operations to tensors of different shapes? Well, there's one very simple case, which is if you've got a rank 1 tensor, a vector, then you can use any operation with a scalar, and it broadcasts that scalar across the tensor. So a greater than 0 is exactly the same as saying a greater than tensor 0, 0, 0. So it's basically copying that across three times. Now it's not literally making a copy in memory, but it's acting as if we had said that. And this is the most simple version of broadcasting. It's broadcasting the 0 across the 10 and the 6 and the negative 4. And APL does exactly the same thing.
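Here's a minimal sketch of the matmul at this stage: the innermost loop replaced by an element-wise multiply and sum (a dot product of a row of a with a column of b). The random test matrices are assumed, not from the lesson's notebook:

```python
import torch

def matmul(a, b):
    # two loops remain; the innermost one is now a dot product
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # element-wise multiply the i-th row of a by the
            # j-th column of b, then sum: that's a dot product
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c

a, b = torch.randn(4, 3), torch.randn(3, 5)
assert torch.allclose(matmul(a, b), a @ b, atol=1e-5)
```

The inner expression could equally be written with torch.dot, as mentioned above.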
a is less than 5. So 0, 0, 1. So same idea. OK. So we can do plus with a scalar. And we can do exactly the same thing with rank higher than 1. So 2 times a matrix is just going to be broadcast across all the rows and all the columns. OK, now it gets interesting. So broadcasting dates back to APL. But a really interesting idea is that we can broadcast not just scalars: we can broadcast vectors across matrices, or broadcast any kind of lower-rank tensor across higher-rank tensors, or even broadcast together two tensors of the same rank but different shapes, in a really powerful way. And as I was exploring this (I love doing this kind of computer archeology) I was trying to find out where the hell this comes from. And it actually turns out, from this email message in 1995, that the idea actually comes from a language that I'd never heard of called Yorick, which still apparently exists. Here's Yorick. And so Yorick talks about broadcasting and conformability. So what happened is this very obscure language has this very powerful idea, and NumPy has happily stolen the idea from Yorick, which allows us to broadcast together tensors that don't appear to match. So let me give an example. Here's a tensor called c that's a vector. It's a rank one tensor: 10, 20, 30. And here's a tensor called m, which is a matrix. We've seen this one before. And one of them is shape 3, 3. The other is shape 3. And yet we can add them together. Now, what's happened when we added them together? Well, what's happened is 10, 20, 30 got added to 1, 2, 3. And then 10, 20, 30 got added to 4, 5, 6. And then 10, 20, 30 got added to 7, 8, 9. And hopefully you can see this looks quite familiar. Instead of broadcasting a scalar over a higher-rank tensor, this is broadcasting a vector across every row of a matrix. And it works both ways. So we can say c plus m gives us exactly the same thing. And so let me explain what's actually happening here. The trick is to know about this somewhat obscure method called expand_as.
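The vector-across-matrix addition just described can be sketched as (values assumed to match the running example):

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

# c (shape (3,)) is broadcast across every row of m (shape (3, 3)),
# and it works both ways: m + c and c + m are identical
added = m + c
assert torch.equal(added, c + m)
```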
And what expand_as does is it creates a new thing called t, which contains exactly the same thing as c, but expanded, or kind of copied over, so that it has the same shape as m. So here's what t looks like. Now, t contains exactly the same thing as c does, but it's got three copies of it now. And you can see we can definitely add t to m because they have matching shapes, right? So we can say m plus t. We know we can do m plus t because we've already learned that you can do element-wise operations on two things that have matching shapes. Now, by the way, this thing t didn't actually create three copies. Check this out. If we call t dot storage, it tells us what's actually in memory. It actually just contains the numbers 10, 20, 30. But it does a really clever trick. It has a stride of zero across the rows and a size of (3, 3). And so what that means is that it acts as if it's a three by three matrix, and each time it goes to the next row, it actually stays exactly where it is. And this idea of strides is the trick which NumPy and PyTorch and so forth use for all kinds of things where you basically can create very efficient ways to do things like expanding, or to kind of jump over things, you know, switch between columns and rows, stuff like that. Anyway, the important thing for us to recognize here is that we didn't actually make a copy. This is totally efficient, and it's all gonna be run in C code very fast. So remember, this expand_as is critical. This is the thing that will teach you to understand how broadcasting works, which is really important for implementing deep learning algorithms, or any kind of linear algebra on any Python system. Because the NumPy rules are used exactly the same in JAX, in TensorFlow, in PyTorch, and so forth. Now I'll show you a little trick which is gonna be very important in a moment.
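A sketch of the expand_as trick; here t.stride() is used to show the no-copy behavior (the lesson uses t.storage() for the same point, and the matrix values are assumed since only its shape matters):

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.zeros(3, 3)  # any 3x3 matrix; only its shape matters here

t = c.expand_as(m)     # looks like three stacked copies of c
assert t.tolist() == [[10., 20., 30.]] * 3

# no data was actually copied: the row stride is 0, so moving to the
# "next row" re-reads the same three numbers in memory
assert t.stride() == (0, 1)
```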
If we take c, which remember is a vector containing 10, 20, 30, and we say dot unsqueeze(0), then it changes the shape from 3 to 1 comma 3. So it changes it from a vector of length three to a matrix of one row by three columns. This will turn out to be very important in a moment. And you can see how it's printed: it's printed out with two square brackets. Now I never use unsqueeze, because I much prefer doing something more flexible, which is that if you index into an axis with the special value None, also known as np.newaxis, it does exactly the same thing. It inserts a new axis here. So here we'll get exactly the same thing, one row by all the columns, three columns. So this is exactly the same as saying unsqueeze. So this inserts a new unit axis (a single row) in this dimension, and this does the same thing. So these are the same. So we could do the same thing and say unsqueeze(1), which means now we're going to insert the unit axis at position one. So that means we now have three rows and one column. See the shape here? The shape is inserting a unit axis in position one: three rows and one column. And so we can do exactly the same thing here: give us every row, and a new unit axis in position one. Same thing. Okay, so those two are exactly the same. So this is how we create a matrix with one row, and this is how we create a matrix with one column: None comma colon versus colon comma None. Or unsqueeze. We don't have to say None comma colon, because, do you remember, trailing colons are optional. So therefore, just c None is also going to give you a one-row matrix. And here's a little trick: if you say dot dot dot, that means all of the dimensions. And so dot dot dot comma None will always insert a unit axis at the end, regardless of what rank a tensor is. So yeah, None and np.newaxis mean exactly the same thing. np.newaxis is actually a synonym for None, if you've ever used that. I always use None.
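The unsqueeze and None-indexing equivalences can be sketched as (c's values are the assumed 10, 20, 30):

```python
import torch

c = torch.tensor([10., 20., 30.])

assert c.unsqueeze(0).shape == (1, 3)  # one row, three columns
assert c[None, :].shape == (1, 3)      # same thing via None indexing
assert c.unsqueeze(1).shape == (3, 1)  # three rows, one column
assert c[:, None].shape == (3, 1)
assert c[None].shape == (1, 3)         # trailing colon is optional
assert c[..., None].shape == (3, 1)    # ... means all existing dims
```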
Because why not? Short and simple. So here's something interesting. If we go c colon comma None, so let's go and check out what c colon comma None looked like: c colon comma None is a column. And if we say expand_as m, which is three by three, then it's gonna take that 10, 20, 30 column and replicate it: 10, 20, 30; 10, 20, 30; 10, 20, 30. So remember, when you say matrix plus c colon comma None, it's basically gonna do this dot expand_as for you. So if I want to add this matrix here to m, I don't need to say dot expand_as, I just write m plus c colon comma None. And so this is exactly the same as doing m plus c, but now rather than adding the vector to each row, it's adding the vector to each column. So that's a really simple thing that we now get kind of for free thanks to this really nifty notation, this nifty approach that came from Yorick. So here you can see m plus c None comma colon is adding 10, 20, 30 to each row, and m plus c colon comma None is adding 10, 20, 30 to each column. All right, so that's the basic, like, hand-wavy version. So let's look at what the rules are and how it works. Okay, so c None comma colon is one by three. c colon comma None is three by one. What happens if we multiply c None comma colon by c colon comma None? Well, if you think about it, which you definitely should because thinking is very helpful, let's see if this works, actually. I'm not quite sure if expand_as will do this: c None comma colon expand_as c colon comma None. What is going on here? Oh, that took forever. Okay, so what happens if we go c None comma colon times c colon comma None? What it's gonna have to do is take this 10, 20, 30 column vector, or three by one matrix, and make it work across each of these rows.
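The row-versus-column addition just described can be sketched like this (same assumed values):

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

by_row = m + c[None, :]  # adds 10, 20, 30 across each row
by_col = m + c[:, None]  # adds 10, 20, 30 down each column
```

The second form is the one that implicitly does the expand_as of the column for you.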
So what it does is it expands it to be 10, 20, 30; 10, 20, 30; 10, 20, 30. So it's gonna do it just like this, and then it's gonna do the same thing for c None comma colon. So that's gonna become three rows of 10, 20, 30. So we're gonna end up with three rows of 10, 20, 30 times three columns of 10, 20, 30, which gives us our answer. And so this is gonna do an outer product. So it's very nifty that you can actually do an outer product without any special functions or anything, just using broadcasting. And it's not just outer products: you can do outer Boolean operations. And this kind of stuff comes up all the time, right? Now remember, you don't need the comma colon, so get rid of it. So this is showing us all the places where one is greater than the other. It's kind of an outer Boolean, if you wanna call it that. So this is super nifty, and you can do all kinds of tricks with this, because it runs very, very fast. This is gonna be accelerated in C. So here are the rules. When you operate on two arrays or tensors, NumPy and PyTorch will compare their shapes. So remember, this is a shape; you can tell it's a shape because we said dot shape. And it goes from right to left, so that's the trailing dimensions, and it checks whether the dimensions are compatible. Now they're compatible if they're equal. So for example, if we say m times m, then those two shapes are compatible, because in each case it's gonna be three, so they're gonna be equal. So if the shape in that dimension is equal, they're compatible. Or if one of them is one, then that dimension is broadcast to make it the same size as the other. So that's why the outer product worked. We had a one by three times a three by one. And so this one got copied three times to make it this long, and this one got copied three times to make it this long. Okay, so those are the rules. And the arrays don't have to have the same number of dimensions. So here's an example that comes up all the time.
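The outer product and outer Boolean via broadcasting can be sketched with the same assumed vector:

```python
import torch

c = torch.tensor([10., 20., 30.])

# (1, 3) * (3, 1): both sides broadcast up to (3, 3), giving an
# outer product with no special function needed
outer = c[None, :] * c[:, None]

# trailing colon dropped: c alone plays the (1, 3) role, and the
# comparison gives an "outer Boolean"
bools = c > c[:, None]
```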
Let's say you've got a 256 by 256 by 3 array or tensor of RGB values. So you've got an image, in other words, a three-color image. And you want to normalize it. So you want to scale each color in the image by a different value. So this is how we normalize colors. One way is you could multiply or divide the image by a one-dimensional array with three values. So you've got a 1D array; its shape is just 3. And then the image is 256 by 256 by 3. And we go right to left and we check: are they the same? We say, yes, they are. And then we keep going left, and if a dimension is missing, we act as if it's one. And we keep going; if it's missing, we act as if it's one. So this is gonna be treated the same as 1 by 1 by 3. And so the three elements will be broadcast over all 256 by 256 pixels. So this is a super fast and convenient and nice way of normalizing image data with a single expression. And this is exactly how we do it in the fastai library, in fact. So we can use this to dramatically speed up our matrix multiplication. Let's just grab a single digit, just for simplicity. And I really like doing this in Jupyter Notebooks. And if you build Jupyter Notebooks to explain stuff that you've learned in this course, or ways that you can apply it, consider doing this for your readers, but add a lot more prose. I haven't added prose here because I wanna use my voice. For example, our book that we published is all written in Notebooks, and there's a lot more prose, obviously. But I really like to show every example all along the way, keeping it as simple as possible. So let's just grab a single digit. So here's the first digit. Its shape is a 784-long vector. And remember that our weight matrix is 784 by 10. So if we say digit colon comma None dot shape, then that is a 784 by 1 matrix, in other words, a single column. Okay, so there's our matrix.
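The per-channel image normalization described a moment ago can be sketched like this; the image contents and the scale values are made-up assumptions:

```python
import torch

# Hypothetical 256 x 256 RGB image and made-up per-channel scale values
img = torch.rand(256, 256, 3)
scale = torch.tensor([0.5, 0.4, 0.3])

# shape (3,) is treated as (1, 1, 3), so the three values broadcast
# over all 256 x 256 pixels in a single expression
normed = img / scale
assert torch.allclose(normed * scale, img)
```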
And so if we then take that 784 by 1 and expand_as m2, it's gonna be the same shape as our weight matrix. So it's copied our image data for that digit across all of the 10 vectors representing the 10 kind of linear projections we're doing for our linear model. And so that means that we can take the digit colon comma None, so 784 by 1, and multiply it by the weights. And that's gonna get us back 784 by 10. And so what it's doing, remember, is basically looping through each of these 10 784-long vectors, and for each one of them, it's multiplying it by this digit. So that's exactly what we want to do in our matrix multiplication. So most recently we had this dot product where we were actually looping over j, which was the columns of B. We don't have to do that anymore, because we can do it all at once by doing exactly what we just did. So we can take the i-th row and all the columns, and add an axis to the end. And then, just like we did here, multiply it by B, and then dot sum. And so that is, again, exactly the same thing: that is another matrix multiplication, doing it using broadcasting. Now this is tricky to get your head around, and so if you haven't done this kind of broadcasting before, it's a really good time to pause the video and look carefully at each of these four cells and understand: what did I do there? Why did I do it? What am I showing you? And then experiment with it yourself. And remember that we started with m1[0]. So just like we have a[i] here, that's why we've got a[i, :, None], because this digit is actually m1[0], and this is like m1[0] colon comma None. So this line is doing exactly the same thing as this here, plus a sum. So let's check if this matmul is the same as it used to be. Yeah, it's still working, and the speed of it, okay, not bad. So 137 microseconds.
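The broadcast version of matmul described above can be sketched as follows; the shapes of the test matrices are assumed to match the lesson's mini-batch (5 images of 784 pixels, a 784 by 10 weight matrix):

```python
import torch

def matmul(a, b):
    # broadcast version: one loop over the rows of a.
    # a[i, :, None] has shape (ac, 1) and broadcasts against b (ac, bc),
    # multiplying that row down every column of b at once
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        c[i] = (a[i, :, None] * b).sum(dim=0)
    return c

m1, m2 = torch.randn(5, 784), torch.randn(784, 10)
assert torch.allclose(matmul(m1, m2), m1 @ m2, atol=1e-4)
```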
So we've now gone from 500 milliseconds down to about 0.1 milliseconds. Oh, actually, now I think about it, my MacBook Air is an M2, whereas this Mac Mini is an M1, so that's a little bit slower; my Air was a bit faster than 0.1 milliseconds. So overall we've got about a 5,000 times speed improvement. So that is pretty exciting, and since it's so fast now, there's no need to use a mini-batch anymore. If you remember, we used a mini-batch of five images. But now we can actually use the whole data set because it's so fast. So now we can do the whole data set. There it is. We've now got 50,000 by 10, which is what we want. And it's taking us only 656 milliseconds now to do the whole data set. So this is actually getting to a point where we could start to create and train some simple models in a reasonable amount of time. So that's good news. All right. I think that's probably a good time to take a break. We don't have too much more of this to go, but I don't wanna keep you guys up too late. So hopefully you learned something interesting about broadcasting today. I cannot overemphasize how widely useful this is in all deep learning and machine learning code. It comes up all the time. It's basically our number one, most critical kind of foundational operation. So yeah, take your time practicing it, and also good luck with your diffusion homework from the first half of the lesson. Thanks for joining us, and I'll see you next time.