Okay, so today we're going to continue working on object detection, which means that for every object in a photo, in one of 20 classes, we're going to try to figure out what the object is and where its bounding box is, such that we can apply that model to a new dataset of unlabeled data and add those labels to it. The general approach we're going to use is to start simple and gradually make it more complicated. So we started last week with a simple classifier, the three-lines-of-code classifier. We then made it slightly more complex: a bounding box without a classifier. Today we're going to put those two pieces together to make a classifier plus a bounding box. All of these are just for a single object, the largest object, and then from there we'll build up to something which is closer to our final goal.

You should go back and make sure that you understand all of these concepts from last week before you move on. If you don't, go back and re-go through the notebooks carefully. I won't read them all to you because you can see them in the video easily enough. Perhaps the most important is knowing how to jump around source code in whatever editor you prefer to use. I've also added here a reminder of what you should know from part one of the course, because quite often I see questions on the forum asking, basically: why isn't my model working? Why doesn't it start training, or, having trained, why doesn't it seem to be any use? And nearly always the answer is: did you print out the inputs to it from a data loader? Did you print out the outputs from it after evaluating it? Normally the answer is no. When they try printing it, it turns out all the inputs are zero, or all of the outputs are negative — it's really obvious. So that's what I wanted to remind you about: you need to know how to do these two things. A, if you can't do that, then it's going to be very hard to debug models; and B, if you can do that but you're not doing it, it's also going to be very hard to debug models. You don't debug models by staring at the source code hoping your error pops out. You debug models by checking all of the intermediate steps: looking at the data, printing it out, plotting its histogram, making sure it makes sense.

So we were working through the pascal notebook, and we just quickly zipped through the bounding-box-of-the-largest-object-without-a-classifier part. There was one bit I skipped over and said I'd come back to, so let's do that now, which is to talk about data augmentations of the y, the dependent variable. Before I do, I'll just mention something pretty awkward in all this, which is that I've got here ImageClassifierData with continuous=True. This makes no sense whatsoever. A classifier is anything where the dependent variable is categorical or binomial, as opposed to regression, which is anything where the dependent variable is continuous. And yet this parameter, continuous=True, says that the dependent variable is continuous. So this claims to be creating data for a classifier where the dependent is continuous. This is the kind of awkward rough edge that you see when we're at the bleeding edge of fastai code that hasn't quite solidified yet. So probably by the time you watch this in the MOOC, this will be sorted out, and it will be called ImageRegressorData or something like that.
But I just wanted to point out this issue, also because sometimes people get confused between regression versus classification, and this is not going to help one bit.

Okay, so let's create some data augmentations. Now normally when we create data augmentations, we tend to type transforms_side_on or transforms_top_down. But if you look inside the fastai.transforms module, you'll see that they are simply defined as a list. This thing called transforms_basic is 10-degree rotations plus 0.05 brightness and contrast changes; transforms_side_on adds random horizontal flips to that; and transforms_top_down instead adds random dihedral-group symmetry flips, which basically means every possible 90-degree rotation, optionally with a flip — so eight possibilities. These are just little shortcuts I added because they seem to be useful a lot of the time, but you can always create your own list of augmentations. And if you're not sure what augmentations are there, you can obviously check the fastai source; or if you just start typing Random, they all start with Random, so you can see them easily enough.

So let's take a look at what happens if we create some data augmentations, create a model data object, and just rerun the iterator a bunch of times. We'll do two things: we'll print out the bounding boxes, so you can see the bounding boxes are the same each time, and we'll also draw the pictures. You'll see this lady is, as we would expect, flipping around and spinning around and getting darker and lighter — but the bounding box, A, is not moving, and B, is in the wrong spot. So this is the problem with data augmentation when your dependent variable is pixel values, or is in some way connected to your independent variable: the two need to be augmented together. And in fact, you can see from the printout that these numbers are bigger than 224, but these images are of size 224 — that's what we requested in these transforms. So it's not even being scaled or cropped or anything.

So our dependent variable needs to go through all of the same geometric transformations as our independent variable. To do that, every transformation has an optional tfm_y parameter, which takes a TfmType enum. The TfmType enum has a few options, all of which we'll cover in this course. The COORD option says that the y values represent coordinates — in this case, bounding box coordinates — and so if you flip, you need to change the coordinates to represent that flip, and if you rotate, you need to change the coordinates to represent that rotation. So I can add tfm_y=TfmType.COORD to all of my augmentations. I also have to add the exact same thing to my tfms_from_model function, because that's the thing that does the cropping and/or zooming and/or padding and/or resizing, and all of those things need to happen to the dependent variable as well. So if we add all of those together and rerun this, you'll see the bounding box changes each time, and you'll see it's in the right spot. Now, sometimes it looks a little odd — like here, why is that bounding box there? The problem is, this is just a constraint of the information we have: the bounding box does not tell us that her head isn't actually way over here in the top left corner.
But if you did a 30-degree rotation and her head was over in the top left corner, then the new bounding box would go really high. So this is actually the correct bounding box based on the information available, which is to say: this is how high she might have been. So you've got to be careful about doing rotations that are too big with bounding boxes, because there's not enough information for them to stay totally accurate — that's just a fundamental limitation of the information we're given. If we were doing polygons or segmentations or whatever, we wouldn't have this problem. So I'm going to do a maximum of 3-degree rotations to avoid that problem. I'm also going to rotate only half the time, I'm going to add my random flip, and I'm going to have my brightness and contrast changing — and so there's my set of transformations that I can use.

So we briefly looked at this custom head idea. If you look at .summary(), it does something pretty cool: it basically runs a small batch of data through the model and prints out how big it is at every layer. And we can see that at the end of the convolutional section, before we hit the flatten, it's 512 by 7 by 7. So a rank-3 tensor of that size, if we flatten it out into a single rank-1 tensor — into a vector — is going to be 25,088 long. That's why we had this linear layer, 25,088 to 4, because 4 is the number of bounding box coordinates. Stick that on top of a pre-trained ResNet, train it for a while, and there are the results we got. Okay, so that's where we got to last time.

So let's now put those two pieces together so that we can get something that classifies and does bounding boxes. There are three things that we need, basically, to train any neural network: we need to provide data, we need to pick some kind of architecture, and we need a loss function. The loss function says: anything that gives a lower number here is a better network, using this data and this architecture. So we're going to need to create those three things for our classification plus bounding box regression. That means we need a model data object which has, as the independent variable, the images, and as the dependent variable, a tuple: the first element of the tuple should be the bounding box coordinates, and the second element should be the class. There are lots of different ways you could do this. The particularly lazy and convenient way I came up with was to create two model data objects representing the two different dependent variables I want — one with the bounding box coordinates, one with the classes — just using the CSVs we built before. And now I'm going to merge them together.

So I create a new dataset class. A dataset class is anything which has a length and an indexer — something that lets you use it in square brackets like a list. In this case, I have a constructor which takes an existing dataset (so that's going to have both an independent and a dependent variable) and the second dependent variable that I want. The length, then, is obviously just the length of the first dataset. And getitem is: grab the x and the y from the dataset I passed in, and return that x along with that y and the i-th value of the second dependent variable I passed in. So there's a dataset that basically adds in a second dependent variable.
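As a rough sketch, a dataset like that might look like this — the class name follows the spirit of the notebook, but treat the details as illustrative:

```python
from torch.utils.data import Dataset

class ConcatLblDataset(Dataset):
    """Wrap an existing dataset and add a second dependent variable."""
    def __init__(self, ds, y2):
        self.ds, self.y2 = ds, y2      # ds yields (x, y); y2 holds the extra labels
    def __len__(self):
        return len(self.ds)            # same length as the wrapped dataset
    def __getitem__(self, i):
        x, y = self.ds[i]              # original independent and dependent variables
        return (x, (y, self.y2[i]))    # dependent is now a (bbox coords, class) tuple
```

Usage would just be something like `ConcatLblDataset(md.trn_ds, y2)`, where `y2` is the array of classes (names hypothetical).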
As I said, there are lots of ways you could do this, but it's kind of convenient, because now I can create our training dataset and validation dataset based on it. So here's an example — you can see it's got a tuple of the bounding box coordinates and the class. We can then take the existing training and validation data loaders and actually replace their datasets with these. And done. So we can now test it by grabbing a mini-batch of data and checking that we have something that makes sense. So that's one way to customize a dataset.

Now we've got the data, so we need an architecture. The architecture is going to be the same as the architectures that we used for the classifier and for the bounding box regression — we're just going to combine them. In other words, if there are c classes, then the number of activations we need in the final layer is 4 plus c: the four bounding box coordinates and the c probabilities, one per class. So the final layer is a linear layer that has 4 plus len(categories) activations; the first layer, as before, is a Flatten. We could just join those up together, but in general I want my custom head to hopefully be capable of solving the problem I give it on its own, if the pre-trained backbone it's connected to is appropriate. And in this case I'm thinking: I'm trying to do quite a bit here, two different things — the classifier and the bounding box regression — so just a single linear layer doesn't sound like enough. So I put in a second linear layer. You can see we basically go ReLU, dropout, linear, ReLU, batch norm, dropout, linear. If you're wondering why there's no batch norm back here: I checked the ResNet backbone, and it already has a batch norm as its final layer, so I don't need one. So this is basically nearly the same custom head as before; it just has two linear layers rather than one, with the appropriate non-linearities.

So that's piece two. We've got data, we've got an architecture; now we need a loss function. The loss function needs to look at these 4 plus c activations and decide: are they good? Are these numbers accurately reflecting the position and class of the largest object in this image? We know how to do that. For the first four, we use L1 loss, just like we did in the bounding box regression before — remember, L1 loss is like mean squared error, except rather than the sum of squares, it's the sum of absolute values. And then for the rest of the activations, we can use cross-entropy loss. So let's go ahead and do that. We're going to create something called detection loss. Loss functions always take an input and a target — that's what PyTorch always calls them: the input is the activations, the target is the ground truth. Remember that our custom dataset returns a tuple containing the bounding box coordinates and the classes, so we can use destructuring assignment to grab the bounding boxes and the classes of the target. And then the bounding boxes and the classes of the input are simply the first four elements of the input and the elements from four onwards. Remember, we've also got a batch dimension, so we grab everything along that dimension. So that's it. We've now got the bounding box target, bounding box input, class target, class input.
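Here's a minimal sketch of that destructuring, assuming the (bbox, class) tuple ordering above; the range scaling and loss weighting discussed next aren't included yet:

```python
import torch.nn.functional as F

def detn_loss(input, target):
    bb_t, c_t = target                      # ground truth: (bounding boxes, classes)
    bb_i, c_i = input[:, :4], input[:, 4:]  # first 4 activations are the box; the rest, class scores
    # L1 for the box coordinates, cross entropy for the class (weighting comes later)
    return F.l1_loss(bb_i, bb_t) + F.cross_entropy(c_i, c_t)
```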
For the bounding boxes, we know that the coordinates are going to be between 0 and 224, because that's how big our image is. So let's put the activations through a sigmoid, which forces them between 0 and 1, and multiply by 224 — that's just helping our neural net be in the range we know it has to be.

Question: "As a general rule, is it better to put batch norm before or after a ReLU?" I would suggest putting it after a ReLU, because batch norm is meant to move things towards a zero-mean, one-standard-deviation variable, and if you put a ReLU after it, you're truncating it at zero — there's no way to create negative numbers. But if you go ReLU and then batch norm, it does have that ability. I think that way of doing it gives slightly better results. Having said that, it's not too big a deal either way, and you'll see during this part of the course that most of the time I go ReLU then batch norm, but sometimes I go batch norm then ReLU if I'm trying to be consistent with a paper or something like that. I think the original batch norm paper put it before the activation, so there are still people who do that. Okay, so this sigmoid business is to help force our data into the right range — if you can do stuff like that, it makes it easier to train. Yes, Rachel?

One more question: "What's the intuition behind using dropout with p=0.5 after a batch norm? Doesn't batch norm already do a good job of regularizing?" Batch norm does an okay job of regularizing, but if you think back to part one, we had that list of things we do to avoid overfitting, and adding batch norm is one of them, as is data augmentation. But it's perfectly possible that you'll still be overfitting. One nice thing about dropout is that it has a parameter saying how much to drop out. Parameters are great — specifically, parameters that decide how much to regularize are great — because they let you build a nice big over-parameterized model and then decide how much to regularize it. So I tend to always include dropout. I'll start with p=0, and then as I need to add regularization, I can just change the dropout parameter without worrying about whether I saved a model with dropout layers in one version and not the other — this way it stays consistent when I load it back.

Okay, so now that I've got my inputs and targets, I can just say: calculate the L1 loss and add to it the cross entropy. So that's our loss function — surprisingly easy, perhaps. Now, of course, the cross entropy and the L1 loss may be of wildly different scales, in which case the larger one is going to dominate the loss function. I just ran this in the debugger (you can just use print), checked how big each of the two things was, and found that multiplying the cross entropy by 20 makes them about the same scale. As you're training, it's nice to print out information as you go, so I also grabbed the L1 part of this and put it in a function, and I created a function for accuracy, so that I can make them metrics and print them out as training proceeds. So we've now got something which is printing out our detection loss, detection accuracy, and detection L1. And so we train it for a while, and it's looking good. Our detection accuracy is in the low 80s, which is the same as what it was before. That doesn't surprise me, because ResNet was designed to do classification.
So I wouldn't expect us to be able to improve things in such a simple way. But it certainly wasn't designed to do bounding loss regression. It was explicitly actually designed in such a way as to kind of not care about geometry, right? It takes that last seven by seven grid of activations and averages them all together. It throws away all of the information about what it was getting from. So you can see that when we only trained the last layer, the detection L1 is pretty bad, 24, and it really improves a lot, right? Where else the accuracy doesn't improve, is just exactly the same. Interestingly, the L1, when we do accuracy and bounding loss at the same time, at 8.5, seems like it's a little bit better than when we just do bounding loss regression. And if that's counterintuitive to you, then that's what would be one of the main things to think about after this lesson, so it's a really important idea. And the idea is this, figuring out what the main object in an image is, is kind of the hard part. And then figuring out exactly where the bounding box is and what class it is is kind of the easy part in a way. And so when you've got a single network that's both saying what is the object and where is the object, it's going to share all of the computation about finding the object, right? And so all that shared information, all that shared computation is very efficient. So when we back propagate the errors in the class and in the place, that's all information that's going to help the computation around finding the biggest object, right? So anytime you've got multiple tasks, which kind of share some concept of what those tasks would need to do to complete their work, it's very likely they should share at least some layers of the network together, okay? And we'll look later today at a place where most of the layers are shared, but just the last one isn't, we'll see about that. Okay, so you can see this is doing a good job as before. Any time there's just a single major object, sometimes it's getting a little confused. It thinks the main object here is the dog and it's kind of circled the dog, although it's kind of recognized that actually the main object is a sofa. And so the classifier is doing the right thing with the bounding box is labeling the wrong thing, which is kind of curious. When there are two birds, it can only pick one, so it's just kind of hedging in the middle, ditto and there's lots of cows and so forth doing a good job with this kind of, all right. So that's, so that's that, all right? There's not much new there, although in that last bit, we did learn about some simple custom data sets and simple custom lots functions, hopefully you can see now how easy it is to do. So the next stage for me would be to do multi-label classification. So this is this idea that I just want to keep building models that are slightly more complex than the last model, but hopefully don't require too much extra concepts so I can kind of keep seeing things working. And if something stops working, I know exactly where it broke. I'm not going to try and build everything at the same time. So multi-label classification is so easy, it's not much to mention. So we've moved to Pascal multi, now this is where we're going to do the multi-object stuff. So for the multi-object stuff, I've just copied and pasted the functions from the previous notebook that we used. So they're all at the top. So we can create now a multi-class CSB file using the same basic approach that we did last time. 
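For instance, a sketch of writing such a CSV with pandas might look like the following, where `mc` (a dict mapping image filename to a list of class names) and `MC_CSV` are assumed names:

```python
import pandas as pd

# Space-join each image's classes into a single string; set() dedupes repeats
mcs = {fn: ' '.join(sorted(set(cs))) for fn, cs in mc.items()}
df = pd.DataFrame({'fn': list(mcs.keys()), 'clas': list(mcs.values())})
df.to_csv(MC_CSV, index=False)
```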
And I'll mention, by the way, that one of our students who's visiting from India, Pani, pointed out that all this stuff we were doing with defaultdicts and so forth could be done much more simply using pandas, and he shared that on the forum. So I totally bow to his much simpler, more concise approach, and it's definitely true: the more you get to know pandas, the more often you realize it's a good way to solve lots of different problems. So definitely check that out.

Question: "When you're building up the smaller models and you're iterating, do you reuse those models as pre-trained weights for this larger one, or do you just toss it all away and retrain from scratch?" When I'm figuring stuff out as I go, like this, I would generally lean towards tossing them away, because reusing pre-trained weights introduces complexities that I don't really want to think about. However, if I'm trying to get to a point where I can run something on really big images, I'll generally start on much smaller ones, and often I will reuse those weights.

Okay, so in this case, all we're doing is joining up all of the classes with a space, which gives us a CSV in the normal format. And once you've got the CSV in a normal format, it's the usual three lines of code, and we train it and print out the results. So there's literally not much to show you. As you can see, it's done a great job — the only mistake I think it made was that it called this one just 'dog', where it should have been dog and sofa; I think everything else is correct. So multi-label classification is pretty straightforward. One minor tweak here is to note that I used a set, because I don't want to list every instance of an object — I only want each object type to appear once. The set class is a way of deduplicating a list; that's why I don't have 'person, person, person, person, person' — it just appears once. So these pre-trained classification networks are actually pretty good at recognizing multiple objects, as long as each one only has to be mentioned once. That works pretty well.

All right, so we've got this idea that we've got an input image that goes through a convnet, which we're just treating as a black box, and it spits out a tensor — basically a vector — of size 4 plus c, where c is the number of classes. That's what we've got, and it gives us an object detector for a single object: the largest object, in our case. So let's now create one which doesn't find a single object, but finds 16 objects. An obvious way to do that would be to take this last layer — it's just an nn.Linear with however many inputs and 4 plus c outputs — and rather than having 4 plus c outputs, have 16 times (4 plus c) outputs. So it's now spitting out enough things to give us 16 sets of class probabilities and 16 sets of bounding box coordinates. And then we would just need a loss function that would check whether those 16 sets of bounding boxes correctly represented the up-to-16 objects that were in the image. Now there's a lot of hand-waving about the loss function — we'll go into it later as to what it is — but let's pretend we have one. Assuming we had a reasonable loss function, that's totally going to work.
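As a sketch, that first option — one fully connected layer spitting out one long vector — could look like this (sizes assume the 512x7x7 ResNet backbone output; all names are illustrative):

```python
import torch.nn as nn

k, c = 16, 20                       # 16 predicted objects, 20 classes (assumed)
yolo_style_head = nn.Sequential(
    nn.Flatten(),                   # 512*7*7 backbone output -> vector of 25,088
    nn.Linear(512*7*7, k*(4+c)),    # 16 sets of 4 box coords plus c class scores
)
```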
That is, it's an architecture which has the necessary output activations, and with the correct loss function, we should be able to train it to do what we want it to do. But that's just one way to do it. There's a second way: rather than an nn.Linear, what if instead we took the output of our ResNet convolutional backbone and added an nn.Conv2d with stride 2? The final layer of ResNet gives you a 7 by 7 by 512 result, so this would give us a 4 by 4 by (whatever number of filters we pick) result. Say we picked 256 — so it would be 4 by 4 by 256. Well, actually, no, let's change that; let's do it all in one step and make it 4 by 4 by (4 plus c). Because now we've got a tensor where the number of elements is exactly equal to the number of elements we wanted. So in other words, this would work too: if we created a loss function that took a 4 by 4 by (4 plus c) tensor, mapped it to the 16 objects in the image, and checked whether each one was correctly represented by its 4 plus c activations, that would work. These are two exactly equivalent sets of activations, because they've got the same number of elements — they're just reshaped.

It turns out that both of these approaches are actually used. The approach where you basically just spit out one big long vector from a fully connected linear layer is used by a class of models known as YOLO, whereas the approach using convolutional activations is used by models which started with something called SSD, the single shot detector. What I will say is that since these things came out at very similar times in late 2015, things have very much moved towards the SSD approach — to the point where, this morning, YOLO version 3 came out and is now doing it the SSD way. So that's what we're going to do, and we're going to learn about why it makes more sense as well.

The basic idea is this. Imagine that on top of this we had another stride 2 conv2d — then we'd have something which was 2 by 2 by (again, let's say 4 plus c). That's nice and simple, and it's basically creating a grid that looks something like this: one, two, three, four. That's the geometry of the activations of that second extra stride 2 convolutional layer. (Remember, a stride 2 convolution does the same thing to the geometry of the activations as a stride 1 convolution followed by max pooling, assuming the padding is okay.) So let's talk about what we might do here, because the basic idea is that we want to say: this top-left grid cell is responsible for identifying any object that's in the top left; this one in the top right is responsible for identifying something in the top right; this one the bottom left; and this one the bottom right. In this case, you can actually see what it's done: it said, okay, this one is going to try to find the chair. This one has actually made a mistake — it should have said table — but there are one, two, three chairs there as well, so it makes sense. So each of these grid cells is going to be told, in the loss function: your job is to find the big object that's in that part of the image.

Question: "For the multi-label classification, I saw you had a threshold on there, which I guess is a hyperparameter. Is there a way to learn it?"
We'll get there — let's keep working through this. So, why do we care about the idea that we would like this convolutional grid cell to be responsible for finding things that are in this part of the image? The reason is something called the receptive field of that convolutional grid cell. The basic idea is that throughout your convolutional layers, every piece of those tensors has a receptive field, which means: which part of the input image was responsible for calculating that cell? And like all things in life, the easiest way to see this is with Microsoft Excel.

So do you remember our convolutional neural net? This was MNIST, and we had the number seven. It went through a two-channel filter — channel one, channel two — which therefore created a two-channel output. The next layer was another convolution (so this tensor is now a 3D tensor), which again creates, say, a two-channel output, and after that we had our max pooling layer. So let's look at this part of the output. This is a conv followed by a max pool, but let's just pretend it's a stride 2 conv — that's basically the same thing. Let's see where this number 27 came from. If you've got Excel, you can go Formulas, then Trace Precedents, and you can see this came from these four. Now where did those four come from? They came from, obviously, the convolutional filter kernels, and from these four parts of conv1 — because we've got four cells here, each of which has a 3 by 3 filter, and the overlapping 3 by 3s all together make up a 4 by 4 patch. And where did those come from? Those came from, obviously, our filter, and this entire part of the input image. And what's more, you can see it also comes through the other channel as well. And you can see that these bits in the middle have lots of weights coming out, whereas these bits on the outside only have one weight coming out. So we call this area the receptive field of this activation. But note that the receptive field isn't just saying it's this box — the center of the box has many more dependencies. This is a critically important concept when it comes to understanding architectures and understanding why components work the way they do: the idea of the receptive field. There are some great articles — if you just Google for "convolution receptive field" you can find lots of terrific ones, and I'm sure some of you will write much better ones during the week as well.

So that's the basic idea: the receptive field of this convolutional activation is generally centered around this part of the input image, so it should be responsible for finding objects that are here. So that's the architecture: we're going to have a ResNet backbone followed by one or more 2D convolutions, and for now we're just going to do one, which will give us a 4 by 4 grid. So let's take a look at that. We start with our ReLU and dropout. Then we start with a stride 1 convolution, and the reason we start with a stride 1 convolution is that it doesn't change the geometry at all — it just lets us add an extra layer of calculations.
It lets us create not just a linear layer, but a little mini neural network in our custom head. So we start with a stride 1 convolution, and StdConv is just something I defined up here which does convolution, ReLU, batch norm, dropout. Most research code you see won't define a class like this — instead they'll write the entire thing again and again and again: convolution, batch norm, dropout. Don't be like that: that kind of duplicated code leads to errors and to poor understanding.

I mention that also because this week I released the first draft of the fastai style guide. The fastai style guide is very heavily oriented towards the idea of expository programming, which is the idea that programming code should be something you can use to explain an idea — ideally as readily as mathematical notation — to somebody who understands your coding approach. The idea goes back a very long way, but it was best described in the 1979 Turing Award lecture (the Turing Award is like the Nobel of computer science) by probably my greatest computer science hero, Ken Iverson. He had been working on it since well before 1964, but 1964 was the first example of this approach to programming: he released something called APL. Twenty-five years later he won the Turing Award. He then passed the baton on to his son, Eric Iverson, and there have now been fifty or sixty years of continuous development of this idea: what does programming look like when it's designed to be a notation as a tool for thought, for expository programming? So I've made a very shoddy attempt at taking some of these ideas and thinking about how they can be applied to Python programming, with all the limitations that Python has by comparison.

Anyway, here's a very simple example: if you write all of these things out again and again, it really hides the fact that you've got two convolutional layers, one of stride 1 and one of stride 2. My default for StdConv is stride 2, so here this one is stride 1, and this one is stride 2. And then at the end — the output of this is going to be 4 by 4 — I've got an OutConv. OutConv is interesting: you can see it's got two separate convolutional layers, each of which is stride 1, so it's not changing the geometry of the input. One of them outputs the number of classes (just ignore k for now — k is equal to 1 at this point in the code, so it's not doing anything), and one of them outputs 4. So this is the idea of, rather than having a single conv layer that outputs 4 plus c, having two conv layers — one outputting 4, one outputting c — and returning them as a list of two items. It's nearly the same thing as a single conv layer that outputs 4 plus c, but it lets these layers specialize just a little bit. Like we talked about, when you've got multiple tasks, they can share layers, but they don't have to share all of them. In this case, our two tasks — being a classifier and doing bounding box regression — share every single layer except the very last one. So this is going to spit out two separate tensors of activations: one of the classes and one of the bounding box coordinates. Why am I adding one?
That's because I'm going to have one more class, for background. So if there aren't actually 16 objects to detect, or if there isn't an object in this corner represented by this convolutional grid cell, then I want it to predict background, meaning there's no object there. So that's the entirety of our architecture. It's incredibly simple, but the point is that we now have this convolutional layer at the end. One thing I do do is that at the very end, I flatten out the convolution, basically because I wrote the loss function to expect a flattened-out tensor — but we could totally rewrite it not to do that. I might even try that during the week and see which one looks easier to understand.
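Pulling the pieces just described together — StdConv, OutConv with its extra background class, and the final flatten — a sketch of the whole head might look like this (details illustrative, not necessarily the notebook's exact code; n_clas includes the background class):

```python
import torch.nn as nn
import torch.nn.functional as F

class StdConv(nn.Module):
    """Conv, ReLU, batch norm, dropout — the repeated unit described above."""
    def __init__(self, nin, nout, stride=2, drop=0.1):
        super().__init__()
        self.conv = nn.Conv2d(nin, nout, 3, stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(nout)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        return self.drop(self.bn(F.relu(self.conv(x))))

def flatten_conv(x, k):
    """Flatten a bs x nf x g x g conv output into bs x (g*g*k) x (nf//k)."""
    bs, nf, gx, gy = x.size()
    x = x.permute(0, 2, 3, 1).contiguous()
    return x.view(bs, -1, nf // k)

class OutConv(nn.Module):
    """Two specializing stride 1 convs: one for classes (+background), one for boxes."""
    def __init__(self, k, nin, n_clas):
        super().__init__()
        self.k = k
        self.oconv1 = nn.Conv2d(nin, n_clas * k, 3, padding=1)  # class scores
        self.oconv2 = nn.Conv2d(nin, 4 * k, 3, padding=1)       # box activations
    def forward(self, x):
        return [flatten_conv(self.oconv1(x), self.k),
                flatten_conv(self.oconv2(x), self.k)]

class SSD_Head(nn.Module):
    """ReLU/dropout, a stride 1 conv, a stride 2 conv down to 4x4, then OutConv."""
    def __init__(self, k, n_clas):
        super().__init__()
        self.drop = nn.Dropout(0.25)
        self.sconv0 = StdConv(512, 256, stride=1)  # adds depth without changing geometry
        self.sconv2 = StdConv(256, 256)            # stride 2: 7x7 -> 4x4
        self.out = OutConv(k, 256, n_clas)
    def forward(self, x):
        return self.out(self.sconv2(self.sconv0(self.drop(F.relu(x)))))
```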
Okay, so we've got our data, we've got our architecture; now all we need is a loss function. The loss function needs to look at each of these 16 sets of activations — each of which has four bounding box coordinates and c plus 1 class probabilities — and decide: are those activations a close or a poor reflection of the object closest to this grid cell in the image? And if nothing is there, are you predicting background correctly? That turns out to be very hard to do. Let's go back to the 2 by 2 example to keep it simple. The loss function actually needs to take each of the objects in the image and match them to one of these convolutional grid cells, to say: this grid cell is responsible for this particular object, and that grid cell is responsible for that particular object. Only then can it go ahead and say: okay, how close are the four coordinates, and how close are the class probabilities? This is called the matching problem, and in order to explain it, I'm going to show it to you. But first I'm going to take a break, and we'll come back and understand the matching problem. During the break, have a think about how you would design a loss function here — a function which has a lower value if these 16 times (4 plus c) activations somehow better reflect the up-to-16 objects which are actually in the ground truth image. We'll come back at 7:40.

So here's our goal. Our dependent variable basically looks like that — it's just an extract from our CSV file — and our final convolutional layer is going to be a bunch of numbers, in this case 4 by 4 by... well, c is equal to 20, plus one for background, so 4 plus 21 equals 25. And then we flatten that out into a vector. Our goal, then, is to take some particular set of activations that came out of this model and some particular dependent variable, and build a function that takes both of those in and gives back a higher number if the activations aren't a good reflection of the ground truth bounding boxes, or a lower number if they are. That's our goal: we need to create that function. The general approach will be, first of all, to simplify things down to the 2 by 2 version. Actually, I'll just show you. Here's a model I trained earlier, and I've taken the loss function and split it line by line so that you can see every line that goes into making it.

So let's grab our validation set data loader, grab a batch from it, turn them into variables so we can stick them into a model, put the model in evaluation mode, and stick that data into our model to grab a batch of activations. Remember that the final OutConv returned two items — the classes and the bounding boxes — so we can use destructuring assignment to grab the two pieces: the batch of class outputs and the batch of bounding box outputs. As expected, the batch of class outputs is batch size 64, by 16 grid cells, by 21 classes, and then 64 by 16 by 4 for the bounding box coordinates. Hopefully that all makes sense — and after class, if it's not obvious why these are the shapes, go back and make sure you get to the point where you understand why they are.

Let's now go back and look at the ground truth. The ground truth is in this y variable, so let's grab the bounding box part and the class part, put them into these two Python variables, and print them out. There are our ground truth bounding boxes and our ground truth classes: this image apparently has three objects in it. So let's draw a picture of the three objects — there they are. We already have a show_ground_truth function; this torch_gt (torch ground truth) function simply converts the tensors into numpy and passes them along so we can print them out. So here we've got the bounding box coordinates. You'll notice that they've all been scaled to between 0 and 1 — basically we're treating the image as being 1 by 1, so these are all relative to the size of the image. There are our three classes: 0 is chair, 1 is dining table, and 2 is sofa. This is not a model — this is the ground truth.

And here is our 4 by 4 grid of cells from our final convolutional layer. Each of these square boxes — different papers call them different things: the three terms you'll hear are anchor boxes, prior boxes, or default boxes. Through this explanation you'll get a sense of what they are, but for now, think of them as just these 16 squares. I'm going to stick with the term anchor boxes: these 16 squares are our anchor boxes.

So what we're going to do for this loss function is go through a matching problem, where we take every one of these 16 boxes and see which of the three ground truth objects has the highest amount of overlap with that square. To do that, we need some way of measuring an amount of overlap, and there's a standard function for this: the Jaccard index. The Jaccard index is very simple — I'll do it by example. Take this sofa, and let's take the Jaccard index of the sofa with this grid cell here. What we do is find the area of their intersection, then the area of their union, and take the intersection divided by the union. That's the Jaccard index, also known as IoU — intersection over union. So if two things overlap more, relative to their total combined size, they have a higher Jaccard index. We're going to go through and find the Jaccard overlap for each of these three objects versus each of the 16 anchor boxes, and that's going to give us a 3 by 16 matrix: for every ground truth object and every anchor box, how much overlap is there?
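A sketch of that computation, assuming boxes are stored as (top-left, bottom-right) corner coordinates:

```python
import torch

def intersect(box_a, box_b):
    # Pairwise intersection: min of bottom-right corners minus max of top-left corners
    max_xy = torch.min(box_a[:, None, 2:], box_b[None, :, 2:])
    min_xy = torch.max(box_a[:, None, :2], box_b[None, :, :2])
    inter = (max_xy - min_xy).clamp(min=0)
    return inter[:, :, 0] * inter[:, :, 1]

def box_area(b):
    return (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

def jaccard(box_a, box_b):
    # For 3 ground truth boxes and 16 anchors, returns the 3x16 overlap matrix
    inter = intersect(box_a, box_b)
    union = box_area(box_a)[:, None] + box_area(box_b)[None, :] - inter
    return inter / union
```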
So here are the coordinates of all of our anchor boxes, in this case printed as center, height, and width. And here is the amount of overlap — as you can see, it's 3 by 16: for each of the three ground truth objects, for each of the 16 anchor boxes, how much do they overlap? So you can see here, counting 0, 1, 2, 3, 4, 5, 6, 7, 8: the 8th anchor box overlaps a little bit with the second ground truth object.

What we could do now is take the max over dimension 1 — the max of each row — and that will tell us, for each ground truth object, the maximum amount it overlaps with any grid cell. And remember, in PyTorch, max returns two things: what the max is, and what the index of the max is. So the 14th grid cell has the largest overlap with the first ground truth object, 13 with the second, and 11 with the third. That gives us a pretty good way of assigning each ground truth object to a grid cell: the match is whichever has the highest overlap. But we're also going to do a second thing: look at the max over dimension 0, which tells us, for each grid cell, the maximum amount of overlap across all of the ground truth objects — and particularly interesting here, for each of the 16 grid cells, the index of the ground truth object which overlaps with it the most. (0 is a bit overloaded here: it could either mean the amount of overlap is 0, or that the largest overlap is with object index 0. It's going to turn out not to matter; I just wanted to explain why there are so many 0s.)

So there's a function called map to ground truth, which I'm not going to worry about too much — it's super simple code, though slightly awkward to think about. Basically, what it does is combine these two sets of overlaps in the way described in the SSD paper, to assign every anchor box to a ground truth object. The way it assigns them is: each of these three 'best' matches gets assigned as is — this object to anchor box 14, this one to 13, and this one to 11 — and then each of the rest of the anchor boxes gets assigned to anything it has an overlap of at least 0.5 with. Anything that meets neither criterion — i.e., which isn't a maximum and doesn't have a greater-than-0.5 overlap — is considered to be a cell which contains background. So after we run it, you can see a list of all of the assignments, and anywhere there's a 0 here — in fact, anywhere it's less than 0.5 — it was assigned to background. And you can see those three forced assignments get a high number put in, just to make sure they're assigned.

So we can now go ahead and convert those to classes, then keep just those which have at least 0.5 overlap, and that finally lets us spit out the three classes that are being predicted.
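Going back a step, based on that description, the map to ground truth function is roughly this (the 1.99 is just an arbitrary definitely-above-threshold value for the forced assignments):

```python
def map_to_ground_truth(overlaps):
    # overlaps: (num ground truth objects) x (num anchor boxes)
    prior_overlap, prior_idx = overlaps.max(1)  # best anchor for each gt object
    gt_overlap, gt_idx = overlaps.max(0)        # best gt object for each anchor
    gt_overlap[prior_idx] = 1.99                # force-assign those anchors (always > 0.5)
    for i, o in enumerate(prior_idx):
        gt_idx[o] = i                           # and point them back at their gt object
    return gt_overlap, gt_idx
```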
We can then put those back into the bounding boxes, and here is what each of those anchor boxes is meant to be predicting. So you can see: sofa, dining table, chair — which makes perfect sense if we go back here: this one is meant to be predicting sofa, this one dining table, this one chair, and everything else is meant to be predicting background. So that's the matching stage.

Once we've done the matching stage, we're basically done. We take the activations, grab just those which matched (that's what these positive indexes are), subtract the ground truth bounding boxes from them — just for the positive ones — take the absolute value of the difference, take the mean of that, and that's the L1 loss. Then for the classifications we can just do cross entropy, and as before, we add them together.

So this is what's going to happen: we're going to end up with 16 predicted bounding boxes coming out, most of which will be background — see all these ones that say 'bg' — but from time to time they'll say 'this is a cow', 'this is a potted plant', 'this is a car'. If you're wondering what it predicts for the bounding box of a background cell, the answer is: it's totally ignored. That's why we had that positive-indexes-only thing. If it's background, there's no sense of 'where's the correct bounding box' — so out of all of these, the only ones where the bounding box makes sense are the ones that aren't background.

Now, there are some important little tweaks. One is how we interpret the activations, which is defined here in actn_to_bb (activations to bounding box). Basically, we grab the activations and stick them through tanh — remember, tanh is the same S shape as sigmoid, except it's scaled to be between -1 and 1 rather than between 0 and 1 — and that forces the values into that range. We then grab the actual positions of the anchor boxes and move them around according to the value of the activations divided by two. In other words, each predicted bounding box can be moved by up to 50% of a grid size from its default position, and ditto for its height and width: it can be up to twice as big or half as big as its default size. So that's one thing: we have to convert the activations into some way of scaling and shifting those default anchor box positions.

Another thing is that we don't actually use cross entropy: we use binary cross entropy loss. Remember, binary cross entropy loss is what we normally use for multi-label classification — like in the Planet Amazon satellite competition, where each satellite image could have multiple things in it. If it can have multiple things in it, you can't use softmax, because softmax really encourages just one thing to have the high number. In our case, each anchor box can only have one object associated with it, so it's not for that reason that we're avoiding softmax. It's something else: it's possible for an anchor box to have nothing associated with it. There would be two ways to handle this idea of background. One would be to say: background is just a class, so let's use softmax and treat background as one of the classes that the softmax could predict. A lot of people have done it this way. I don't like that, though, because
asking "can you tell whether this grid cell doesn't have any of the 20 object types I'm interested in, with a Jaccard overlap of more than 0.5?" is a really hard thing to ask a neural network to do — a really hard thing to put into a single computation. On the other hand, what if we just asked, for each class: is it a motorbike? Is it a bus? Is it a person? Is it a bird? Is it a dining table? Then it can check each of those and say no, no, no, no, no — and if it's no to all of them, then it's background. So that's the way I'm doing it. It's not that we can have multiple true labels, but we can have zero true labels. And that's what's going on here: we take our target and do a one-hot embedding with number of classes plus one — so at this stage, we do have the idea of a background class for the one-hot embedding — but then we remove the last column, so the background column is now gone. And so now this vector is either all zeros, basically meaning there's nothing here, or it has at most one 1. And then we can use binary cross entropy to compare our predictions with that target.

That is a minor tweak, but it's the kind of minor tweak that I want you to think about and understand, because it makes a really big difference in practice to your training. And it's the kind of thing you'll see a lot of papers talk about — often, when there's some increment over a previous paper, it'll be something like this: somebody realizing, "trying to predict a background category using a softmax is really hard to do; what if we use binary cross entropy instead?" If you understand what this is doing, and more importantly why we're doing it, that's a really good test of your understanding of the material. And if you don't, that's fine — it just shows you this is something you need to go back and rewatch this part of the video for, talk to some of your classmates about, and if necessary ask on the forum, until you understand what we're doing and why. So that's what this binary cross entropy loss function is doing.
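Putting that one-hot trick into code, a sketch might be (assuming pred has one column per real class and no background column):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCE_Loss(nn.Module):
    """Binary cross entropy where background = a row of all zeros."""
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes
    def forward(self, pred, targ):
        # One-hot embed the targets, including the background class...
        t = torch.eye(self.num_classes + 1, device=pred.device)[targ]
        # ...then drop the background column: background rows become all zeros
        t = t[:, :-1]
        return F.binary_cross_entropy_with_logits(pred, t)
```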
So basically, in this part of the code we've got the custom loss function, the thing that calculates the Jaccard index, the thing that converts activations to bounding boxes, and the thing that does map to ground truth, which we just looked at. All that's left is the SSD loss function — this is actually what we set as our criterion: ssd_loss. What ssd_loss does is loop through each image in the mini-batch and call ssd_1_loss — the SSD loss for one image. So that function is really where it's all happening: it's calculating the SSD loss for one image. We destructure our bounding boxes and classes, and — actually, this is worth mentioning — a lot of code you find out there on the internet doesn't work with mini-batches; it only does one thing at a time, which we don't want. In this case, all of this stuff is working not exactly on a mini-batch at a time, but on a whole bunch of ground truth objects at a time, while the data loader is being fed a mini-batch at a time to do all the convolutional layers. Because we could have different numbers of ground truth objects in each image, but a tensor has to be a rectangular shape, fastai automatically pads anything that's not the same length with zeros. It's something I fairly recently added, and it's super handy — almost no other libraries do it — but it does mean that you then have to make sure you get rid of those zeros. So you can see here I'm finding all of the non-zeros and only keeping those: this is just getting rid of any of the bounding boxes that are actually just padding.

Okay, so: get rid of the padding, turn the activations into bounding boxes, compute the Jaccard overlaps, do the map to ground truth — this is all the stuff we just went through, line by line, underneath. Then check that there's an overlap greater than something around 0.4 or 0.5 (different papers use different values for this), find the things that match, assign the background class to the rest, and finally get the L1 loss for the localization part and the binary cross entropy loss for the classification part, return those two pieces, and add them together.

So that's a lot going on, and it might take a few watches of the video, alongside the code, to fully understand it. But the basic idea is that we now have the three things we need: we have the data, we have the architecture, and we have the loss function. So now we can train: I do my normal learning rate finder and train for a bit, we get down to 25, and at the end we can see how we went. Obviously this isn't quite what we want — in practice you'd remove the background predictions, or apply some threshold — but it's on the right track. There's a dog in the middle with 0.34, there's a bird here in the middle with 0.94 — something's working. I've got a few concerns, though: I don't see anything saying motorcycle here (it says bicycle, which isn't great), and there's nothing for the potted plant that's big enough. But that's not surprising, because all of our anchor boxes were small — it was a 4 by 4 grid. So to go from here to something more accurate, all we're going to do is create way more anchor boxes.

Quick question: "I'm just getting lost in the fact that the anchor boxes and the bounding boxes — how are they not the same? Isn't that how we wrote the loss? I must be missing something." Anchor boxes are the fixed square grid cells — they're in an exact, specific, unmoving location. The bounding boxes are these three things; these 16 things are the anchor boxes.

Okay, so we're going to create lots more anchor boxes, and there are three ways to do that; I've drawn or printed some of them here. One is to create anchor boxes of different sizes and/or aspect ratios — here you can see there's an upright rectangle, a lying-down rectangle, and a square. Question: "For the multi-label classification, why aren't we multiplying the categorical loss by a constant like we did before?" That's a great question — it's because later on, it'll turn out we don't need to. So you can see here — I don't know if you can see this, but if you look, you've basically got one, two, three squares of different sizes, and for each of those three squares you've also got a lying-down rectangle and an upright rectangle to go with it. So we've got three aspect ratios at three zoom levels.
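A sketch of combining those (the specific numbers are just plausible choices, not necessarily the notebook's):

```python
anc_grids = [4, 2, 1]                          # 4x4, 2x2 and 1x1 grids
anc_zooms = [0.75, 1., 1.3]                    # three zoom levels
anc_ratios = [(1., 1.), (1., 0.5), (0.5, 1.)]  # square, lying-down, upright

# Every zoom/ratio combination gives one anchor shape per grid cell
anchor_scales = [(z*h, z*w) for z in anc_zooms for (h, w) in anc_ratios]
k = len(anchor_scales)                          # 9 anchors per grid cell here
```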
That's one thing we can do, and this is for the 1 by 1 grid — in other words, if we added two more stride 2 convolutional layers, we'd get to a 1 by 1 grid, and this is for that. Another thing we could do is use more convolutional layers as sources of anchor boxes. I've randomly jittered these a little bit so they're easier to see: as well as our 16 grid cells, we've also got 2 by 2 grid cells, and we've also got the 1 by 1 grid cell. In other words, if we add three stride 2 convolutions to the end, we'll have 4 by 4, 2 by 2, and 1 by 1 sets of grid cells, all of which have anchor boxes. And then for every one of those, we can have all of these different shapes and sizes. Obviously those two things combine with each other to create lots of anchor boxes — and if I try to print that on the screen, it's just one big blur of color, so I'm not going to do that. So that's all this code is: it says, what are all the grid cell sizes I have for the anchor boxes, what are all the zoom levels, and what are all the aspect ratios? The rest of the code then just goes away and creates the top-left and bottom-right corners in anchor_cnr, and the center, height, and width in anchors. That's all it does, and you can go through it and print out anchors and anchor_cnr.

The key is to remember this basic idea: we have a vector of ground truth stuff, where that stuff is sets of four bounding box coordinates. This is what we were given — it was in the JSON files, it's the ground truth, it's the dependent variable — and for each one, also a class. So: this is a person at this location; this is a dog at that location. That's the ground truth that we're given. Question: "Just to clarify — each set of four is one box?" Yes, exactly: it's top-left x, top-left y, bottom-right x, bottom-right y. That's what we printed here — this is what we call the ground truth. There's no model: this is what we're told the answer is meant to be. And remember, any time we train a neural net, we have a dependent variable, and then we have some black-box neural net that takes some input and spits out some output activations. We take those activations and compare them to the ground truth, calculate a loss, find the derivative of that, and adjust the weights according to the derivative times the learning rate. So the loss is calculated using a loss function.

Rachel: "Something I wanted to say is that I think one of the challenges with this problem is that part of what's going on here is we're having to come up with an architecture that lets us predict this ground truth. Because you can have any number of objects in your picture, it's not immediately obvious what the correct architecture is that's going to let us predict that sort of ground truth."

I guess so, but I'm going to make this claim: as we saw when we looked at YOLO versus SSD, there are only two possible architectures — the last layer is fully connected, or the last layer is convolutional — and both of them work perfectly well.
So the key is to remember this basic idea: we have a vector of ground-truth stuff, where that stuff is sets of four bounding-box coordinates, and for each one, a class. This is what we were given; it was in the JSON files. It's the ground truth, it's the dependent variable: this is a person in this location, this is a dog in this location.

"Just to clarify, each set of four is one box?"

Yeah, exactly: top-left x and y, bottom-right x and y. That's what we printed here; this is what we call the ground truth. There's no model involved; this is what we're told the answer is meant to be. And remember, any time we train a neural net, we have a dependent variable, and then we have some black-box neural net that takes some input and spits out some output activations. We take those activations, compare them to the ground truth, calculate a loss, find the derivative of that, and adjust the weights according to the derivative times the learning rate. The loss is calculated using a loss function.

"Something I wanted to say: I think one of the challenges with this problem is that part of what's going on is we're having to come up with an architecture that lets us predict this ground truth. Because you can have any number of objects in your picture, it's not immediately obvious what the correct architecture is that's going to let us predict that sort of ground truth."

I guess so, but I'm going to make this claim: as we saw when we looked at YOLO versus SSD, there are only two possible architectures (the last layer is fully connected, or the last layer is convolutional), and both of them work perfectly well.

"Sorry, I meant in terms of creating this idea of anchor boxes with different locations and sizes. That's giving you a format that kind of lets you get to the activations. At a high level, you're right, but..."

But you see, that's really entirely in the loss function, not in the architecture. If we used the YOLO architecture, where we had a fully connected layer, there would literally be no concept of geometry in it at all. So I would suggest forgetting the architecture and treating it as a given: it's a thing that spits out 16 x (4 + c) activations. Then our job is to figure out how to take those 16 x (4 + c) activations and compare them to our ground truth. The ground truth is 4 plus 1 numbers per object, but if the class were one-hot encoded it would be 4 + c, and I think that's the easier way to think about it. So call it (4 + c) x m, where m is however many ground-truth objects there are in that particular image. We need a loss function that can take these two things and spit out a number saying how good those activations are; that's what we're trying to do.

To do it, we need to take each one of the m ground-truth objects and decide which set of (4 + c) activations is responsible for it: which one we should be comparing against, to say yes, it's the right class or not, and yes, the box is close or not. The way we could do that is basically to say, okay, the first (4 + c) activations are responsible for predicting the bounding box of the thing closest to the top left, the last (4 + c) are responsible for the thing furthest to the bottom right, and so on for everything in between. That's the matching problem. Now of course, we're not using the YOLO approach, where we have a single vector; we're using the SSD approach, where we spit out a convolutional output. That means it's not arbitrary which activations we match up: we actually want to match up the set of activations whose receptive field most closely covers the place where the real object is. But that's a minor tweak. I guess the easy way to have taught this would have been to start with the YOLO approach, where it's just an arbitrary vector and we can decide which activations correspond to which ground-truth object, as long as it's consistent. It's got to be a consistent rule, because if in the first image the top-left object corresponds with the first (4 + c) activations, and then in the second image we shuffle things around and suddenly it goes with the last (4 + c) activations, the neural net doesn't know what to learn. The loss function needs to encode some consistent task, and in this case the consistent task is: try to make these activations reflect the bounding box in this general area. That's basically what this loss function is trying to do.

"Is it purely coincidence that the 4x4 in the Conv2d is the same as your 16?"

No, not at all a coincidence. That 4x4 conv is going to give us activations whose receptive fields correspond to those locations in the input image; it's carefully designed to make that as effective as possible.
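For reference, the matching machinery can be sketched in a few lines. This is a minimal version, close in spirit to the notebook's jaccard and map_to_ground_truth helpers: each ground-truth object force-grabs the anchor it overlaps best, and every other anchor takes whichever object it overlaps most:

```python
import torch

def jaccard(box_a, box_b):
    # box_a: (m, 4) ground-truth corners; box_b: (n, 4) anchor corners -> (m, n) IoU
    max_xy = torch.min(box_a[:, None, 2:], box_b[None, :, 2:])
    min_xy = torch.max(box_a[:, None, :2], box_b[None, :, :2])
    inter = (max_xy - min_xy).clamp(min=0).prod(dim=2)
    area_a = (box_a[:, 2] - box_a[:, 0]) * (box_a[:, 3] - box_a[:, 1])
    area_b = (box_b[:, 2] - box_b[:, 0]) * (box_b[:, 3] - box_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def map_to_ground_truth(overlaps):
    # best anchor for each object, and best object for each anchor
    prior_overlap, prior_idx = overlaps.max(1)
    gt_overlap, gt_idx = overlaps.max(0)
    gt_overlap[prior_idx] = 1.99          # force-match each object's best anchor
    for i, o in enumerate(prior_idx):     # and record which object that was
        gt_idx[o] = i
    return gt_overlap, gt_idx
```

The 1.99 is just a value above any possible threshold, so a ground-truth object's best anchor always counts as matched, consistently, image after image.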
Now, remember I told you before part two started that the stuff we learn in part two assumes you're extremely comfortable with everything from part one. For a lot of you, you might be realizing now that maybe you weren't quite as familiar with part one as you first thought. That's fine, but just realize you might have to go back and really think deeply and experiment more: what are the inputs and outputs to each layer in a convolutional network, how big are they, what are their ranks, exactly how are they calculated, so that you really, fully understand the idea of a receptive field. What is a loss function, really? How exactly does backpropagation work? These all need to be deeply felt intuitions, and you only get those through practice. Once they're all deeply felt intuitions, you can rewatch this video and think: oh, I see, these activations just need some way of understanding what task they're being given, and that's done by the loss function. The loss function encodes a task.

The task of the SSD loss function is basically two parts. Part one is to figure out which ground-truth object is closest to which grid cell, or rather, which anchor box. When we started doing this, the grid cells of the convolution and the anchor boxes were the same thing, but now we're introducing the idea of multiple anchor boxes per grid cell, which is why it starts to get a little more complicated. For every ground-truth object we have to figure out which anchor boxes are closest to it; for every anchor box we have to decide which ground-truth object it's responsible for, if any. Once we've done that matching, the rest is trivial: going back to single-object detection, it's just the same thing. Once every ground-truth object is matched to an anchor box, in other words to a set of activations, we can basically say: what's the cross-entropy loss on the categorical part, and what's the L1 loss on the coordinate part? So really the matching part is the slightly surprising bit, and then this idea of doing the matching in a way that gives the convolutional network the best opportunity to predict that part of the space is the final cherry on top.

I'll tell you something else: I think this class is by far going to be the most conceptually challenging, and part of the reason is that after this we're going to go and do some different stuff, and then come back to this in lesson 14 and do it again with some tweaks, adding in some of the new things we learn along the way. So you're going to get a whole second run through this material once we add some extra pieces at the end; we'll revise it, as we normally do. Remember, in part one we went computer vision, NLP, structured data, back to NLP, back to computer vision, so we revised everything from the start at the end. It'll be similar here. So don't worry if it's a bit challenging at first; you'll get there.

Okay, so for every grid cell there can be different sizes, different orientations and zooms, representing the different anchor boxes. These are just conceptual ideas; the point is that every one of them is associated with one set of (4 + c) activations in our model. So however many anchor boxes we have, we need that many sets of (4 + c) activations in the model. Now, that does not mean that each convolutional layer needs that many filters.
Remember, the 4x4 convolutional layer already gives us 16 sets of activations, the 2x2 layer gives us four sets, and the 1x1 gives us one set. So we get the 1 + 4 + 16 locations for free, just because that's how a convolution works: it calculates things at every location. We only actually need to multiply by k, where k is the number of zooms times the number of aspect ratios, whereas the grids we get for free through the architecture.

So let's check out that architecture. The model is nearly identical to what we had before, but we're going to have a series of stride-2 convolutions, which will take us through 4x4, 2x2, and 1x1 grids; each stride-2 convolution halves the grid size in both directions. After we do our first convolution to get to 4x4, we grab a set of outputs from that, because we want to save away the 4x4 anchors' activations. Once we get to 2x2 we grab another set, and finally we get to 1x1 and grab a last set of outputs. You can see we've got a whole bunch of these "out conv" layers (the first one we're actually not using), and at the end we concatenate them all together with torch.cat: the 4x4 activations, the 2x2 activations, and the 1x1. That gives us the correct number of activations: one set of activations for every anchor box we have.
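Here's roughly what that head can look like in code. It's a sketch along the lines of the notebook's multi-head model, not an exact copy; it assumes a backbone whose output is a 7x7 feature map with 512 channels (e.g. a ResNet-34 body on 224-pixel input), k anchor boxes per grid cell, and n_clas output scores per anchor (the classes plus background):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StdConv(nn.Module):
    # 3x3 conv + relu + batchnorm + dropout; stride 2 halves the grid size
    def __init__(self, nin, nout, stride=2, drop=0.1):
        super().__init__()
        self.conv = nn.Conv2d(nin, nout, 3, stride=stride, padding=1)
        self.bn, self.drop = nn.BatchNorm2d(nout), nn.Dropout(drop)
    def forward(self, x):
        return self.drop(self.bn(F.relu(self.conv(x))))

def flatten_conv(x, k):
    # (bs, nf, g, g) -> (bs, g*g*k, nf//k): one row per anchor box
    bs, nf, gx, gy = x.size()
    return x.permute(0, 2, 3, 1).contiguous().view(bs, -1, nf // k)

class OutConv(nn.Module):
    # one conv for the class scores, one for the 4 box coords, k anchors per cell
    def __init__(self, k, nin, n_clas):
        super().__init__()
        self.k = k
        self.oconv1 = nn.Conv2d(nin, n_clas * k, 3, padding=1)
        self.oconv2 = nn.Conv2d(nin, 4 * k, 3, padding=1)
    def forward(self, x):
        return flatten_conv(self.oconv1(x), self.k), flatten_conv(self.oconv2(x), self.k)

class SSDHead(nn.Module):
    # stride-2 convs take the 7x7 backbone output to 4x4, 2x2 and 1x1,
    # grabbing (class, box) activations at each scale along the way
    def __init__(self, k, n_clas):
        super().__init__()
        self.sconv0 = StdConv(512, 256, stride=1)
        self.sconv1 = StdConv(256, 256)   # 7x7 -> 4x4
        self.sconv2 = StdConv(256, 256)   # 4x4 -> 2x2
        self.sconv3 = StdConv(256, 256)   # 2x2 -> 1x1
        self.out1 = OutConv(k, 256, n_clas)
        self.out2 = OutConv(k, 256, n_clas)
        self.out3 = OutConv(k, 256, n_clas)
    def forward(self, x):
        x = self.sconv0(F.relu(x))
        x = self.sconv1(x)
        o1c, o1l = self.out1(x)
        x = self.sconv2(x)
        o2c, o2l = self.out2(x)
        x = self.sconv3(x)
        o3c, o3l = self.out3(x)
        # concatenate across scales: (4*4 + 2*2 + 1*1) * k rows of activations
        return torch.cat([o1c, o2c, o3c], dim=1), torch.cat([o1l, o2l, o3l], dim=1)
```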
Then we just set our criterion to the SSD loss as before, go ahead and train, and away we go. In this case I'm printing out the predictions with at least a probability of 0.1, and you can see some things look okay and some don't. Our big objects are fine: the bird has a box with a 0.93 probability in about the right spot, and our person is looking pretty hopeful. But our motorbike has nothing at all reaching a probability of 0.1, our potted plant is looking pretty horrible, and our bus is all the wrong size. So what's going on?

What's going on here will tell us a lot about the recent history of object detection, and these five papers are the key steps in that history. It goes back to about 2013, to a paper called "Scalable Object Detection using Deep Neural Networks". This is what basically set everything up, and when people refer to the MultiBox method, this is the paper they're talking about. It's the one that came up with the idea that you can have a loss function with this matching process built in, and then use that loss function to do object detection; everything since then has been trying to figure out how to make it better.

In parallel, a guy called Ross Girshick was going down a totally different direction. He had these two-stage processes, where the first stage used classical computer vision approaches (finding edges, changes of gradient, and so on) to guess which parts of the image might represent distinct objects, and then fed each of those into a convolutional neural network that was basically designed to figure out: is that actually the kind of object I'm interested in? These were R-CNN and then Fast R-CNN, a hybrid of traditional computer vision and deep learning. What Ross and his team then did was take the MultiBox idea and replace the traditional, non-deep-learning computer vision part of their two-stage process with a convnet. So now they had two convnets: one that spat out what they call region proposals (all the things that might be objects), and a second that took each of those, fed it into a separate convnet, and classified whether or not that particular thing really is an interesting object.

At a similar time, these two papers came out: YOLO and SSD. Both of them did something pretty cool, which is that they got the same kind of performance as the two-stage approach, but in one stage. They took the MultiBox idea and tried to figure out how to deal with the mess of predictions it produces. The basic ideas were things like hard negative mining, where they would go through and throw away all of the matches that didn't look good, plus some very tricky and complex data augmentation and training methods; all kinds of hackery, basically. But they got it to work pretty well.

Then something really cool happened late last year, which is this paper called "Focal Loss for Dense Object Detection" (the network architecture is called RetinaNet), where they actually realized why all this messy stuff wasn't working. I'll describe why, by trying to describe why it is that we can't find the motorbike. Here's the thing: we have three different granularities of convolutional grid, 4x4, 2x2, 1x1. The 1x1 is quite likely to have a reasonable overlap with some object, because most photos have some kind of main subject. On the other hand, most of the 16 cells in the 4x4 grid are not going to have much of an overlap with anything. In this motorbike case it's going to be sky, sky, sky, sky, ground, ground, ground, ground, and finally, motorbike. So if somebody offered you a twenty-buck bet on what some little clip of the image is, and you're not sure, you're going to say "background", because most of the time it is background.

"I understand why we have a 4x4 grid of receptive fields, with one anchor box each, to coarsely localize objects in the image. What I think I'm missing is why we need multiple receptive fields at different sizes. The first version already included 16 receptive fields, each with a single anchor box associated with it; with the additions, there are now many more anchor boxes to consider. Is this because you constrained how much a receptive field could move or scale from its original size, or is there another reason?"

It's kind of backwards: the reason I did the constraining was because I knew I was going to be adding more anchor boxes later. But really, the reason is that the Jaccard overlap between one of those 4x4 grid cells and a single object that takes up most of the image is never going to be 0.5, because the intersection is much smaller than the union when the object is that big. (A 4x4 grid cell covers one sixteenth of the image, so its overlap with a box covering the whole image can never be better than 1/16, about 0.06.) So for this general idea to work, where we say you're responsible for anything you've got a better-than-50-percent overlap with, we need anchor boxes that will regularly have a 50 percent or higher overlap.
And that means we need a variety of sizes, shapes, and scales. This all happens in the loss function; basically, the vast majority of the interesting stuff in object detection is the loss function, because there are only three pieces: the data, the architecture, and the loss function.

So, this is the focal loss paper, "Focal Loss for Dense Object Detection", from August 2017. Here's Ross Girshick, still doing his stuff, and Kaiming He, who you might recognize as the ResNet guy; it's a bit of an all-star cast. And the key thing is the very first figure. The blue line is a plot of binary cross-entropy loss, and the x axis is the probability of the ground-truth class: "it's actually a motorbike and I said, with 0.6 chance, it's a motorbike", or "it's actually not a motorbike and I said, with 0.6 chance, it's not a motorbike". The blue line shows the value of the cross-entropy loss at each such probability. You can draw this in Excel or Python or whatever; it's just a simple plot of cross-entropy loss.

The point is this. Remember, we're doing binary cross-entropy loss, so if the answer is "not a motorbike" and I said "yeah, I think it's not a motorbike, I'm 0.6 sure it's not a motorbike", the blue line is still at a loss of about 0.5, which is still pretty bad. So I have to keep getting more and more confident that it's not a motorbike. If I want to get my loss down, then for all of these things that actually are background, I have to be saying "I am sure that's background", or "I am sure it's not a motorbike, or a bus, or a person, or a dining room table", because if I don't say I'm sure it's not any of those things, I still get loss.

That's why this doesn't work. Even when the model gets to the motorbike and wants to say "I think it's a motorbike", there's no payoff for saying so, because if it's wrong it gets killed, and the vast majority of the time it's not anything; the vast majority of the time it's background. And even when it's not background, it's not enough just to say "it's not background": you've got to say which of the 20 things it is. For the really big things it's fine, because that's the 1x1 grid: it generally is a thing, and you just have to figure out which thing. But the small grid cells are generally not anything, so they'd all prefer to say "I've got nothing to say, no comment". That's why the motorbike prediction is empty, and that's why even when we do have a bus, it's a really big grid cell making the call: those are the only ones confident enough that it's something.

So the trick is to find a different loss function, one that doesn't look like the blue line but more like the green or purple lines; they actually end up suggesting the purple one. Cross-entropy loss is -log(pt), and focal loss is simply (1 - pt)^gamma times the cross-entropy loss, where gamma is some parameter; so it's literally just a scaling of it. They recommend gamma = 2, and using gamma = 2 takes you to this purple line.
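If you want to draw that figure yourself, it really is just a few lines; a quick sketch, assuming matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

pt = np.linspace(0.01, 1, 200)   # probability assigned to the ground-truth class
ce = -np.log(pt)                 # plain cross entropy: the blue line
plt.plot(pt, ce, label='cross entropy (gamma=0)')
for gamma in [0.5, 1, 2, 5]:
    # focal loss just scales cross entropy by (1 - pt)**gamma
    plt.plot(pt, (1 - pt) ** gamma * ce, label=f'focal loss, gamma={gamma}')
plt.xlabel('probability of ground-truth class (pt)')
plt.ylabel('loss')
plt.legend()
plt.show()
```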
So now, if we say "I'm 0.6 sure it's not a motorbike", the loss function says "good for you, no worries".

Okay, so that's what we want to do: replace cross-entropy loss with focal loss. Let me mention a couple of things about this fantastic paper. The first is that the actual contribution of the paper is to add (1 - pt)^gamma to the start of the cross-entropy equation, which sounds like nothing. But actually, people had been trying to figure out this problem for years, and I'm not even sure they all realized it was a problem; there was just this assumption that object detection is really hard, and that you have to do all these complex data augmentations and hard negative mining and whatnot to get the damn thing to work. So first there's the recognition: look, why are we doing all those things? And then the realization: oh, if I do this one thing, the problem goes away. It's fixed. So when you come across a game-changing paper like this, you shouldn't assume you're going to have to write a hundred thousand lines of code. Very often it's one line of code, or the change of a single constant, or adding a log in a single place.

So let's go down to the bit where it all happens, where they describe focal loss; I want to point out a couple of terrific things about this paper. The first is their definition of cross entropy. If you can't write out cross entropy on a piece of paper right now, you need to go back and study it, because we're going to assume you know what it is, what it means, why it's that, and what its shape looks like. Cross entropy appears everywhere: binary cross entropy, categorical cross entropy, the version with the softmax. Most people, most of the time, will see cross entropy written as an indicator on y times log(p), plus an indicator on (1 - y) times log(1 - p); it's awkward notation, and often people use a Dirac delta function or other such stuff. This paper just says: you know what, it's just a conditional. Cross entropy simply is -log(p) if y is 1, and -log(1 - p) otherwise, where y is 1 if it's a motorbike and 0 if not. (Well, in this paper they use 1 if it's a motorbike and -1 if not; we use 0.)

And then they do something that mathematicians never do: they refactor. Check this out. What if we define a new term pt, which is equal to p if y is 1, and 1 - p otherwise? If we do that, we can now redefine CE as simply -log(pt). Which is super cool; it's such an obvious thing to do, but as soon as you do it, all of the other equations get simpler as well.
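Written out, the pieces from the paper are:

```latex
p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}
\qquad
\mathrm{CE}(p_t) = -\log(p_t)
\qquad
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t)
```

and the alpha-balanced variant they settle on is FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where alpha_t is defined from alpha in the same way p_t is defined from p.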
Because in the very next paragraph, they say: hey, one way to deal with class imbalance (i.e., most of the stuff being background) would be to have a different weighting factor for the two cases, some number alpha for class 1 and (1 - alpha) for class 0. But then they say, let's define alpha_t the same way we defined pt, and now our weighted cross entropy can be written just as compactly. Then they write their focal loss with the same trick, and eventually they combine focal loss with class weighting in one clean line. So often, when you see huge equations in a paper, it's just because mathematicians don't know how to refactor; you'll see the same pieces repeated all over the place, very often, and by the time you've turned it into numpy code it's suddenly super simple. So this is a million times better than nearly any other paper: a great paper to read to understand how papers should be, and a terrible paper to read to understand what most papers actually look like.

Okay, so let's try this. Remember, -log(pt) is the cross-entropy loss, so focal loss is just equal to some weight times the cross-entropy loss. Now, when I defined the binary cross-entropy loss earlier, I don't know if you noticed, but it had a weight which by default was None. When you call binary_cross_entropy_with_logits, the PyTorch function, you can optionally pass it a weight, which just multiplies everything elementwise; if it's None, there's no weighting. So since we just want to multiply cross entropy by something, we can just define get_weight, and here is the entirety of focal loss. This is the thing that, late last year, suddenly made object detection make sense and got rid of all the complex, messy hackery. We do our sigmoid, here's our pt, here's our w, and here you can see (1 - pt) to the power of gamma. We set gamma to 2 and alpha to 0.25, and if you're wondering why, here's another excellent thing about this paper: they tried lots of different values of gamma and alpha, and found that 2 and 0.25 work well, consistently. So there's our new loss function: it derives from our BCE loss by adding a weight to it, and other than that there's nothing else to do. We can just train our model again.
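That weighting comes out to something like this. A minimal sketch, assuming the targets arrive already one-hot encoded with the same shape as the logits; the notebook's actual class sits on top of its BCE loss and also handles the background column, so treat this as the idea rather than the exact code:

```python
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    # BCE scaled by w = alpha_t * (1 - pt)**gamma, per the paper
    def __init__(self, gamma=2., alpha=0.25):
        super().__init__()
        self.gamma, self.alpha = gamma, alpha
    def forward(self, logits, targets):
        p = logits.sigmoid()
        pt = p * targets + (1 - p) * (1 - targets)              # prob of true class
        w = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        w = w * (1 - pt).pow(self.gamma)
        # the weight just scales ordinary binary cross entropy
        return F.binary_cross_entropy_with_logits(
            logits, targets, weight=w.detach(), reduction='sum')
```

Note the detach on the weight: it scales the loss but shouldn't itself receive gradients.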
And this time things are looking quite a bit better. We now have motorbike, bicycle, person; it's actually having a go at finding things here. It's still doing a good job on the big objects, and in fact it's looking quite a lot better overall: it's finding quite a few people, it's finding a couple of different birds. It's looking pretty good.

So our last step, for now, is to figure out how to pull out just the interesting stuff. Take this dog and this sofa: how do we pick out the dog and the sofa? The answer is incredibly simple. We go through every pair of these bounding boxes, and if they overlap by more than some amount, say 0.5 using Jaccard, and they're both predicting the same class, we assume they're the same thing, and we keep the one with the higher p value; and we just keep doing that repeatedly. It's really boring code. I actually didn't write it myself; I copied somebody else's. It's called non-maximum suppression, and there's no particular reason to go through it, but that's all it does.

So we can now show the results after non-maximum suppression, and yeah: here's the sofa, here's the dog, here's the bird, here's the person. This person's cigarette looks like a firework or something; I don't know what's going on there. This one is okay but not great: it's found a person and his bicycle, and a person and his bicycle, but this bicycle is a bit in the wrong place, and this person is a bit in the wrong place. You can also see that some of the smaller things have lower p values than we'd hope (the motorbike here is only 0.16), and this one says "car", not "bus". So there's still something to fix, and the trick will be to use something called feature pyramids. That's what we're going to do in lesson 14, and it'll fix this up.

What I wanted to do in the last few minutes of class is talk a little more about the papers, and specifically go back to the SSD paper, "SSD: Single Shot MultiBox Detector". When this came out I was very excited, because it and YOLO were the first single-pass, good-quality object detection methods to come along. I had kind of ignored object detection until then, through all the two-pass stuff with R-CNN and Fast R-CNN and Faster R-CNN, because there's been this continuous repetition of history in the deep learning world: things that involve multiple passes of multiple different pieces, particularly where some of those pieces are not deep learning (as with R-CNN and Fast R-CNN), over time basically always get turned into a single end-to-end deep learning model. So I tend to ignore them until that happens, because that's the point where people have figured out how to express the thing as a deep learning problem, and as soon as they do, they generally end up with something much faster and much more accurate. So SSD and YOLO were really important.

So here's the SSD paper. Let's go down to the key piece, which is where they describe the model, and try to understand it. The model section is basically one, two, three, four paragraphs. Papers are really concise, which means you need to read them pretty carefully; partly, though, you need to know which bits to read carefully. The bits where they say "here we're going to prove the error bounds on this model" you can ignore, because you don't care about proving error bounds. But the bit which says "here is what the model is" you need to read really carefully.

So here's the bit called "Model", and hopefully you'll find we can now read it together and understand it. SSD is a feed-forward convnet that creates a fixed-size collection of bounding boxes, and scores for the presence of object class instances in those boxes. Fixed-size: that is, the convolutional grid times k, the different aspect ratios and zooms, with each one having 4 + c activations. That's followed by a non-maximum suppression step to take that mess of overlapping predictions and turn it into just a few distinct, non-overlapping objects. "The early network layers are based on a standard architecture": so we just use ResNet. This is pretty standard, and you can see this consistent theme, particularly in how the fastai library tries to do things: grab a pre-trained network that already does something, pull off the end bit, stick on a new end bit.
So we take a standard classifier and truncate its classification layers, as we always do (that happens automatically when we use ConvLearner), and we call this the base network; some papers call it the backbone, and so do we. We then add an auxiliary structure, which we've been calling the custom head. The auxiliary structure has "multi-scale feature maps": we add convolutional layers to the end of the base network, and they "decrease in size progressively", i.e. a bunch of stride-2 convolutions. That "allows predictions of detections at multiple scales": the grid cells come in different sizes. And the convolutional model for predicting detections is different for each feature layer, compared to YOLO, which operates on a single feature map: YOLO, as we discussed, is one vector, whereas we have separate convolutions for each grid size.

Each added feature layer gives you a fixed set of predictions using a set of convolutional filters. For a feature layer of size n x n (say 4x4) with p channels, the basic element is a 3 x 3 x p kernel, which produces either a score for a category or a shape offset relative to the default box: in our case, one conv producing the 4 box coordinates and one producing the c category scores; those are our two pieces. At each grid-cell location it produces an output value, and the bounding-box offsets are measured relative to a default box position (what we've been calling an anchor box) relative to the feature-map location (what we've been calling the grid cell). This is as opposed to YOLO, which uses a fully connected layer.

Then they describe the default boxes: what they are for each feature-map cell (in our words, grid cell); that they tile the feature map in a convolutional manner, so the position of each box relative to its grid cell is fixed; and that you end up with (c + 4)k filters applied around each location if there are k boxes per location. These are "similar to the anchor boxes described in Faster R-CNN". If you jumped straight in and read a paper like this without knowing what problem it's solving, why things are done this way, and roughly what the magnitudes are, those four paragraphs would probably make almost no sense. But now that we've gone through it all, you can read those four paragraphs and hopefully think: oh, that's just what Jeremy said, only they said it better than Jeremy, and in fewer words.

I had the same problem when I started reading the SSD paper. I read those four paragraphs and, since at that point I didn't have much background in object detection (I'd decided to wait until we stopped needing two passes), I was like, what the hell? The trick then is to start reading back through the citations. For example (and you should go back and read this paper now), here's the matching strategy. That whole matching strategy I somehow spent an hour talking about is just one paragraph, but it really is all there: for each ground truth, we select from default boxes based on location, aspect ratio, and scale; we match each ground truth to the default box with the best Jaccard overlap; and then we match default boxes to any ground truth with a Jaccard overlap higher than 0.5. That's it; that's the one-sentence version. And then we've got the loss function, which basically says: take the average (divide by the number of matched boxes) of the loss based on the classes plus the loss based on localization, with some weighting factor.
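In the paper's notation, that is:

```latex
L(x, c, l, g) = \frac{1}{N}\left( L_{\mathrm{conf}}(x, c) + \alpha\, L_{\mathrm{loc}}(x, l, g) \right)
```

where N is the number of matched default boxes, L_loc is a smooth-L1 loss on the predicted box offsets l against the ground-truth boxes g, L_conf is the class loss over the confidences c, and alpha is the weighting factor just mentioned.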
Now, with focal loss, I found I didn't really need the weighting factor anymore; the two pieces ended up at about the same scale. Just a coincidence, perhaps.

In this case, as I started reading it, I didn't really understand exactly what l and g and all the rest of the notation were, but it says this is derived from the MultiBox objective. So I went back to the paper that defined MultiBox, and found that in their proposed approach they've also got a section called "training objective" (also known as the loss function), with the same notation, l, g, and so on, and that's where I could go and see the detail. After you read a bunch of papers, you start to recognize things very quickly. For example, whenever you see those double bars with a subscript 2, you'll realize that's how you write mean squared error: it's actually called the 2-norm, the square root of the sum of squared differences, and the superscript 2 squares it, undoing the square root that the norm would normally take, so it's just the sum of squared differences, i.e. an MSE. And any time you see a log(c) here and a log(1 - c) there, you know that's basically binary cross entropy. You're not going to have to read every bit of every equation; you kind of do at first, but after a while your brain just immediately knows basically what's going on: oh, I've got a log(c) and a log(1 - c), so as expected I should have my x and my (1 - x); all the pieces I'd expect to see in a binary cross entropy are there. Having done that, I could see how the two pieces get combined (there's the multiplier I expected), and then come back to the SSD paper and understand what's going on.

So we're going to be looking at a lot more papers, but maybe this week, go through the code and go through the paper and ask: what's going on? Remember what I did to make it easier for you: I took that loss function, copied it into a cell, and split it up so each bit was in a separate cell, and after every cell I either printed or plotted the value. If I hadn't done that for you, you should do it yourself; there's no way to understand these functions without putting things in and seeing what comes out. Okay, so hopefully this is a good starting point. Well, thanks everybody, have a great week, and see you next Monday.