Hey everyone, I know it's been a while since I uploaded. A lot of life-changing events have happened over the last few months, and I wasn't in the best mental state. I posted about this a few months ago too, and I just wanted to thank all of you who reached out to me personally, either in the comments or on one of my videos, for your words of encouragement. It really means a lot, and I really love this community so much. So let's keep learning together.

All right, now that we've got that out of the way, let's concentrate on the main focus tonight: what do neural networks really learn? Neural networks have had so much hype surrounding them in the last decade, especially in computer vision, and much research has gone into different problems in the field. But let's direct our focus to just image classification for now, mostly because it's easy to wrap our heads around and it's well explored. For classification problems, we train a model to increase classification accuracy, and over time we've seen convolutional neural networks get deeper and deeper and deeper, all with one goal: increase classification accuracy. While this is impressive, significantly less research has gone into what the inner layers of a neural network learn internally during training.

If we were to visualize the layers of a convolutional neural network, we would see that the initial layers tend to learn low-level features like edges and shapes, while the later layers become more abstract and learn texture and other features that aren't human-interpretable. This makes sense: the output of the initial layers still resembles the image, so they pick up on more visual cues, whereas the later layers have seen a series of transformations by that point, so we wouldn't know what to make of them from the get-go. But even these later layers can be shown to have learned something other than non-interpretable gibberish. In this video, we're going to walk through a way to map this seemingly meaningless static into something that's more meaningful to us humans, so we better understand what different parts of the network have learned.

This is going to be a multi-pass explanation. We'll first walk through exactly how humans would classify images and what humans learn along the way. The second pass will mostly focus on how a neural network would actually learn to classify these images and what it learns along the way. Then, towards the end of the video, we'll explain certain processes that occur along that path in detail.

So, pass one. Let's take the problem of classifying an image as a ski resort or not a ski resort. Here there are six images. Can you identify which ones are ski resorts? Yep, all of them. Not too tough. But how exactly did you identify them? Well, I see a lot of snow. I see buildings. Sometimes I see trees and mountains. Sometimes I see people with skiing equipment. And a combination of these indicators tells me that I am looking at a ski resort.

Now interestingly, a neural network can also identify these different objects and textures in an image. In other words, a network that is only trained to recognize ski resorts will inherently know what snow is, even though it hasn't been explicitly trained to recognize snow. Now ain't that fascinating? This is the big picture of what a neural network does, and we're now going to walk through how the neural network actually does this in pass two.

So let's visualize what the neurons have learned when we train them to classify ski resorts. To do this, we need to answer two questions. First, what is a neuron in the context of a convolutional neural network? And second, how do we visualize that neuron?
So in a feed-forward neural network, this is a neuron and this is a layer. For convolutional neural networks, a conv layer would be one of these blocks. Each of these conv blocks is made up of many filters that act on the output of the previous block, and these filters can be considered the neurons.

So let's take an example. Here is the VGG16 neural network architecture, which has five convolutional blocks; each block here is a layer. The fifth block of the VGG architecture has a shape of 14 x 14 x 512. This block was created because the previous block's 28 x 28 x 512 output was downsampled by a 2 x 2 max pool to 14 x 14, and then each of this block's 512 filters was convolved over that input to produce a 14 x 14 output. Those 512 outputs of size 14 x 14, stacked together, make up the block that you see right here. For our analysis, let's consider the output of a single neuron, that is, the output of a single filter: one 14 x 14 map.

Now how do we visualize this neuron? Now that we've identified a neuron, we want to visualize what it learns. If we were to just look at this 14 x 14 output as an image, it might look something like this: pixelated gibberish. To make sense of it, we want to see what effect this filter has on the original input image. Effectively, we need to mask the image with this tiny output, and there are two steps to accomplish that. First, we scale this 14 x 14 output up to the same size as the image, which is 224 x 224. Then we apply the result to the original image as a mask. We can do the first step with an upsampling technique; one such technique is bilinear interpolation. The second part we can do by thresholding the upsampled output into a binary activation mask. Once we have the activation mask and we apply it onto the image, we can visually see what the filter has actually learned with respect to that image. For example, we would be able to determine that this particular filter is a snow detector.

Now that was pass two of the explanation, and now I'm going to explain the two major processes: bilinear interpolation and that activation mask I talked about.

Okay, so the situation now is that a filter in the fifth block of the convolutional neural network produced this 14 x 14 gibberish static output, and we want to scale it up to 224 x 224, which is the size of the input image. We do this with a technique called bilinear interpolation. To explain this process, I'm going to take a very simple toy example. We have a small 2 x 2 grid of pixels with some grayscale values, and we want to upsample this to a 5 x 5 image. Now, interpolation means filling in the missing values. The goal here is to interpolate every one of these missing points using the points that we have.

So let's fill in the pixel at the second row and second column to get an idea of this process. We do this in two steps. First, we interpolate linearly along the x axis to get the values at positions (1, 2) and (5, 2). Then we interpolate linearly along the y axis between those two to get the value at the second row and second column, (2, 2). We have two steps of linear interpolation, hence it's called bilinear interpolation. Fascinating!
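The toy example above can be sketched in code. This is a minimal NumPy implementation of bilinear interpolation (not from the video itself), written in the "align corners" style so that it reproduces the exact numbers in the walkthrough:

```python
import numpy as np

def bilinear_upsample(src, out_h, out_w):
    """Upsample a 2-D array with bilinear interpolation (align-corners style)."""
    in_h, in_w = src.shape
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Map each output coordinate back into source coordinates.
            y = i * (in_h - 1) / (out_h - 1)
            x = j * (in_w - 1) / (out_w - 1)
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            dy, dx = y - y0, x - x0
            # Two linear interpolations along x, then one along y.
            top = (1 - dx) * src[y0, x0] + dx * src[y0, x1]
            bot = (1 - dx) * src[y1, x0] + dx * src[y1, x1]
            out[i, j] = (1 - dy) * top + dy * bot
    return out

# The 2 x 2 toy grid from the example: top row 100 and 200, bottom row 150 and 150.
grid = np.array([[100.0, 200.0],
                 [150.0, 150.0]])
up = bilinear_upsample(grid, 5, 5)
print(up[0, 1])  # 125.0   (first row, second column)
print(up[1, 1])  # 131.25  (second row, second column)
```

The same function, called with `out_h = out_w = 224` on a 14 x 14 filter output, performs the upsampling step described here; in practice you would use a library routine such as a bilinear resize from an image library instead of this explicit loop.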
So let's do it. The pixel at the first row, second column is one unit from the 100 and three units from the 200. So 75% of its value comes from the 100 and 25% comes from the 200, and we interpolate it as 0.75 x 100 + 0.25 x 200 = 125. Similarly, the pixel at the fifth row, second column gets 75% of its value from the first 150 and 25% from the last 150, so its value is interpolated as 150. Like this we could interpolate the other values in the x direction, but we don't need them right now.

We now interpolate along the y axis. The pixel at (2, 2) gets 75% of its value from the 125 and 25% from the 150, so the interpolated value is 0.75 x 125 + 0.25 x 150 = 131.25. And in this way we can interpolate every pixel in the image. Now, coming back to our original problem, we can perform bilinear interpolation on our 14 x 14 output to fill in the missing pixel values and make the output the same dimensions as the 224 x 224 image. And step one is complete.

Now that we have a 224 x 224 upsampled filter output, we want to convert it to a mask and apply it to the image. This can be done with a simple threshold. The mask is black and white: it has values of either 0 or 1. We take all the 224 x 224 pixel values, arrange them in ascending order, and determine the value that is greater than 90% of the other pixels, and we use this 90th-percentile value as a threshold. If a pixel is in the lower 90%, it becomes 0; if it is in the top 10%, it becomes 1. And so we end up with a binary mask. Now we apply this onto the image to see what parts of it are being recognized by the filter, and voila: we're able to see semantically what is being detected in the image by this filter.

But what are we looking at? I can see this filter is looking at a bunch of snow, but how do we quantify this and say definitively that this is a snow detector?
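The thresholding step can be written in a few lines of NumPy. This is a sketch, not the video's actual code, and the random arrays stand in for a real upsampled activation map and a real input image:

```python
import numpy as np

# Stand-in for a 224 x 224 upsampled filter activation map.
rng = np.random.default_rng(0)
activation = rng.random((224, 224))

# Threshold at the 90th percentile: pixels in the top 10% become 1,
# everything else becomes 0.
threshold = np.percentile(activation, 90)
mask = (activation > threshold).astype(np.uint8)

# Apply the binary mask to a (stand-in) grayscale image of the same size;
# only the regions the filter responds to survive.
image = rng.random((224, 224))
masked = image * mask

print(mask.mean())  # ≈ 0.10, since about 10% of pixels pass the threshold
```

For a real color image you would broadcast the mask across the channel axis (e.g. `image * mask[..., None]`), but the idea is the same.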
This is done with a metric called IoU, that is, intersection over union. It's basically a metric that measures overlap. We need a dataset with an image and the corresponding mask for every single object in the image; I'll provide some example datasets in the description. We compute the IoU between each of these object masks and the mask produced by our network, and the object with the largest IoU is what we say the filter detects. Like this, we show the filter a bunch of images and determine which object it segments, and for this example filter you'll see that it detects snow in an image a lot. So we effectively created a snow detector without explicitly training the network to detect snow: it just learned to detect snow while learning to recognize ski resorts. This is very meaningful, and it shows that these network filters are actually learning things that are meaningful to humans.

And interestingly, if we take the original ski resort classifier network and just remove this filter and three others that detect things like mountain tops, houses, and tree tops, we would see a significant dip in performance of over 15%. That is, even if we retain 508 of the 512 filters in our convolutional neural network, we still see a significant drop of 15% in performance. So basically, this helps us identify which parts of the network are the most important, and it even helps us make decisions when building more efficient networks.

Well, that's a wrap. I hope you all enjoyed this video. I have some links to references in the description below. Stay well and stay safe. Bye!
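For binary masks, intersection over union is straightforward to compute. Here is a minimal sketch with tiny 4 x 4 toy masks (the masks and the "snow" label are made up for illustration; real object masks would be 224 x 224 and come from a segmentation dataset):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union > 0 else 0.0

# Toy example: the mask our filter produced vs. a ground-truth "snow" mask.
net_mask  = np.array([[1, 1, 0, 0],
                      [1, 1, 0, 0],
                      [0, 0, 0, 0],
                      [0, 0, 0, 0]])
snow_mask = np.array([[1, 1, 0, 0],
                      [1, 0, 0, 0],
                      [0, 0, 0, 0],
                      [0, 0, 0, 0]])

print(iou(net_mask, snow_mask))  # 3 pixels overlap out of 4 in the union -> 0.75
```

In the procedure described above, you would compute this score between the filter's mask and every labeled object mask in the image, and assign the filter the object label with the highest IoU.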