This here is a good old feed-forward neural network. We feed an input in from the left, the network learns some representation in the middle, and it spits out some output values on the right: a prediction. Fundamentally, every neuron just takes a weighted sum of all the incoming activations from the previous layer. We can apply an activation function at every neuron too, which lets the network learn more complex patterns in the input, and in this way the entire input image or audio or text propagates through the network.

But in 2012, Alex Krizhevsky and his collaborators popularized a technique to improve the performance of neural networks. Ordinarily, when you input an image, the information reaches every neuron, so every single weighted edge in the network is influenced by every single input image in the training set. But is this a good thing? You could argue that the more samples that contribute to the weight updates, the better. But we end up with a network where every neuron is trying to learn a representation of everything. So in the end, you get neurons that learn something their neighbors already know, and representations that could have been unique end up weak because of the influence of hundreds of thousands of other samples.

Krizhevsky was able to get the highest performance on image classification with convolutional nets using a simple technique: dropout. Give every neuron a probability of staying active, some value rho. Then flip a biased coin with that probability. If heads, the neuron is left alone. If tails, we turn the neuron off. We flip a coin for every neuron while inputting a sample for training; in math, it's equivalent to sampling from a Bernoulli distribution with parameter rho. So each image only affects the currently active neurons. The result is that every edge weight is now influenced by a different subset of the inputs, so neurons can learn their own representations. The same network with the same set of neurons can now find better patterns in the data without as much overlap between neighboring neurons. (A minimal code sketch of this coin flip follows below.)

So yeah, dropout, simple enough, right? But even though dropout was originally introduced alongside convolutional neural networks, we see it more in traditional, fully connected deep learning architectures. Why is this the case? Well, in convolutional networks, the spatial structure of the input is preserved. Take the convolution operation itself: convolution is the sum of element-wise products between a sliding window and a filter, so the result at any pixel depends on the values of its spatial neighbors. Let's try applying dropout. For an input image, say we turn some neurons off initially. The result of the convolution may not carry the entire representation; that's fine. But say we perform a pooling operation or another convolution after this. There's a good chance that all of the input image's information is still transferred to this layer, because every output draws on a whole neighborhood of inputs. So dropout isn't even serving its purpose, and the model still tends to overfit.

To prevent this, researchers at Google Brain came up with a way to regularize convolutional networks so that pixel information doesn't propagate, and this technique is called DropBlock. Instead of randomly turning off individual neurons in a layer, we randomly turn off a cluster of neighboring neurons in that layer. This means that even with the convolution operation, there are some parts of the input that don't propagate through the network.
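Before we go further, here's a minimal PyTorch sketch, my own illustration rather than anything from the video, of the two ideas just described: an element-wise dropout mask sampled from a Bernoulli distribution, followed by a convolution. The tensor shapes, keep probability, and averaging filter are arbitrary choices for the demo; the point is to see how the convolution's neighborhood sums let information leak past the dropped neurons.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A 1x1x10x10 feature map (batch, channels, height, width) of ones,
# so any nonzero output proves that information got through.
x = torch.ones(1, 1, 10, 10)

# Standard dropout: flip an independent biased coin for every neuron.
# keep_prob is the probability the neuron is left alone ("heads").
keep_prob = 0.8
mask = torch.bernoulli(torch.full_like(x, keep_prob))
x_dropped = x * mask  # "tails" neurons are zeroed out

# A 3x3 averaging filter: convolution is the sum of element-wise
# products between this sliding window and the input patch under it.
w = torch.full((1, 1, 3, 3), 1.0 / 9.0)
y = F.conv2d(x_dropped, w, padding=1)

# Roughly 20% of inputs are zeroed, yet almost no outputs are: each
# output pixel pools 9 neighbors, so isolated zeros rarely block
# information flow. That is the weakness DropBlock targets.
print(f"inputs zeroed:  {(x_dropped == 0).float().mean().item():.0%}")
print(f"outputs zeroed: {(y == 0).float().mean().item():.0%}")
```

Run it a few times with different seeds: the dropped fraction of inputs hovers around 1 − keep_prob, while the convolution's output almost never loses a pixel entirely.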
This is the effect we want, because now every neuron learns from different regions of every sample, so the model can generalize better. Makes sense, right? For those curious about the mathematical intuition, let's dive deeper into the algorithm.

Say we want to perform some convolution operation. The input is a set of activations from the previous layer; for now, let's say it's a 10×10 grid. If we want to incorporate DropBlock, the first hyperparameter we need to decide is the size of the region to turn off, which obviously must be less than the size of the grid. Let's set it to five. In other words, when we turn a neuron on or off, we turn on or off a 5×5 patch with that neuron as the center.

We now construct a binary mask. A binary mask is basically a grid with the same dimensions as the input from the previous layer, but whose values can only be zero or one. In this mask, only the inner 6×6 cells can be the center of a 5×5 region, since a center needs two cells of margin on every side. Iterate over each of these cells and flip a biased coin. If heads, leave it alone. If tails, things get a little interesting: we don't just turn that single neuron off, we turn off the whole 5×5 region with that neuron as its center. This way, we create a mask over the activations. After applying the mask to the neurons by element-wise multiplication, we have entire spatial regions turned off, and this prevents information flow.

Clearly, DropBlock is parameterized by two variables. The first is block_size, the width and height of the region to turn off; we set it to five in our example. The second is gamma, the parameter of the Bernoulli distribution, the bias of the coin flip I was talking about. We don't set gamma directly. The paper derives it from a keep probability, keep_prob, which is typically a value between 0.75 and 0.95: gamma = ((1 − keep_prob) / block_size²) × (feat_size² / (feat_size − block_size + 1)²), where feat_size is the width of the feature map. For our example (feat_size = 10, block_size = 5, keep_prob = 0.9), that gives gamma = (0.1 / 25) × (100 / 36) ≈ 0.011, so each candidate center is dropped with only about a 1 percent chance; gamma is small because every dropped center takes out 25 units at once.

The results when applied to different architectures are significant. We see up to about a 2 percent accuracy improvement in various convolutional network architectures like ResNet and AmoebaNet.

Let's take a look at some PyTorch code to implement this. This isn't my own code, but it's a pretty good example to get your hands dirty. The DropBlock class has two arguments in its constructor, the same ones I mentioned: drop_prob, from which gamma, the parameter of our Bernoulli distribution, is computed, and block_size, the size of the region to turn off, which was five in our example. The forward method takes an input, computes gamma using that awesome formula, and computes the mask. If we consider the 10×10 grid with a 5×5 block, we sample from a Bernoulli distribution 6 × 6 = 36 times per input image to generate the mask. We then apply the mask to switch the neurons on or off. The same principle applies to 3D inputs as well. (A sketch along these lines appears at the end of this transcript.)

DropBlock is a very interesting tweak to neural network architectures that actually makes sense intuitively, and it should be used more with models where the input data exhibits spatial correlations. I have a few videos on convolutional networks explaining how they work and some training applications in computer vision, so be sure to check them all out. There's always something interesting coming up, and I'm hoping to cover as much as I can. Thanks for stopping by today. Hit that like, smash that subscribe, ring that bell, share the video with whomsoever, and I'll see you in the next one.
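For reference, here is a minimal sketch of a DropBlock layer along the lines described above. This is not the exact code from the video: the class and argument names (I parameterize by keep_prob rather than drop_prob), the max-pooling trick for growing sampled centers into blocks, and the odd-block-size, square-feature-map assumptions are all my own choices for the illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DropBlock2D(nn.Module):
    """Sketch of DropBlock for (batch, channels, height, width) inputs.

    keep_prob:  probability of keeping a unit (typically 0.75-0.95).
    block_size: width/height of the square region to turn off (odd).
    """

    def __init__(self, keep_prob: float = 0.9, block_size: int = 5):
        super().__init__()
        self.keep_prob = keep_prob
        self.block_size = block_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:  # like dropout, a no-op at eval time
            return x
        feat = x.shape[-1]
        valid = feat - self.block_size + 1  # cells that can be a block center
        # gamma from the paper's formula; e.g. (0.1 / 25) * (100 / 36) ~ 0.011
        # for a 10x10 map with block_size 5 and keep_prob 0.9.
        gamma = ((1.0 - self.keep_prob) / self.block_size ** 2) * (
            feat ** 2 / (feat - self.block_size + 1) ** 2
        )
        # Flip one biased coin per valid center: 6 x 6 = 36 flips per
        # channel in our running example.
        centers = torch.bernoulli(
            torch.full((x.shape[0], x.shape[1], valid, valid), gamma,
                       device=x.device)
        )
        # Pad the centers back out to the full grid, then grow each
        # sampled center into a block_size x block_size patch of ones
        # with a stride-1 max pool; invert so 1 = keep, 0 = drop.
        pad = self.block_size // 2
        centers = F.pad(centers, [pad] * 4)
        block_mask = 1.0 - F.max_pool2d(
            centers, kernel_size=self.block_size, stride=1, padding=pad
        )
        # Element-wise multiply, then rescale so the expected activation
        # magnitude matches eval time (mirrors inverted dropout).
        return x * block_mask * block_mask.numel() / block_mask.sum()
```

In training mode, `DropBlock2D(keep_prob=0.9, block_size=5)` applied to a 10×10 feature map samples exactly the 36 coin flips from the walkthrough; in eval mode it passes the input through unchanged, just like standard dropout.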