Say I give you the Deep Learning book along with the question: how is convolution equivariant with respect to translation? What would you do to answer this question? Well, one way is to read the entire book and, assuming you remember everything you've read, try to answer the question. But there's a better way. Since it's a question about convolution, I flip to the chapter on convolutional neural networks. Then I find that equivariance is one of the properties discussed, and read just that page, or at least that part of the page. Which do you think is the faster method? If we read the entire text, as in the first method, answering the question may take us a few weeks. With the second method, the same can be done within a few minutes. That's a very big difference. Furthermore, the answer we get from reading the entire book may be vaguer, as it's based on too much information. What did we do differently here? In the former case, we didn't focus on any part of the book specifically, whereas in the latter case, we focused our attention on the chapter on convolutional neural networks and then narrowed it further to the part where the concept of equivariance is explained. This second approach is the exact thought process many of us humans would take. It's quite intuitive.

Given this example scenario, we can now better define attention. Attention mechanisms found in neural networks are somewhat similar to those found in humans: they focus in high resolution on certain parts of the input while the rest of the input is in low resolution, or blurred. In this video, I'm going to talk about the attention mechanism applied to image inputs.

Let's take a look at visual attention at a higher level. Consider the problem of determining appropriate captions for an input image, based on the paper Show, Attend and Tell. This normally consists of two steps. First, we encode the image into an internal vector representation H using a convolutional neural network.
Then we decode H into word vectors signifying the caption using a recurrent neural network. The problem with this method is that when generating a single word of the caption, the LSTM looks at the entire image representation H every time. This is not very efficient, as we usually generate different words of a caption by looking at different, specific parts of an image. To solve this problem, we split the image into n different non-overlapping subregions. Now h_i is the internal feature representation used to generate the i-th word; it is not necessarily the representation of the i-th region of the original image. I'll explain this in a bit. For now, the figure on screen is a high-level diagram of attention. When the decoder decides on a caption, for every word it only looks at specific regions of the image, leading to a more accurate description.

Now that's good, but how exactly does it decide which region or regions to consider? This is the crux of the attention mechanism. An attention unit considers all subregions and the context as its input, and it outputs the weighted arithmetic mean of these regions. The weighted arithmetic mean here is the inner product of the region values and their probabilities. How are these probabilities, the weights, determined? They are determined using the context C. The context represents everything the recurrent neural network has output until now.

Let's take a closer look at what happens. We have input regions Y from the convolutional neural net and the context C from the RNN. These inputs are multiplied by weights which constitute the learnable parameters of the attention unit. This means the weight vectors update as we get more training data. We apply a tanh activation so that very high values have very small differences between them and sit close to 1, while very low values also have very small differences and sit close to -1. This leads to a much smoother choice of regions of interest within each subregion; it is more fine-grained, so to speak.
Note that we don't necessarily have to apply a tanh function. We only need to ensure that the regions we output are relevant to the context. In the simplest form, this similarity can be determined with a simple dot product between the regions Y and the context C: the more similar they are, the higher the product, so the output is guaranteed to weight the more relevant region y_i higher. The difference between using the simple inner product and the tanh function is the granularity of the output regions of interest; tanh is more fine-grained, with less choppy, smoother parts of the subregions chosen. Regardless of how they are calculated, these scores M are then passed through a softmax function, which outputs them as probabilities S. Lastly, we take the inner product of this probability vector S and the subregions Y to get the final output Z, the relevant regions of the entire image. Note that the probabilities S correspond to the relevance of the subregions Y given the context C.

Now, there are two types of attention mechanisms: soft attention and hard attention. The main difference is that in soft attention, the relevant region Z consists of different parts of different subregions Y, whereas in hard attention, the relevant region Z consists of only one of the regions Y. I'll explain them both in detail. The entire mechanism of attention that I described until now is soft attention: Z has relevant parts of different regions. Soft attention is deterministic. So what does deterministic mean? A system is said to be deterministic if applying an action A to a state S always leads to the same state S'. A dumb example: you're at a corner of your room at coordinate (0, 0), facing forward. Consider an action A which is moving 5 feet forward. The system is now at a new state with the coordinates (5, 0), still facing forward.
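To make this concrete, here is a minimal NumPy sketch of the soft attention computation just described: score each region against the context with a tanh unit, pass the scores through a softmax to get probabilities s, and take the weighted arithmetic mean of the regions. The parameter names W_y, W_c, and w are hypothetical stand-ins for the learnable weights, not notation from the paper:

```python
import numpy as np

def soft_attention(Y, c, W_y, W_c, w):
    """Soft attention over n subregions (a sketch; W_y, W_c, w are
    assumed learnable parameters of the attention unit).

    Y : (n, d) subregion features from the CNN
    c : (k,)   context vector from the RNN
    """
    # Score each region against the context with a tanh activation
    m = np.tanh(Y @ W_y + c @ W_c)     # (n, h) smoothed scores
    scores = m @ w                     # (n,)  one scalar per region
    # Softmax turns the scores into probabilities s_i
    e = np.exp(scores - scores.max())
    s = e / e.sum()
    # Weighted arithmetic mean: z = sum_i s_i * y_i
    z = s @ Y                          # (d,)  final relevant region
    return z, s
```

With the same Y and c, this always returns the same z, which is why soft attention is deterministic; swapping the tanh scoring for a plain dot product between Y and C gives the coarser variant mentioned above.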
No matter how many times you stand at that corner of your room, facing forward, and walk 5 feet forward, you will always end up 5 feet from the corner and facing forward. Try it, trust me, it works. Hence the system is deterministic. Let us apply the same concept to soft attention. Initially we have an image split into a number of regions Y, with an input context C. This is our initial state. On applying soft attention, we end up with a localized image representing the new state S'. These regions of interest are determined from Z. The ROIs will always be the same regardless of how many times we execute soft attention with the same inputs, because we consider all the regions Y anyway to determine Z.

Now consider hard attention. Looking at the architecture, hard attention is very similar to soft attention. However, instead of taking the weighted arithmetic mean of all regions, hard attention considers only one region, chosen randomly. So hard attention is a stochastic process. Now, what does stochastic mean? When you hear the word stochastic, think randomness. In a stochastic process, performing an action A on a state S may lead to different states each time. An example is a board game with dice, like snakes and ladders: the initial state is the position of the players, the action is rolling a die, and depending on the roll there are multiple possibilities for the next board state. What makes hard attention stochastic is that a region y_i is chosen randomly with probability s_i. This means the more relevant a region y_i as a whole is to the context, the greater the chance it is chosen for determining the next word of the caption. Using the caption words output until now by the RNN, that is, the context, along with the current regions of interest in the image determined by the attention mechanism, the RNN now tries to predict the next word in the caption.
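The sampling step that makes hard attention stochastic can be sketched as follows, assuming we already have the probability vector s from the softmax (the helper name and the use of NumPy's random generator are illustrative choices, not from the paper):

```python
import numpy as np

def hard_attention(Y, s, rng=None):
    """Hard attention sketch: pick exactly one region y_i with
    probability s_i. Repeated calls with the same inputs may return
    different regions, which is what makes this stochastic.

    Y : (n, d) subregion features
    s : (n,)   probabilities from the softmax (must sum to 1)
    """
    rng = np.random.default_rng() if rng is None else rng
    i = rng.choice(len(Y), p=s)  # sample one region index
    return Y[i], i
```

Contrast this with the soft case: instead of blending all regions by their probabilities, a single region is drawn, so two runs on the same image and context can attend to different subregions.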
As far as performance is concerned, in the paper Show, Attend and Tell, released by the University of Toronto and the University of Montreal in 2016, results vary with the dataset. Soft and hard attention both performed decently well, with hard attention performing slightly better. This is pretty cool, right? So where else can we use attention? Attention is not only used for image inputs. For example, neural machine translation (NMT) systems are used to translate one language into another. Words are fed in sequence to an encoder, one after another, and the sentence is terminated by a specific input word or symbol. Once complete, that special signal initiates the decoder phase, where the translated words are generated. Another cool application is Microsoft's attentional generative adversarial network, or AttnGAN, which can create images from text through natural language processing. It can perform fine-grained tasks like generating parts of an image from a single word in the description. Yet another application appears in the paper Teaching Machines to Read and Comprehend, where the authors do the same thing I talked about at the beginning of the video: a recurrent neural network takes some text and a question as input and is made to output an answer.

Here are some things to remember. Attention involves focusing in high resolution on certain parts of an input while the rest of the input is in low resolution, or blurred. Two types of attention are soft attention and hard attention; soft attention is deterministic, while hard attention is stochastic. Attention can be used for non-image inputs, as in neural machine translation, attention GANs, and answering questions from text. And that's all I have for you now. I hope you guys got some newfound understanding of attention and its applications in this video. I have left a link to the main paper, Show, Attend and Tell, along with other papers and blog posts in the description down below.
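As a rough illustration of how the same idea carries over to sequence inputs like NMT, here is a hypothetical sketch of one decoder step's attention: score each encoder hidden state against the current decoder state with a dot product, softmax the scores, and return the weighted mean as a context vector. The function and variable names are my own assumptions, not from any specific system:

```python
import numpy as np

def nmt_attention_step(enc_states, dec_state):
    """Illustrative attention step for a sequence model.

    enc_states : (T, d) one hidden state per source word
    dec_state  : (d,)   current decoder hidden state
    """
    # Dot-product similarity between each encoder state and the decoder state
    scores = enc_states @ dec_state        # (T,)
    e = np.exp(scores - scores.max())
    a = e / e.sum()                        # attention weights over source words
    return a @ enc_states                  # (d,) context vector for this step
```

The structure is identical to the image case: subregions become source-word states, and the context vector steers which words the decoder "looks at" while emitting each translated word.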
Don't forget to give the video a thumbs up and subscribe for more awesome content.