Image captioning. Given an image, we want to summarize it in a phrase or a sentence. What I'd usually do in my videos on concepts like this is take some state-of-the-art paper and try to dissect its approach, but I want to do things a little differently this time. Let's try to devise our own approach to image captioning and compare it to the state of the art. The goal of this kind of exercise is to get you thinking like an AI researcher, so that you can come up with similar architectures for any problem, and to understand in the process that those researchers aren't really superhuman.
So we have the problem of image captioning, and we want to determine how to use neural networks to solve it, from scratch, using only our understanding of neural nets. When you hear the term neural network, what do you think of? Perhaps an interconnection of neurons that takes raw data as input, performs some hocus pocus in the middle, and spits out probabilities in classification problems or real values in regression problems. This notion of neural nets isn't incorrect. But with that understanding, can you really say what these layers represent? What exactly are these layers? To answer this, it's better to think of neural nets from a more mathematical perspective. They're basically mathematical functions that transform one kind of variable into another kind: vectors to vectors, as we would see in classification problems, or vectors to scalars, as we would see in regression networks. We treat the interconnections in every layer as a transformation on the input, so each layer is simply another vector representation of the same input. This is going to be important shortly, so keep it in mind.
Let's define the structure of our problem now, identifying the inputs and the outputs. The input to an image captioner is some kind of image, a matrix or a tensor. The output is a sentence, which is a sequence: a set of variables with a defined ordering. Sentences are sequences because one word has to come after another, in that order, for the sentence to have meaning. I said before that neural nets are mathematical functions that map one kind of variable to another. Now, if one of those variables is a sequence, we get a recurrent neural network. Or at least, that's the first thing we would think of for such problems. So because the image captioner outputs a sequence, recurrent neural networks come into play. Like any other network, it helps to think of recurrent nets as mathematical functions: they map sequences to vectors, vectors to sequences, or sequences to other sequences. Image captioning falls under the vector-to-sequence case. Sure, the input image isn't exactly a natural vector, but the output sentence is most certainly a sequence. Cool, so we have the last part of our architecture: a recurrent neural network.
Let's now take a look at our input. It's an image. To feed it to our neural network, we need it in some kind of vector format. The first thing that comes to mind is simply flattening the input image, so the matrix or tensor becomes a one-dimensional vector. This works, but it throws away the spatial structure of the image and leaves us with a huge, inefficient representation. A better way to represent an image is through convolutional neural networks. Consider a basic convolutional network architecture: the LeNet-5 architecture, with the basic convolution, activation, and pooling layers, followed by fully connected layers. If you want intuition on each of these layers, you can check out my series on convolutional nets, but you don't need all of that to follow what I'm about to do here. Instead of thinking of these layers as some complex transformation, let's do the same thing we did before with the recurrent neural nets and the vanilla neural networks, and think of each of these transformations as a mathematical function. At each layer, we are just chaining functions, performing transformations on the same input. So a convolution layer's output is a condensed matrix, or tensor, representation of the image, and the fully connected layer's output is a dense vector representation of the input image. This holds true for any network with a sequential flow of information, that is, where all information passes through each layer in turn.
So we can pass an image into a CNN to get a dense vector representation of the image, then pass this dense vector to an RNN to generate a sequence: the sentence or phrase that describes the image. Nice, our architecture is now taking some form.
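To make that pipeline concrete, here's a minimal sketch of the encoder half in PyTorch; this is my choice of framework and my own illustrative code, not anything from a specific paper. It chops the classifier head off a standard CNN and keeps the dense vector that comes out just before it. All the names and dimensions here are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """CNN 'show' step: image -> dense vector representation."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights=None)  # pretrained weights optional
        # Drop the final classification layer; keep conv/pool layers only.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.feature_dim = resnet.fc.in_features  # 512 for resnet18

    def forward(self, images):          # images: (B, 3, H, W)
        feats = self.backbone(images)   # (B, 512, 1, 1) after global pooling
        return feats.flatten(1)         # (B, 512) dense vector

encoder = ImageEncoder()
image = torch.randn(1, 3, 224, 224)     # dummy image
v = encoder(image)                      # (1, 512)
```

That vector v is exactly the "dense vector representation" we've been talking about; in the decoder sketch further down, it seeds the RNN's initial state so the network "knows" the image before it emits the first word.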
But there are some tweaks we can still make. For example, we can take into account the meanings, the semantics, of the captions instead of just treating them as raw numbers. So how do we do this? Our recurrent neural nets are typically trained using a mechanism called teacher forcing: the correct label from the previous time step, rather than the model's own prediction, is fed in when training the next state. This is done to keep the backpropagation-through-time algorithm from becoming super expensive. (In case you're wondering, the truncated BPTT algorithm is a standard method of training recurrent neural networks.) The output in our case is the words of the caption, and each word can be represented by a one-hot encoded vector. In plain teacher forcing, we would feed the one-hot vector of the previous word in the caption directly into the next iteration. But let's be smarter about this. Instead of simply feeding the raw word at the next iteration, we can learn a set of word embeddings, WE. WE is a set of word embedding vectors that captures the meaning of each word and its closeness to other words in meaning and semantics. If you are working with specific types of data, it's good to learn these embeddings simultaneously while training your LSTM network. So instead of feeding some word s_t at the t-th iteration, we feed the vector WE s_t, which incorporates the meaning of the word. In this way, the image captioner has knowledge of language while generating captions.
So that's awesome. We now have an architecture that converts an image to a caption using neural networks, end to end. But you may be thinking: this is great and all, but what is the state-of-the-art image captioner? What is the forefront of AI research? And my answer is: well, you're looking at it. The architecture we just built is the state of the art, and it's the basis for the Show and Tell paper on image captioning. We show an image to the CNN part, and then the RNN part of the architecture generates captions to tell us what the image is about. Hence, Show and Tell. If you understood everything I just said, you've understood Show and Tell. So congrats.
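Here's how the "tell" half might look, again as a hedged PyTorch sketch rather than the paper's actual code: a learned embedding matrix WE plus an LSTM trained with teacher forcing, where at step t we embed the ground-truth word s_t instead of feeding its one-hot vector (or the model's own guess). The dimensions and the 512-dim image vector are assumptions carried over from the encoder sketch above.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """RNN 'tell' step with learned word embeddings and teacher forcing."""
    def __init__(self, vocab_size, feature_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # the matrix WE
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # image vector -> h0
        self.init_c = nn.Linear(feature_dim, hidden_dim)  # image vector -> c0

    def forward(self, image_vec, captions):
        # captions: (B, T) ground-truth word indices, used for teacher forcing
        h, c = self.init_h(image_vec), self.init_c(image_vec)
        logits = []
        for t in range(captions.size(1) - 1):
            # Teacher forcing: embed the *correct* word s_t (i.e. WE s_t)
            # instead of whatever the model predicted at the last step.
            x = self.embed(captions[:, t])        # (B, embed_dim)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))             # predict word t+1
        return torch.stack(logits, dim=1)          # (B, T-1, vocab_size)
```

At inference time there is no ground truth to force, so you would feed back the embedding of each predicted word instead.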
But can we go beyond this? Let's throw attention into the mix, just because we can. Attention involves focusing on certain parts of an image while generating the different words of the caption, which can help create more detailed descriptions of an image. So how do we do this? Let's come up with the architecture for attention intuitively. In our previous architecture, we took the dense vector representation of the image from the FC layer of the CNN. But attention involves looking at different spatial regions of an image, so it makes sense to use a tensor representation of the image that preserves spatial features. To do this, we can take the output of any of the convolution layers. Remember, convolution, activation, and pooling are just mathematical functions applied to an input, so the output of any of these operations is a tensor representation of the input itself. This tensor gives us L regions, where the i-th region of the image is represented by a vector a_i.
There are two types of attention that we can perform: soft attention and hard attention. Soft attention constructs a word by considering multiple parts of an image, each to a different degree. Here z_t is the context vector used to generate the t-th word: think of it as the parts of the image to concentrate on while generating that specific word. Each alpha_ti is a strength, or probability, value between 0 and 1; its magnitude is how much the image captioner should focus on region i to generate the t-th word, and the context vector is the weighted sum of the region vectors, z_t = sum over i of alpha_ti a_i. The other type of attention is hard attention. Instead of a value that determines how much of each image part to consider, each part of the image is either completely considered or completely disregarded while generating a specific word: s_ti takes a binary value of 0 or 1, and if it's 1, the i-th region is considered while constructing the t-th word; otherwise, it isn't.
Now, let's get back to our architecture. We changed the Show and Tell architecture by taking a convolution output instead of the FC output. But we need a vector input to our RNN, so we extract a context vector z using our attention mechanism. In this way, we can generate words while considering different parts of an image. The architecture we have now is the Show, Attend and Tell architecture, and it's at the forefront of recent research on visual attention. We show the image to the CNN, attend to specific regions of the image, and then tell the caption using the RNN. Show, Attend and Tell. It's as simple as that.
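Here's what the soft attention step could look like, sketched with a common additive-attention formulation; the layer names, dimensions, and the 14x14 region grid are my assumptions, not the paper's code. Given the L region vectors a_i and the decoder's hidden state, we score each region, softmax the scores into weights alpha_ti that sum to 1, and take the weighted sum z_t = sum_i alpha_ti a_i as the context vector.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """z_t = sum_i alpha_ti * a_i, with alpha_t = softmax(scores)."""
    def __init__(self, feature_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.w_a = nn.Linear(feature_dim, attn_dim)  # project each region a_i
        self.w_h = nn.Linear(hidden_dim, attn_dim)   # project hidden state
        self.v = nn.Linear(attn_dim, 1)              # scalar score per region

    def forward(self, regions, h):
        # regions: (B, L, feature_dim) -- the L spatial regions a_1..a_L
        # h:       (B, hidden_dim)     -- decoder state before word t
        scores = self.v(torch.tanh(self.w_a(regions) + self.w_h(h).unsqueeze(1)))
        alphas = torch.softmax(scores, dim=1)        # (B, L, 1), sums to 1 over L
        z = (alphas * regions).sum(dim=1)            # context vector z_t
        return z, alphas.squeeze(-1)

# Hard attention would instead sample one region, i ~ Categorical(alphas),
# so each s_ti is 0 or 1 and exactly one region is used per word.
attn = SoftAttention()
a = torch.randn(2, 196, 512)   # e.g. a 14x14 conv output -> L = 196 regions
h = torch.randn(2, 512)
z, alphas = attn(a, h)         # z: (2, 512), alphas: (2, 196)
```

The context vector z then plays the role the FC-layer vector played before: it's the vector input to the RNN at each step, only now it changes from word to word.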
Now let's take a look at some actual code. This here is the show part, where we show our image to the convolutional neural network, consisting of a set of convolution, activation, and pooling layers. The attend and tell parts live in our caption generator class. It's here that we build the LSTM model, taking the word embeddings as input and, for every word, producing the set of alphas and the context vector. The method build_sampler allows us to generate the caption itself. The results are pretty slick: the image captioner is able to generate meaningful captions for the input images. Note that these results are for soft attention, but the model can easily be modified for hard attention. If you want to know more about soft and hard attention, I've made a video on visual attention from a different perspective, so check that out.
I hope this video gave you an intuition for how to think about neural networks, allowing you to create them from scratch. AI researchers are humans, too; they just happen to get the ideas before anyone else. Thinking about neural networks mathematically really does help, and I hope you can see why. Thanks for watching, and if you liked the video, hit that like button, hit that subscribe button, share the video with friends, family, acquaintances, your next-door neighbor perhaps, and I look forward to your support.