Hello everyone and welcome to another episode of CodeEmporium, where we're going to talk about the decoder part of the Transformer neural network as we continue our discussion of translation. I have the decoder logic over here, about 150 lines, and I'm going to give as much context as I can to each line. All of this goes back to the Transformer architecture, where we have an encoder and a decoder and we're trying to translate from one language to another. For this video the exact language pair doesn't matter, but think of it as English to Kannada, or any other language you'd like. We pass English sentences through the encoder and get contextual representations in the form of vectors, sets of numbers that represent all the English words of the sentence, and we pass those to the decoder. The decoder is then trained to predict one word of the output language at a time in order to perform the translation. So for this video, we're going to walk through the entire decoder code, and wherever I can I'll draw parallels to the hand-drawn decoder architecture that I blew up in the last video, so that we get a better sense of the shapes we see and the code used to actually create all of these tensors. So let's get to it. First, we define d_model, which is the size of all the vectors internally within the entire transformer network, and I define it to be 512. That means each word is represented by a 512-dimensional vector. num_heads is the number of attention heads we want when performing multi-headed self-attention as well as cross-attention. drop_prob is the parameter we use for dropout, which is the random turning off of neurons; this helps the network learn along different paths and generalize better. batch_size lets the computer process multiple sentences together all at once, which enables faster training. With plain stochastic gradient descent we would pass one input through the network, generate an output, compute a loss, perform backpropagation, and update millions of parameters, all after seeing a single example, and we'd repeat that for every single example. With mini-batch gradient descent, which I've set to a batch size of 30 here, we pass 30 sentences in at once and only then perform an update. The updates happen less frequently, but they're also much smoother and less jagged. Next is max_sequence_length, the maximum number of tokens a sentence can have; in this case, tokens are words. I've set this to 200, meaning the longest English sentence we can possibly pass in is 200 words, and I've set the same value for the output Kannada sentence, though the two could be different.
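To keep the shapes concrete, here is a minimal sketch of those hyperparameters, using the variable names from the walkthrough (the exact dropout probability isn't stated here, so 0.1 is just a placeholder):

```python
d_model = 512              # size of every vector inside the transformer
num_heads = 8              # attention heads for self- and cross-attention
drop_prob = 0.1            # dropout probability (placeholder value)
batch_size = 30            # sentences processed together per gradient update
max_sequence_length = 200  # maximum number of tokens (words) per sentence
ffn_hidden = 2048          # hidden size of the feed-forward layers
num_layers = 5             # how many decoder layers are stacked
```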
Next is ffn_hidden. Throughout the network we have linear layers, and whenever we do, I want the number of neurons in the hidden layer of that feed-forward block to be 2048. This is just a hyperparameter I took from the paper "Attention Is All You Need", but you can set it to whatever value you want; it's just for the propagation of information. Next is num_layers, the number of encoder layers and decoder layers we stack. If we go to this diagram, this is the N, and I'm setting it to five for both, since they're the same here. So we pass some inputs to the encoder, but we have a cascade of these encoders one after the other: imagine five of them stacked on top of each other. Inputs go in, pass through all five encoders, and the output goes to the decoder over here. Meanwhile, we also have the Kannada sentence, the output language, which in this case is an input to the network; it goes through the decoder, passing through five decoder layers, before the network eventually generates output probabilities and hence predicts the next word. Why do we have a number of cascaded layers here? To deal with the complexity that language has. The more complex the patterns in the language, the more intricate you want your architecture to be so it can capture those patterns well, and that's why we use a multi-encoder-layer, multi-decoder-layer transformer architecture here. Now, x is the input the decoder sees from the encoder side; it comes from the English sentence. Imagine we pass some inputs in, add positional encodings, and push them through the encoder; every single vector is now context-aware, and those vectors are what x constitutes, and that's what we pass in here. I've also generated a similar tensor y, which is batch_size by max_sequence_length by d_model. This is also a case where every single word has been encoded into a 512-dimensional vector, but this is merely the input side of the decoder, so y is simply this stage right over here. At this point I've just generated random values, but they would actually come from some dataset that we've transformed; that's a little too much code for this video, but I'll show it in subsequent videos for sure. For now, just to get used to the entire flow, I've introduced it here. Now, the mask I'm creating over here is a look-ahead mask. This is required because during training we pass all the words of the target sentence at the same time, but when predicting a given word the model shouldn't be able to see the words that come after it. So we mask some positions to prevent ourselves from cheating and looking ahead. If you're curious how that mask looks, it's a 200 by 200 tensor where zeros mean "don't mask" and negative infinities mean "mask". The first word is allowed to see only itself, the second word is allowed to see itself and the word before it, the third word is only allowed to see what comes before it, and so on until the very end.
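Here's a minimal sketch of how a look-ahead mask like that can be built (the exact construction in the notebook may differ, but the idea is the same: zeros let a position attend, negative infinity blocks it; max_sequence_length is reused from the sketch above):

```python
import torch

# (max_sequence_length, max_sequence_length) look-ahead mask
mask = torch.full([max_sequence_length, max_sequence_length], float('-inf'))
mask = torch.triu(mask, diagonal=1)  # -inf strictly above the diagonal, 0 elsewhere

print(mask[:3, :3])
# tensor([[0., -inf, -inf],    <- word 1 sees only itself
#         [0.,   0., -inf],    <- word 2 sees itself and the word before it
#         [0.,   0.,   0.]])   <- word 3 sees everything before it
```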
So this kind of prevents cheating from happening, and hence we avoid a situation where the decoder performs well at training time but terribly at test time because of data leakage. We then pass all of these values into the constructor of the decoder class, which we'll take a look at, and we call the model's forward pass. For the forward pass I've added a lot of print statements, so you can see exactly what the dimension of every single tensor is; I'll provide this code in the description down below, but in this video I'm also going to explain why these dimensions are the way they are. I've pasted all of the decoder code right here in the Colab notebook, so this code over here and this code are the same, and I'll just be using whatever is in the decoder Colab notebook. Let's start with the Decoder class. I've defined it to have a constructor, which takes in all the parameters we just mentioned, as well as a forward pass. I'm extending torch.nn.Module because nn.Module provides a lot of the underlying boilerplate required for creating neural networks: it handles backpropagation, it handles memory management, it manages whether tensors live on a GPU during training versus on a normal CPU machine during testing, and it facilitates model forward passes too. Because of all of those niceties, I extend nn.Module here, and because of that we can define a forward pass and call the module directly as if it were a function, which invokes the forward method. When we call the forward pass, we have the English sentence, the Kannada sentence, and the mask. Let's write out the shapes of these values so we can walk through the execution. x is the English-side tensor, which is 30 by max_sequence_length, which is 200, by 512. The same goes for y: it's also batch_size, which is 30 sentences, by 200, the max sequence length, by 512. The mask itself, at least for this simplistic case, is 200 by 200, which is max_sequence_length by max_sequence_length. Now we pass these into self.layers. layers is a sequence of decoder layers: if you look at this figure, this here is one decoder layer, but it's repeated N times. So if this is a class called DecoderLayer and we repeat it N times, that is essentially what self.layers is. Typically you would use torch.nn.Sequential, which looks like this right over here, but here you can see I'm using SequentialDecoder. That's mostly because, when I call the forward pass, plain nn.Sequential can't take more than one argument, and I have more than one argument to pass, for example the mask. That's why I've implemented my own SequentialDecoder, which extends Sequential.
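Here's a sketch of what that SequentialDecoder might look like, assuming each decoder layer's forward takes (x, y, mask) and returns an updated y:

```python
import torch.nn as nn

class SequentialDecoder(nn.Sequential):
    """Like nn.Sequential, but passes (x, y, mask) into every decoder layer,
    feeding the updated y back in while x (the encoder output) stays fixed."""
    def forward(self, *inputs):
        x, y, mask = inputs
        for module in self._modules.values():
            y = module(x, y, mask)  # each layer returns a new, more context-aware y
        return y
```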
What it does is take the inputs and, when calling each decoder layer, pass in all of these arguments; I get the new value of the Kannada output sentence every single time and keep feeding it back into the next layer, while the same value of x, the English sentence coming from the encoder, is passed in unchanged regardless of which decoder layer we're on. So SequentialDecoder is effectively a list of these decoder layers. Let's take a look at what each decoder layer looks like. If you scroll up here, you can see DecoderLayer also extends nn.Module, and hence it also has a constructor and a forward method that we can call directly. Let's again walk through the forward pass, and I'll describe each piece as and when it appears. First, we assign y to this variable _y, which is going to be used for residual addition. In the original paper you can see these arrows over here; they are skip connections, or residual connections, and you typically see them in very deep neural networks. The deeper the network, the smaller the gradients become as you backpropagate them, and eventually they might vanish towards the beginning of the network. If gradients vanish, the network no longer learns. So in order to propagate a stronger signal we use skip connections: they carry the input signal much better in the forward direction, and they carry the derivative of the loss with respect to the inputs much better in the backward direction as well, so there's always some gradient update even for the earlier layers of the network. That's why this architecture uses those skip connections. Next we perform the masked self-attention, where we take the Kannada sentence along with the decoder mask, the same 200 by 200 mask I showed before. This self_attention module, as you can see over here, is multi-head attention, so let's go over to that module, which is right over here. In multi-head attention, because this is the Kannada sentence, the input is 30 by 200 by 512: that's batch size, max sequence length, and d_model. Now we're going to create query, key, and value. This qkv_layer, as you can see, is a linear layer that maps the 512 dimensions to three times 512 dimensions, which is 1,536. Looking at the decoder architecture, it's like this: we have batch size by max sequence length by 512, and we create query, key, and value vectors for each and every single word, and since one word now becomes three vectors, that's why we see 1,536. The query vector for a word is "what am I looking for?". The key and value vectors you can consider as a form of memory, or "what do I have to offer?"; it's the information the word already knows.
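As a small sketch, that projection step looks roughly like this (reusing d_model, batch_size, and max_sequence_length from the earlier sketch; in the real module, qkv_layer would live in the multi-head attention constructor):

```python
import torch
import torch.nn as nn

qkv_layer = nn.Linear(d_model, 3 * d_model)  # 512 -> 1536: query, key, value stacked

y = torch.randn(batch_size, max_sequence_length, d_model)  # (30, 200, 512) Kannada stream
qkv = qkv_layer(y)                                         # (30, 200, 1536)
```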
So coming back to the code, qkv is going to be 30 by 200 by 1,536, since we've created the query, key, and value tensors that will be used for attention. But we also reshape this, because we want to perform multi-headed self-attention, so we're distributing this across eight heads. Because of that reshape, I'm quite literally going to write it out as 30 by 200 by the number of heads, which is eight, by 192; 1,536 divided across eight heads is 192 per head, and 192 times eight gets you back to 1,536. So we've created eight attention heads. If we look at our decoder, the reason we do this is that these are essentially eight parallel processes, one for each of these rectangles; there are eight rectangles over here, each one performs its own attention, and in the end we concatenate them together. So coming back again to the code, this is 30 by 200 by eight by 192, and we then permute it so the head dimension comes second: 30 by eight by 200 by 192. Next we chunk this up along the last dimension into three parts, so that's 30 by eight by 200 by 64, since 192 divided by three is 64. That's q, but the key and value tensors have the same shape, so I can say this is the key as well, and this is the value tensor here too. Now we perform scaled dot-product attention. This is the crux of the attention mechanism, and we pass in each of the query, key, and value tensors along with the decoder mask. Let's copy these shapes over, because we'll want to track the sizes for everything, and go over to scaled dot-product attention, which is right here; keep in mind the mask was 200 by 200. d_k here is just a constant that we use in a division, and it's required to scale the matrix multiplication of Q times K. We need this because we want all of these values to end up with roughly mean zero and standard deviation one, and this scaling factor helps us accomplish that. In fact, we can see this in action: I'm in my GitHub repository right now, where you can check out all the code for this video and also the past videos I've created on transformers. If we take a query tensor and a key tensor, each with variance close to one, and matrix-multiply them, the variance of the product goes off the charts, five to six times greater in that demo. However, if we scale those values, the variance becomes much more tractable. When we perform the multiplication here, we transpose the last and the penultimate dimensions of the key.
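Continuing from the projection sketch above, here's that head-splitting step, mirroring the shapes just described:

```python
head_dim = d_model // num_heads  # 512 / 8 = 64

qkv = qkv.reshape(batch_size, max_sequence_length, num_heads, 3 * head_dim)  # (30, 200, 8, 192)
qkv = qkv.permute(0, 2, 1, 3)                                                # (30, 8, 200, 192)
q, k, v = qkv.chunk(3, dim=-1)                                               # each (30, 8, 200, 64)
```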
That means that if this is the key tensor, we transpose just its last two dimensions; we can't simply do k.T, because that would transpose the entire thing and give a 64 by 200 by eight by 30 tensor, which is not what we need. We just want to swap those last two dimensions. So a matrix multiplication of query and key is 30 by eight by (200 by 64 times 64 by 200), which gives a 30 by eight by 200 by 200 tensor. In this case it's also scaled, but that's just a division; it doesn't change any shape. This is the initial form of how the attention matrix is going to look. Next, if the mask is not None, and in this case it is not None, we add this mask: we're adding a 200 by 200 mask to this tensor. Even though it's a different shape, PyTorch supports something called broadcasting; since the last dimensions here match the last dimensions over there, it's still a valid addition, and it just applies the same exact mask to every one of these elements. There are 30 times eight, which is 240, of them, so for every batch and for every head we apply the same exact mask, and the masked scores are still 30 by eight by 200 by 200 once the operation is done. Just note that in this case this is a look-ahead mask, but we could have also folded padding into this same mask, and we would still have ended up with the same dimensional tensor once the entire operation was said and done. Next we perform softmax. Softmax is just taking exponentials and dividing by the sum of exponentials, which does not change the dimension of anything here; the attention matrix keeps the same shape, the softmax is applied along the last dimension, and every single row now adds up to one. The values are then a matrix multiplication of the attention matrix and the value tensor, which ends up with a shape of 30 by eight by 200 by 64, and we return both of these. In the self-attention case, the attention matrix is how much attention every word pays to every other word in the same sentence, itself included, whereas the values are the actual context-aware final tensors for every word: for every batch, for every head, for every word, there's a 64-dimensional vector that represents the context of that word in that head. Eventually, when this function returns, control comes back all the way to multi-head attention over here, and we have the values tensor. We then reshape it so that we flatten the heads dimension together with this 64, giving 30 by 200, the max sequence length, by eight times 64, which is 512. So effectively we are concatenating whatever we found from each and every one of the eight heads. Then this output passes through a linear layer.
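To recap, here's the whole scaled dot-product attention function we just stepped through, as a minimal sketch (the notebook's version may differ slightly in details like how the mask is applied):

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim), e.g. (30, 8, 200, 64)
    d_k = q.size(-1)
    scaled = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)  # (30, 8, 200, 200), variance pulled back toward 1
    if mask is not None:
        scaled = scaled + mask                           # (200, 200) mask broadcasts over batch and heads
    attention = F.softmax(scaled, dim=-1)                # each row sums to 1
    values = attention @ v                               # (30, 8, 200, 64)
    return values, attention
```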
Coming back to that output linear layer: it doesn't even change the dimensions. It's a feed-forward layer that maps a 512-dimensional vector to another 512-dimensional vector, so we have the same shape over here, and it's this that is passed out. Interestingly, because this multi-head attention has the same input shape and output shape, it can be a repeated layer without disturbing anything; you won't be getting shape-mismatch errors here. So once that multi-head attention is complete, control transfers back: self-attention is done and the output has this shape. Next we perform dropout. Dropout, as I mentioned before, is the random turning off of neurons so that the network learns along different, more generalized paths and hence doesn't memorize. It's the same shape; it's just an operation. Next is add and layer normalization. We add what we computed here to _y from up above, which has the same shape, and then we just normalize with layer normalization. So let's go to norm1. norm1 is layer normalization, and we pass it a parameter called parameters_shape with a value of 512. Let's actually look at layer normalization for a bit. We have this dims variable, which determines along which dimensions we actually want to perform layer normalization. The entire goal of layer normalization is to constrain the values of every layer to have roughly mean zero and standard deviation one; this helps stabilize training, so the jumps that happen at every training phase and gradient update are pretty stable and not too erratic. Here dims takes the value of just the last dimension, so it's a list containing negative one. Next we compute the mean along that last dimension. The inputs, just for clarity, are of this shape, 30 by 200 by 512, exactly what we're passing in. So we take the mean over the 512-dimensional vector that represents a single word, and we say keepdim=True, because if you just take the mean and run this operation it returns a 30 by 200 dimensional tensor, but we want a 30 by 200 by 1 tensor so the shapes line up for the operations that come next. One step at a time: we take the inputs of this shape and subtract the corresponding mean; every word has a single mean, and we subtract that same mean value from each of its components, so we end up with the same dimensions. The variance has the same shape too, and the standard deviation is simply the square root of that variance, which also doesn't change the shape, and hence the output doesn't really change shape at all.
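Here's a minimal sketch of that layer normalization module, assuming a small eps term added to the variance for numerical stability:

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self, parameters_shape=[512], eps=1e-5):
        super().__init__()
        self.eps = eps                                             # assumed small constant for stability
        self.gamma = nn.Parameter(torch.ones(parameters_shape))    # learnable scale, initialized to 1
        self.beta = nn.Parameter(torch.zeros(parameters_shape))    # learnable shift, initialized to 0

    def forward(self, inputs):                       # inputs: (30, 200, 512)
        mean = inputs.mean(dim=-1, keepdim=True)     # (30, 200, 1)
        var = ((inputs - mean) ** 2).mean(dim=-1, keepdim=True)
        std = (var + self.eps).sqrt()
        y = (inputs - mean) / std                    # still (30, 200, 512)
        return self.gamma * y + self.beta
```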
The only difference is that this y now has much more normalized values, with mean zero and standard deviation one, which allows more stable training. But you'll also see some gammas and betas here. Gamma and beta are a set of learnable parameters; in this case there are 512 of each, which is the parameters_shape. The gammas are all initialized to one and the betas to zero, so there are essentially 512 gammas and 512 betas that we'll be learning over time. The idea behind these learnable parameters is similar to batch normalization, which also has gammas and betas: they let the network learn a scale and shift that the normalization itself would otherwise wash out. So even though we do this normalization along the 512 dimension, gamma and beta, the same 512 values applied to every batch element and every word, keep the outputs comparable across examples. What we do is multiply gamma with y, element by element, and then add beta. Gamma effectively learns to act like a standard deviation across multiple examples, and beta learns to act like a mean across multiple examples, so that this out tensor is better comparable across those examples. After performing this operation we still end up with the same shape: same input shape, same output shape. So when we return control from layer normalization, we go back to the decoder layer, and the output of layer normalization is the same shape. We then perform the same kind of operation here, but instead of masked self-attention we now perform cross-attention, which is slightly different. In this cross-attention case I'm not passing in any mask; this is part of the difference between self-attention and cross-attention, so let's go over to it so it becomes more apparent why we're not passing a mask for now. This is multi-headed cross-attention, and you'll see the code is very similar to what we saw for self-attention. The main difference is that instead of one qkv_layer, we have a q_layer that's separate from a kv_layer. The Kannada words flowing through the decoder are going to form the query vectors, while every English word we get as the output of the encoder is converted into key vectors and value vectors. For self-attention, query, key, and value all came from the same Kannada words. But for cross-attention, if we go all the way over here, the keys and values come from the English encoder output, one key and one value per English word, while on the Kannada side we encode everything just into a query vector.
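In code, that difference is just two separate projections instead of one, roughly like this (q_layer and kv_layer are the names from the walkthrough; x is the encoder output, y the decoder stream):

```python
import torch.nn as nn

kv_layer = nn.Linear(d_model, 2 * d_model)  # encoder output -> keys and values: 512 -> 1024
q_layer = nn.Linear(d_model, d_model)       # decoder stream -> queries: 512 -> 512

kv = kv_layer(x)  # (30, 200, 512) -> (30, 200, 1024)
q = q_layer(y)    # (30, 200, 512) -> (30, 200, 512)
```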
So while in self-attention every word is encoded into a query, a key, and a value, in cross-attention the query comes from the decoder and the key and value come from the encoder, and then we perform very similar operations, with no mask in this case. That's why, coming back here, the inputs x and y are both the same dimension, 30 by 200 by 512. After passing through the kv_layer, every encoder word is converted into a key and a value, but they're essentially stacked on top of each other in one linear layer, so we map 512 to 1,024 dimensions; because every word now has both a key and a value from the encoder side, this tensor is 30 by 200 by 1,024. For the query side, it's 30 by 200 by 512. We then reshape so we can perform multi-headed cross-attention: for the key and value, instead of 1,024 as one chunk, it becomes eight heads times two times the head dimension of 64, which is 128, so 30 by 200 by eight by 128. We also reshape our query so it's multi-headed: the same thing, but with just the head dimension itself, which is 64, so 30 by 200 by eight by 64. Then we permute the key-value tensor, swapping the second and third positions, which gives 30 by eight by 200 by 128, and the same goes for the query, 30 by eight by 200 by 64. Then we chunk the key-value tensor into two parts along the last dimension, so the key is 30 by eight by 200 by 64, and the value is the same. So we have query, key, and value tensors of the same shape, and we pass them through the scaled dot-product attention again. If we go back up there, we end up with the same shapes: 30 by eight by 200 by 64 for each. In this case we don't have a mask, but all the operation shapes remain exactly the same; even though the masking branch doesn't execute, the scaled tensor keeps its original shape anyway, so it doesn't matter. What I hope is clear here is that for the cross-attention part of the decoder we don't actually need to execute that masking step at all, unless you also pass in a padding mask so you can ignore padding tokens, which is required because some sentences won't be as long as the maximum sequence length of 200. If you don't do that, that section simply isn't executed, the shapes are preserved, and that's exactly what we want. So we get the output values tensor, and now we concatenate all the heads together as we did previously: eight times 64 is 512, so we just reshape to that.
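Putting those cross-attention steps together in one sketch (reusing the scaled_dot_product_attention sketch from earlier; the notebook's concatenation step may be written slightly differently):

```python
kv = kv.reshape(batch_size, max_sequence_length, num_heads, 2 * head_dim)  # (30, 200, 8, 128)
q = q.reshape(batch_size, max_sequence_length, num_heads, head_dim)        # (30, 200, 8, 64)
kv = kv.permute(0, 2, 1, 3)                                                # (30, 8, 200, 128)
q = q.permute(0, 2, 1, 3)                                                  # (30, 8, 200, 64)
k, v = kv.chunk(2, dim=-1)                                                 # each (30, 8, 200, 64)

values, attention = scaled_dot_product_attention(q, k, v, mask=None)       # no look-ahead mask here
values = values.permute(0, 2, 1, 3).reshape(
    batch_size, max_sequence_length, num_heads * head_dim)                 # heads concatenated: (30, 200, 512)
```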
This then passes through a linear layer, which is not going to resize anything; it just mixes the information from the concatenated heads. So we actually now have an out tensor that is 30 by 200 by 512, which is exactly what we put in. This is multi-head cross-attention, so let's go back to the decoder layer, where the output y is still the same shape. Next we perform the same procedures as before: dropout, which doesn't change the shape, it just randomly turns off neurons, and then residual addition with layer normalization, which does not change the shape either. Next we pass it through a feed-forward block. This feed-forward network, ffn, I've defined as a positionwise feed-forward module; let's go to that quickly, it's actually quite a simple module, so it shouldn't take too long. We take the input x, which has the shape we've been carrying, and pass it through a linear layer that maps it to the hidden dimension, 2,048. Then we pass it through ReLU, which doesn't change the shape; it's an activation function that lets the network better model the complexity and patterns in your data. ReLU does have some issues, like the dying ReLU problem, because any activation value below zero is completely shut off; that's just the nature of the function, and so other activations like ELU, GELU, and Leaky ReLU are also used, which you can experiment with too. But for now we have ReLU, and that doesn't change the shape. Dropout doesn't change the shape, and then linear2 is just the inverse of the first linear layer, mapping back to d_model, so let's write that out: 512 dimensions, and this is returned. So at the end of the day, when we do our positionwise feed-forward network, we end up with a 512-dimensional output, the same as everything we've seen before. Next we perform dropout, which as we said changes no shape, layer normalization changes no shape, and we return the value we get at the output of one decoder layer. This output has exactly the same dimensions as the input to the layer itself. So we take this value, and this is one decoder layer; this module executes once, and now we have this y, which has the shape we've been tracking, and we just input that same y back in. This runs num_layers times, which we defined as five, so it executes five times: the new tensor, which is slightly more context-aware of the Kannada output language, is passed back into the module, and we keep updating y, the target tensor for the Kannada words. Once all of that is complete, control passes back over here, and the output is, well, the same shape, which is good: the input now matches the output. From this output, which I haven't coded out here, it would essentially be mapped through a linear layer and a softmax, and from that we can actually compute what the next word could be.
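Since the feed-forward block is small, here's a minimal sketch of it, assuming the layer names used in the walkthrough (the dropout probability default is again just a placeholder):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """512 -> 2048 -> 512, applied independently at every position."""
    def __init__(self, d_model=512, hidden=2048, drop_prob=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, hidden)
        self.linear2 = nn.Linear(hidden, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(drop_prob)

    def forward(self, x):                 # x: (30, 200, 512)
        x = self.relu(self.linear1(x))    # (30, 200, 2048)
        x = self.dropout(x)               # shape unchanged, neurons randomly zeroed
        return self.linear2(x)            # back to (30, 200, 512)
```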
Everything I've discussed here, I've printed out with a lot of print statements, so you can see it all at the end; you'll see that the output shape is exactly 30 by 200 by 512. And to see where we are in the conceptual architecture, we're essentially at this stage over here: batch size by max sequence length by 512, which is exactly what we see, 30 by 200 by 512, where we can now map each of these to a human-interpretable Kannada word and perform a softmax to determine which word is supposed to come next. These pieces at the end over here, and also the pieces at the very beginning, I have not coded out in this video, but I'm gradually piecing all of these parts together. So in the next video you're going to see more of this transformer built out. I'm going to make another video on the entire architecture, encoder and decoder all together in just one session, and I'm also going to code out the entire transformer neural network architecture plus some training and inference code too. So that's all I have for this video, and those are some of my future plans. If you like what I do, please do consider supporting me with a like and a subscribe; that would be super helpful, especially if you think I deserve it. You can also check out the description for the code related to this video and all the other past videos of this transformer neural network playlist. The playlist is called Building Transformers from Scratch; there are about 10 videos at the time of uploading this one, but there will be so much more to come every single week. Also look forward to daily shorts. Thank you all so much for watching, I hope you have a wonderful day, and I hope you're learning something new. If you learned something, please do comment down below, and until then, I will see you next time. Bye-bye.