Hello everyone and welcome to another episode of Code Emporium, where we're going to talk about the full transformer architecture code. It's about 300 lines of code. In past videos we've already taken a look at the encoder, with its architecture and code, and the decoder, also with its architecture and code. So this video is just going to marry the two together, and I'll only go into details that I didn't already explain in those videos. It's probably going to be a short one, but let's get to it. As a reminder, we are currently in the process of building a translator from English to Kannada, which is a language spoken in a specific state in India. The way we generate the translation for an English sentence is one Kannada character at a time, not one word at a time or one byte-pair-encoded token at a time, as you would see in some other implementations. So you pass English sentences in here, and for each character we get a vector that encodes some contextual meaning. We take these vectors and pass them into the decoder architecture along with the Kannada vectors, which are shifted right by one position: there's an additional start token preceding the sentence. During training, the expected output is the entire Kannada sentence without the start token. This setup lets us train the model in parallel, which makes training much faster. At inference time, though, we start with just the start token, pad the rest, and pass that through the entire decoder architecture to generate the first Kannada character. That character is then appended to the decoder input in order to generate the second character, and we repeat this until we get the end-of-sentence token, producing the full Kannada sentence one character at a time.
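The inference loop described above can be sketched in a few lines of plain Python. This is a minimal, hypothetical sketch: `predict_next` stands in for the full transformer forward pass (here it's just any callable that maps the source sentence plus the partial translation to the next output character), and the token names are placeholders, not the ones used in the actual file.

```python
# Hypothetical placeholder tokens; the real code defines its own start/end/pad tokens.
START, END = "<START>", "<END>"

def greedy_translate(predict_next, src, max_len=200):
    """Generate the translation one character at a time until END or max_len."""
    out = [START]
    for _ in range(max_len):
        nxt = predict_next(src, out)  # stand-in for a transformer forward pass
        if nxt == END:
            break
        out.append(nxt)
    return "".join(out[1:])  # drop the start token
```

For example, with a stub `predict_next` that just copies the source character by character, `greedy_translate(stub, "abc")` returns `"abc"` after emitting the end token.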
Now, with that context out of the way, we can actually look at the code. This here is the Transformer class, which extends the torch module. We have d_model, the dimensionality of every single character vector; in this case we take it as 512. ffn_hidden is the hidden size of the feed-forward layers that we see throughout the network, which is 2048. num_heads is the number of heads in the multi-head self-attention and also the multi-head cross-attention. drop_prob is the dropout probability, the probability that we switch off neurons to promote generalization through the network; this is typically 0.1. num_layers is the number of encoder layers as well as the number of decoder layers, and we can set this to something like five. max_sequence_length is the maximum number of characters in a given input sentence, which in our case we set to about 200. kn_vocab_size is the number of all possible characters that can occur in a Kannada sentence or translation. Then we have english_to_index, a Python dictionary that maps a character to a unique number, and the same thing for Kannada. Finally we have a start token, an end token, and a padding token. During the transformer's forward pass, we take in x, which is the batch of English sentences, and y, which is the batch of target Kannada sentences. Then we have a few masks: the encoder self-attention mask, which incorporates the padding mask; the decoder self-attention mask, which incorporates both the padding mask and the look-ahead mask; and the decoder cross-attention mask, which also incorporates the padding mask.
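The character-to-index mapping that those dictionaries support can be sketched as follows. This is a simplified, hypothetical version of the tokenization step (the real file builds the index dictionaries from the training vocabulary and works on whole batches), just to make the start/end/padding behavior concrete.

```python
# Hypothetical placeholder tokens; the real code defines its own.
START, END, PAD = "<START>", "<END>", "<PAD>"

def tokenize(sentence, to_index, max_len, add_start=False, add_end=False):
    """Map each character to its index, optionally add start/end, then pad to max_len."""
    tokens = []
    if add_start:
        tokens.append(to_index[START])
    tokens.extend(to_index[ch] for ch in sentence)
    if add_end:
        tokens.append(to_index[END])
    tokens.extend([to_index[PAD]] * (max_len - len(tokens)))  # pad to fixed length
    return tokens
```

With a toy vocabulary `{START: 0, END: 1, PAD: 2, "h": 3, "i": 4}`, the decoder input `tokenize("hi", v, 6, add_start=True)` gives `[0, 3, 4, 2, 2, 2]`, while the expected decoder output `tokenize("hi", v, 6, add_end=True)` gives `[3, 4, 1, 2, 2, 2]`.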
And these four other parameters determine whether we should include a start token and an end token in our encoder and decoder inputs and outputs; False means we do not include it, and True means we do. We first take the batch of sentences x and pass it through the encoder to get the list of character vectors that are eventually going to be context-aware. We then pass those vectors into the decoder along with the batch of Kannada sentences y to get the output of the decoder, which we will eventually pass through a softmax activation when computing the loss function. If we take a look at the encoder, the input x here is the batch of English sentences. We construct the sentence embeddings: for every sentence in the batch, we map each character to an integer value. Then we pass it through the encoder, where the self-attention mask is just the padding mask, because in the English sentence we are allowed to look both forwards and backwards, so we don't need a look-ahead mask here. The result is a list of contextually aware characters in the form of vectors. The decoder, on the other hand, takes this list of vectors x along with the batch of Kannada sentences y. We perform the sentence embedding, that is, the mapping of characters to indices, and we also indicate whether to include a start token or an end token. This depends on whether the sequence is the input to the decoder or the expected output of the decoder: the input to the decoder includes only a start token, while the expected output does not include the start token but does include the end token.
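The two mask types mentioned above can be sketched with plain Python booleans. This is an illustrative assumption about their shape, not the exact code from the file (the real implementation builds boolean tensors and fills masked positions with a large negative number before the softmax); here `True` marks a position that attention is not allowed to look at.

```python
def look_ahead_mask(seq_len):
    """Decoder self-attention: position i may attend to positions 0..i only."""
    return [[col > row for col in range(seq_len)] for row in range(seq_len)]

def padding_mask(token_ids, pad_id):
    """Block attention to and from padding positions."""
    is_pad = [t == pad_id for t in token_ids]
    n = len(token_ids)
    return [[is_pad[row] or is_pad[col] for col in range(n)] for row in range(n)]
```

For a length-3 sequence, `look_ahead_mask(3)` produces an upper-triangular pattern of `True` above the diagonal, and the decoder self-attention mask would be the elementwise OR of the two masks.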
We then pass it through the decoder, passing in the self-attention mask and the cross-attention mask, and we get the output, which is a set of Kannada character vectors. I'm going to be uploading this entire file to this GitHub repo, which has all of my code from the last dozen or so videos we've made on this; it's essentially the combination of the transformer encoder and transformer decoder files. You can see that I've actually broken those out and spelled out individual outputs, and there are accompanying videos for each of these files, so I do recommend you check them out. For the next video, I did plan on showing you an exact translation, to see this entire thing in action, but I am running into some issues, and I've actually been using ChatGPT to help me try to debug them. I haven't been able to fix them yet, so I'm going to leave that for the next video and ask you, the audience, to help me with something that ChatGPT couldn't really help me with. So stay on the lookout for that video. Until then, thank you all so much for watching and I will see you in the next one. Bye bye.