Hello everyone. Welcome to another episode of Code Emporium, where we're talking more about transformers. I'm now at the point where the dozens of videos we've done before this have culminated in building a transformer that translates from English to a language called Kannada. This was supposed to be the last video, but unfortunately I'm running into some issues, and ChatGPT could not help me. So I thought this would be a great opportunity to talk it through with everyone here so that we can work toward a solution together, if at all possible.

I have this entire code, which I'll go through in a few minutes. Essentially, for every training epoch, this part here is all training, whereas this part is transformer inference. So right here, I'm training a transformer to translate from English to Kannada. This is the initial loss at iteration zero. Here is the training English input sentence, and here is the labeled Kannada output sentence, which is what the model should produce. And this is what the model actually produces at this point during the training iteration: a bunch of garbage, which is what we expect.

During evaluation, I put the model into evaluation mode over here. So you can see the transformer is in evaluation mode, and then I evaluate on one specific sentence every time, "should we go to the mall?", just to see what kind of translation it gives. This is the actual inference-time performance of the transformer. And lo and behold, when we do that, we also get absolute gibberish, which is exactly what we expect at this point.

As training progresses, we can see that the loss goes down, which is a good sign, and you can also see that the training prediction sentence is getting closer to the training label. It might not be super apparent here, but as we go to later iterations, even if you can't read the language, you can see that they look more and more similar to each other. However, for "should we go to the mall?", the model is now predicting just the end-of-sentence token. It's basically saying the translation of this sentence is a period. That's not right. In fact, it goes into this very strange state of predicting only the end-of-sentence token very early on during training. After about 1,000 iterations or so it's still predicting some weirdness, very different from the training prediction, yet from around the 1,300th iteration it's just the end-of-sentence token.

So my big question, and the reason I've been racking my brain for about a month now, is: why exactly are we predicting the end-of-sentence token during inference, whereas during training the outputs behave as they should? Training looks exactly like what we would expect: as training goes on, the loss decreases, and the prediction output during training gets better and better. That's the question I have. Everything from here on is just going to describe the code, and I also want to pose this question to you to see whether you can help me solve it. This would be great as a community exercise we can work through together. So first, we start by importing the transformer.
This is the transformer code that we have been writing for the past ten or so videos, and the culmination of it is this roughly 300-line piece of code. I have described all of this code line by line in my past videos, so please do check them out; it's in the playlist called "Transformers from Scratch". We have the source English sentences as well as the target Kannada sentences on which we want to train. Those are in two files called train.en and train.kn, which can be found here: a list of sentences in English and their corresponding sentences in Kannada. Next we have the start, end, and padding tokens, and we define the entire vocabulary for both languages. We then read each of these sentences into a list, so English sentences and their corresponding Kannada translations, and I clean up the sentences to make sure we're only keeping the valid ones. So I'm seeing about 164,000 valid English-Kannada translation pairs, which is what I'm using as my dataset for training the transformer. We're also going to batch this data.

I call my transformer roughly like this. We have d_model, which is the internal vector size for every single character, since we'll be generating one character at a time in this model; each character is represented by a 512-dimensional dense vector. We have a batch size of 30, which is just to parallelize and speed up training. ffn_hidden is the size of the feed-forward layers within the transformer, around 2,000 neurons. num_heads is the number of parallel attention heads during multi-headed attention, both self-attention and cross-attention; we have eight parallel branches, and the more there are, the more ways we can perform attention in parallel, so the faster it will be. drop_prob is the dropout probability we pass in so that we can randomly turn off neurons, which helps the network generalize better rather than overfit. num_layers is the number of encoder and decoder layers; in this case I've reduced it to one. max_sequence_length is the maximum number of characters in a sentence, in either English or the target Kannada. Then we get the length of the vocabulary, which is the total number of characters that can possibly be generated in the language; that may be a few dozen, but it can be as high as a few hundred depending on the language. Finally, we call the constructor of the transformer to actually build our transformer model.

Now we can use PyTorch datasets. I create a text dataset just to process the text data; PyTorch requires us to create these dataset objects so that we can standardize the way information is fetched, retrieved, and trained on. We then define our cross-entropy loss. If we go to the model architecture over here, you see that the decoder generates one character at a time (this could also be one word at a time, or one byte-pair encoding at a time). Each generated output needs to be compared to the label, and from that comparison, whether or not it is the correct character, we get a loss. We use the cross-entropy loss here and then backpropagate through our network, processing a batch of sentences at a time.
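To make those hyperparameters concrete, here is a minimal sketch of what that constructor call might look like. The Transformer class itself is the one built across the earlier videos, so the exact argument names and order below are illustrative rather than the real signature, and helpers like kannada_vocabulary, english_to_index, and kannada_to_index are assumed to exist in the rest of the project.

```python
# Illustrative sketch only: the real Transformer class and its exact signature
# come from the "Transformers from Scratch" series; names here are placeholders.
d_model = 512               # dense vector size for each character
batch_size = 30             # sentences processed in parallel
ffn_hidden = 2048           # width of the position-wise feed-forward layers
num_heads = 8               # parallel attention heads (self- and cross-attention)
drop_prob = 0.1             # dropout probability for regularization
num_layers = 1              # encoder/decoder layers (reduced to 1 while debugging)
max_sequence_length = 200   # maximum characters per sentence
kn_vocab_size = len(kannada_vocabulary)  # assumed list of all Kannada characters

transformer = Transformer(d_model, ffn_hidden, num_heads, drop_prob,
                          num_layers, max_sequence_length, kn_vocab_size,
                          english_to_index, kannada_to_index,   # assumed char-to-index maps
                          START_TOKEN, END_TOKEN, PADDING_TOKEN)
```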
Now we initialize the parameters of our transformer, define our optimizer with a learning rate, and also use a GPU. This code over here is used to generate the masks. When we look at the transformer architecture, we see these orange boxes; there are three of them, but really two types of masks. One is a padding mask and the other is a look-ahead mask. For the padding mask: since the maximum sequence length is 200, we need to pad our sequences so that we can process fixed-size tensors, but we don't want those padding tokens to participate in the attention mechanism. We don't want words in our sentence paying attention to padding tokens, so we need to block them out by setting them to, say, negative infinity. At the same time, we have the look-ahead mask, which says that for a given character we should not look at the characters that come after it. This is particularly important for the decoder's self-attention, because looking ahead would be cheating: during training we have the whole target sentence, but when we're generating a specific word, we should only have access to the words that come before it, not any word that comes after it, so we should not be able to attend to those. That's the gist of what this function does. You can see that for the encoder self-attention mask I have a padding mask; for the decoder self-attention mask we have the look-ahead mask and a padding mask; and for the decoder cross-attention mask we just have a padding mask. We set all of the masked values to negative infinity; practically speaking, we set them to a very large negative number, close to negative infinity in some regard. If we used actual negative infinity, we would get NaN losses, because there are softmax operations in between and we would hit zero-divided-by-zero errors, which lead to not-a-number losses and no training in this network. So to stabilize those values, we use that large negative number in place of negative infinity.

Then we have our training loop. In the training loop we iterate over epochs, generate our data, get a batch of data, and put the transformer in train mode. Then we make a prediction by calling the transformer, and we get the labels, which are the corresponding ground truth: this is the predicted sentence, and this is the actual label for that sentence. We compare these two to get a loss. We don't care about the loss from the padding tokens at all, so we set those to zero; whatever non-zero losses remain are the actual losses from the individual tokens, and we take their average. That gives us the 5-point-something value we saw down here. We then take this loss and perform the backpropagation step. After that, we perform an inference: every hundred iterations, I infer one sentence, just to combine training and inference in one shot, so that when debugging I don't have to train the model the entire time and then separately try to perform inference.
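Since the mask is the prime suspect, here is a self-contained sketch in PyTorch of the kind of mask logic described above. It is not necessarily line-for-line what my function does; shapes and names are illustrative, but it follows the same recipe: a padding mask for encoder self-attention, look-ahead plus padding for decoder self-attention, padding only for cross-attention, with masked positions set to a large negative number rather than true negative infinity.

```python
import torch

NEG_INFTY = -1e9  # large negative number instead of -inf, to avoid NaN losses after softmax

def create_masks(eng_batch, kn_batch, max_sequence_length=200):
    """Sketch of the mask recipe described above (character-level, shapes illustrative)."""
    num_sentences = len(eng_batch)
    # Look-ahead mask: position i may only attend to positions <= i.
    look_ahead_mask = torch.triu(
        torch.ones(max_sequence_length, max_sequence_length), diagonal=1).bool()

    encoder_padding_mask = torch.zeros(
        num_sentences, max_sequence_length, max_sequence_length, dtype=torch.bool)
    decoder_padding_mask_self = torch.zeros_like(encoder_padding_mask)
    decoder_padding_mask_cross = torch.zeros_like(encoder_padding_mask)

    for idx in range(num_sentences):
        eng_len, kn_len = len(eng_batch[idx]), len(kn_batch[idx])
        eng_pad = torch.arange(eng_len, max_sequence_length)  # padded English positions
        kn_pad = torch.arange(kn_len, max_sequence_length)    # padded Kannada positions
        # Block padded positions along both rows and columns of each attention matrix.
        encoder_padding_mask[idx, :, eng_pad] = True
        encoder_padding_mask[idx, eng_pad, :] = True
        decoder_padding_mask_self[idx, :, kn_pad] = True
        decoder_padding_mask_self[idx, kn_pad, :] = True
        # Cross-attention: rows are Kannada positions (queries), columns are English (keys).
        decoder_padding_mask_cross[idx, :, eng_pad] = True
        decoder_padding_mask_cross[idx, kn_pad, :] = True

    # Convert boolean masks into additive masks: 0 where allowed, -1e9 where blocked.
    encoder_self_attention_mask = encoder_padding_mask.float() * NEG_INFTY
    decoder_self_attention_mask = (look_ahead_mask | decoder_padding_mask_self).float() * NEG_INFTY
    decoder_cross_attention_mask = decoder_padding_mask_cross.float() * NEG_INFTY
    return encoder_self_attention_mask, decoder_self_attention_mask, decoder_cross_attention_mask
```

These additive masks are then added to the attention scores before the softmax, which is why -1e9 effectively zeroes out the blocked positions without producing NaNs the way a true negative infinity can.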
So I put the model in evaluation mode and then try to translate "should we go to the mall?". The evaluation translation of this should end up in this kn_sentence variable, which, as we have seen, eventually devolves into just a period, the end token. Why that happens is the question I'm leaving to everyone here. I have asked this exact question on Stack Overflow, even with a bounty, well over a month ago, and still I haven't gotten a single reply. So I'm asking everyone here. I even went as far as asking ChatGPT. First I asked whether it could answer this specific question from Stack Overflow; it turns out ChatGPT cannot actually read the link, but it tries to infer a response based on whatever English is in the link itself. So it gives a result, and in that result there were some interesting sentences. It says to check that the input sequence is properly padded and that the padding tokens are masked during attention calculations. So I asked ChatGPT to expand on exactly that statement, and it went into a spiel about how, for example, when we pass sentences into the encoder, we don't really need a start token or an end token, but we should be completely masking the padding tokens for self-attention along both the rows and the columns. This got me thinking that maybe something is wrong with my mask, and that's where I want to focus now: digging into exactly this mask issue. So I asked what the mask should look like if we did cross-attention between these example vectors over here, and it eventually showed me how it should look if this were the cross-attention. You can see that all these zeros are just the padding mask, and these entries here are roughly how the attention matrix looks for the non-padding tokens. Now, the issue with this particular ChatGPT response is that you actually need to make the masked values a very large negative number, otherwise you will get NaNs. With some coaching, ChatGPT eventually does generate negative 1e9, which is negative 10 to the power of 9, which is how it should be. So ChatGPT does sometimes generate wacky outputs, and you can see there are some random ones in the middle here, but with appropriate coaching you can get it aligned with what's actually true.

All of this basically tells me that there may be something wrong with my mask, or with the implementation of the mask specifically, which would lie in this function over here. However, I've tried finagling this every which way I possibly could, so I'm opening the floor to everyone here. I will be leaving links to all of the data, the code, and the Stack Overflow question in the description down below, as well as the playlist of the previous videos. If anyone can help solve this problem, I will be eternally grateful, and I won't have to leave it to the oh-so-non-toxic environment of Stack Overflow. That'll do it for now. I'm opening this up to you, as I've said for the tenth time, but thank you all so much for watching. Please do like the video if you think I deserve it, please check out the playlist of all the videos before this one for building this transformer from scratch, and I will see you all with a fresh start and new videos coming out very soon.
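Before I sign off, here is roughly what that evaluation step looks like as a sketch: greedy, character-by-character decoding of the one test sentence. The forward-call keyword arguments and helpers like create_masks, index_to_kannada, and END_TOKEN are assumed from the rest of the project, so treat the exact names below as placeholders rather than the definitive implementation.

```python
# Sketch of the greedy, character-by-character inference described above.
# Argument names and helpers (create_masks, index_to_kannada, END_TOKEN, etc.)
# are assumptions borrowed from the rest of the project and may differ slightly.
transformer.eval()

def translate(eng_sentence, max_sequence_length=200):
    eng_sentence = (eng_sentence,)
    kn_sentence = ("",)  # start with an empty Kannada sentence
    with torch.no_grad():
        for step in range(max_sequence_length):
            enc_mask, dec_self_mask, dec_cross_mask = create_masks(
                eng_sentence, kn_sentence, max_sequence_length)
            predictions = transformer(eng_sentence, kn_sentence,
                                      enc_mask, dec_self_mask, dec_cross_mask,
                                      enc_start_token=False, enc_end_token=False,
                                      dec_start_token=True, dec_end_token=False)
            # Pick the most likely next character at the current position.
            next_token_index = torch.argmax(predictions[0][step]).item()
            next_token = index_to_kannada[next_token_index]
            if next_token == END_TOKEN:
                break
            kn_sentence = (kn_sentence[0] + next_token,)
    return kn_sentence[0]

print(translate("should we go to the mall?"))  # in my runs this collapses to the end token almost immediately
```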
Thank you all so much and take care. Bye.