7 years, 250 videos, and 100,000 subscribers later, I bring you this special. In this video, we are going to talk about transformer neural networks, from what they are and why they exist, to how you can build a transformer neural network from scratch using NumPy and PyTorch. If you're new here, hi, I'm Code Emporium, but if you're not new here, thank you all so much for supporting me over the years. Now let's get to learning some transformer neural networks. Recurrent neural nets are feedforward neural networks rolled out over time. As such, they deal with sequence data, where the input has some defined ordering. This gives rise to several types of architectures. The first is vector to sequence models. These neural nets take in a fixed size vector as input, and output a sequence of any length. In image captioning, for example, the input can be a vector representation of an image, and the output sequence is a sentence that describes the image. The second type is a sequence to vector model. These neural networks take in a sequence as input and spit out a fixed length vector. In sentiment analysis, the movie review is an input, and a fixed size vector is the output indicating how good or bad this person thought the movie was. Sequence to sequence models are the more popular variant, and these neural networks take in a sequence as input and output another sequence. So for example, language translation. The input could be a sentence in Spanish, and the output is the translation in English. Do you have some time series data to model? Well, RNNs would be the go to. However, RNNs have some problems. RNNs are slow. So slow that we use a truncated version of back propagation to train them. And even that's too hardware intensive. And also, they can't deal with long sequences very well. We get gradients that vanish and explode if the network is too long. In come LSTM networks in 1997, which introduced a long short term memory cell in place of dumb neurons. This cell has a branch that allows past information to skip a lot of the processing of the current cell and move on to the next. This allows the memory to be retained for longer sequences. Now to that second point, we seem to be able to deal with longer sequences well. Or are we? Well, kind of. Probably on the order of hundreds of words instead of a thousand words. However, to the first point, normal RNNs are slow. But LSTMs are even slower. They're more complex. For these RNN and LSTM networks, input data needs to be passed sequentially, or serially, one after the other. We need inputs of the previous state to make any operations on the current state. Such sequential flow does not make use of today's GPUs very well, which are designed for parallel computation. So question: how can we use parallelization for sequential data? In 2017, the transformer neural network architecture was introduced. The network employs an encoder-decoder architecture much like recurrent neural nets. The difference is that the input sequence can be passed in parallel. Consider translating a sentence from English to French. I'll use this as a running example throughout the video. With an RNN encoder, we pass an input English sentence one word after the other. The current word's hidden state has dependencies on the previous word's hidden state. The word embeddings are generated one time step at a time. With a transformer encoder, on the other hand, there is no concept of time step for the input.
We pass in all the words of the sentence simultaneously and determine the word embeddings simultaneously. So how is it doing this? Let's pick apart the transformer architecture. I'll make multiple passes on the explanation. The first pass will be like a high-level overview, and the next rounds will get into more details. Let's start off with input embeddings. Computers don't understand words. They get numbers, they get vectors, and matrices. The idea is to map every word to a point in space where words similar in meaning are physically closer to each other. The space in which they are present is called an embedding space. We could pre-train this embedding space to save time, or even just use an already pre-trained embedding space. This embedding space maps a word to a vector. But the same word in different sentences may have different meanings. This is where positional encoders come in. It's a vector that has information on distances between words in the sentence. The original paper uses sine and cosine functions to generate this vector. But it could be any reasonable function. After passing the English sentence through the input embedding and applying the positional encoding, we get word vectors that have positional information, that is, context. Nice. We pass this into the encoder block, where it goes through a multi-headed attention layer and a feed-forward layer. Okay, one at a time. Attention. It involves answering what part of the input should I focus on. Say we are translating from English to French, and we are doing self-attention, that is, attention with respect to oneself. The question we want to answer is, how relevant is the ith word in the English sentence to other words in the same English sentence? This is represented in the ith attention vector, and it is computed in the attention block. For every word, we can have an attention vector generated which captures contextual relationships between words in the sentence. So that's great. The other important unit is a feed-forward net. This is just a simple feed-forward neural network that is applied to every one of the attention vectors. These feed-forward nets are used in practice to transform the attention vectors into a form that is digestible by the next encoder block or decoder block. Now that's the high-level overview of the encoder components. Let's talk about the decoder now. During the training phase for English to French, we feed the output French sentence to the decoder. But remember, computers don't get language. They get numbers, vectors, and matrices. So we process it using the input embedding to get the vector form of the word. And then we add a positional vector to get the notion of context of the word in a sentence. We pass this vector, finally, into a decoder block that has three main components, two of which are similar to the encoder block. The self-attention block generates attention vectors for every word in the French sentence to represent how much each word is related to every word in the same sentence. These attention vectors and vectors from the encoder are passed into another attention block. Let's call this the encoder-decoder attention block, since we have one vector from every word in the English and French sentences. This attention block will determine how related each word vector is with respect to each other. And this is where the main English to French word mapping happens. The output of this block is attention vectors for every word in the English and French sentences.
Each vector represents the relationships with other words in both languages. Next, we pass each attention vector to a feedforward unit. This makes the output vector more digestible by either the next decoder block or a linear layer. Now the linear layer is, surprise, surprise, another fully connected feedforward layer. It's used to expand the dimensions into the number of words in the French language. The softmax layer transforms it into a probability distribution, which is now human-interpretable. And the final word is the word corresponding to the highest probability. Overall, this decoder predicts the next word, and we execute this over multiple time steps until the end of sentence token is generated. That's our first pass over the explanation of the entire network architecture for transformers. But let's go over it again, but this time introduce even more details, going deeper. An input English sentence is converted into an embedding to represent meaning. We add a positional vector to get the context of the word in the sentence. Our attention block computes the attention vectors for each word. The only problem here is that the attention vector may not be too strong. For every word, the attention vector may weight its relation with itself much higher. It's true, but it's useless. We are more interested in interactions with different words, and so we determine like eight such attention vectors per word, and take a weighted average to compute the final attention vector for every word. Since we use multiple attention vectors, we call it the multi-head attention block. The attention vectors are passed in through a feedforward net one vector at a time. The cool thing is that each of these vectors is independent of the others, so we can use some beautiful parallelization here. Because of this, we can pass all our words at the same time into the encoder block, and the output will be a set of encoded vectors for every word. Now the decoder. We first obtain the embedding of French words to encode meaning. Then add the positional value to retain context. They are then passed to the first attention block. The paper calls this the masked attention block. Why is this the case? It's because while generating the next French word, we can use all the words from the English sentence, but only the previous words of the French sentence. If we were to use all the words in the French sentence, then there would be no learning. It would just spit out the next word. So while performing parallelization with matrix operations, we make sure that the matrix masks the words appearing later by transforming them into zeros, so the attention network can't use them. The next attention block, which is the encoder-decoder attention block, generates similar attention vectors for every English and French word. These are passed into the feedforward layer, linear layer, and the softmax layer to predict the next word. That's pass two over the architecture explanation. I hope you are understanding more and more details here. Now for the next pass, where we go even deeper. How exactly do these multi-head attention networks look? The single-headed attention looks like this. Q, K, and V are abstract vectors that extract different components of an input word. We have Q, K, and V vectors for every single word. We use these to compute the attention vectors for every word using this kind of formula. For multi-headed attention, we have multiple weight matrices, WQ, WK, and WV.
So we will have multiple attention vectors, Z, for every word. However, our neural net is only expecting one attention vector per word. So we use another weight matrix, WZ, to make sure that the output is still one attention vector per word. Additionally, after every layer, we apply some form of normalization. Typically, we would apply a batch normalization. This smoothens out the loss surface, making it easier to optimize while using larger learning rates. This is the TLDR, but that's what it does. But we can actually use something called layer normalization, normalizing across the features of each sample instead of across the samples. It's better for stabilization. Now that we have a brief overview in mind, we can now start delving into individual components of the transformer neural network. So let's start with the attention mechanism. This here is a recurrent neural network that's rolled out. And the X's are the inputs, the O's are the outputs, the H's are the hidden states, and the Y is the training label. Recurrent neural networks used to be the state of the art for sequence to sequence modeling. Sequences can be an ordered set of tokens, which could be like a set of words to form a sentence, for example. And so these recurrent neural networks in many applications of natural language processing were the state of the art for sequence to sequence tasks. However, they suffer from two main disadvantages. So the first is that they're slow. We need to feed these inputs one at a time in order to generate the outputs sequentially one at a time. Also, their training algorithm is pretty slow, too. We use a truncated version of back propagation to train them. And the algorithm is called truncated back propagation through time. But probably a more pressing issue is that for these vectors that are generated intermediately for every word, in the case of a word-level language model, we're not sure if they actually truly represent the context of a word itself. After all, the context of a word depends on the words that come before it, as well as the words that come after it. But it's very clear that from a recurrent neural network perspective and architecture, we're only getting the signals from the words that come just before it. Even the bi-directional recurrent neural networks kind of suffer from an issue here because they just look at left to right and right to left context separately and then concatenate them. So there might be some true meaning that's kind of lost when generating these vectors for every word. One way to improve the quality of the vectors generated is via an attention mechanism. So for example, let's say that we have an input sentence that is "My name is Ajay." It's four words, where each of them can be represented by their own vectors. Using attention, we can decide which parts each word needs to focus on. In this case, there's like a table, four by four, where the bright spots are the spots that this word is focusing on. So for example, my is actually focusing a lot on the word name, and it's going to use the context of name in order to incorporate it in its own feature vector. And the same thing is with Ajay, where Ajay and name are actually quite closely related. And so the vector that corresponds to Ajay is going to incorporate some more context with respect to name, which comes before it. And so using attention, we can have every word vector better incorporate the context either before it or after it.
All of it can be incorporated much better in their vector than in the corresponding recurrent neural network counterpart. This form of attention here is self attention, because we are attending over the same sentence that we are using as input. However, attention has many other forms and can also be used in other applications, including computer vision. The transformer neural network architecture has the attention piece at its crux. Let's actually walk through an example where we're going to translate this English sentence to a Kannada sentence. Kannada is a regional language in India. I am from the state of Karnataka within India where this is spoken. So we're going to just work with that language and you're going to learn some Kannada. So that's going to be fun. We now have the input sentence, "My name is Ajay." We pass all these words simultaneously into the encoder of the transformer neural network architecture. This will generate four vectors, one for each word. Now, technically, the way that's implemented is that these transformer architectures are going to generate word pieces or subwords or byte pair encodings, which are like broken down versions of these words and not full words themselves. But I'm just going to call this a word level language model just for simplicity here. Now that we have these vectors generated, we then pass these all simultaneously into the decoder architecture. And we start with a simple start token. Passing this in and using these vectors, we now generate the first word. In this case, it needs to generate the Kannada translation of this sentence. And so it'll start by generating the first word, which is Nanna. This means my in Kannada. This is now taken as an input into the decoder for the next phase. And it's going to generate the next word, which is Hesaru. This means name in Kannada. And then once again, it takes this word as the input in order to generate the next word, which is Ajay. And so transformers can be used to translate from English to any other language like Kannada and also for other sequence to sequence tasks. Now, the part of the video that we want to really focus on is this encoder piece. In fact, I'm going to break down this encoder piece further by saying here are the four inputs: "My name is Ajay." After positional encoding, we're going to get these yellow vectors. And typically the size of the vectors given in the paper is about 512 dimensions each. These are then passed into our encoder. And the encoder is going to generate another set of vectors. The idea here is that through this attention mechanism, these vectors are going to be much more context aware and hence higher quality than these vectors over here. And specifically the main crux of the reason is this multi-headed attention part. And so we're going to dive into some code, intuition and math behind exactly how this works. Now, one thing to note from this transformer architecture is how exactly does this architecture overcome the disadvantages of the recurrent neural networks that I mentioned before. So the first disadvantage I mentioned was slow training because of sequential inputs and sequential outputs. In this case, we can actually process data in parallel and so we can make use of modern GPUs. And as I mentioned before, in order to make sure that these vectors are of higher quality and more context aware, we have the attention mechanism that's built in. So let's dive further into this attention mechanism.
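As a reference for everything that follows, the single-head attention computation shown on screen (the "kind of formula" mentioned in the overview) is the scaled dot-product attention from the original paper:

\[ \text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \]

Here d_k is the dimension of the key vectors, and the square root in the denominator is the scaling we'll talk about in a moment.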
Now, the multi-headed attention part in the transformer architecture kind of looks like this from the main paper, but it's pretty confusing. So I'm going to be breaking this down. Essentially, every single word that is input to our transformer is going to have three vectors. We have a query vector, which indicates what am I looking for? We have a key vector, which is going to say, what can I offer? And then we have the value vector, which is what I actually offer. Now, jumping into a Colab notebook, we're going to be implementing this all with NumPy. So we have a query vector, a key vector and a value vector that is randomly initialized. L is going to be the length of the input sequence. In this case, it's "My name is Ajay." So I put it as four. And then the size of each of these vectors, for illustrative purposes, I've put it as being eight. And I have randomly initialized via the normal distribution using the randn function. Doing so, you kind of get these vectors that look like this. So for every single word, this could be my, for example, it's going to have an eight cross one vector for the value, for the key it's also going to have another eight cross one vector, and then for Q, it's also going to have an eight cross one vector. All of them for the word my, and the same for name, is, and Ajay. Next, let's talk about some self-attention. In order to create an initial attention matrix, we need every word to look at every single other word, just to see if it has a higher affinity towards it or not. And this is represented by the query, which is for every word, it is what I am looking for. And then a key, which is what I currently have. This product leads to a four cross four matrix because we had a sequence of four words, "My name is Ajay." And in each case, it's going to be proportional to exactly how much attention we want to focus on each word. Now, for example, here, this first line is going to be for the my vector and how much it's going to focus on other vectors. In this case, it's going to focus the most on the word name. And similarly, we can see that for other cases. This here is the huge crux of the attention figure that I showed earlier. Now, second is, why do we need this little denominator, the square root of some dimension of Q and K? Well, this is because we want to minimize the variance and then stabilize the values of this Q dot K transpose matrix. We can actually see that by looking at the variance of the query vector, the key vector, and the multiplication of both of them. And while the variances of the query and key vectors are close to one, the variance of the multiplication is much higher. And so in order to make sure that we stabilize these values and reduce the variance, we divide by the square root of the dimension of the query vector. And so you can see that these values are now much more in the same range. And so if we actually apply this scaling, you'll see that the vectors generated will now have values that are also of much lower variance and in the same range. The next step we can talk about is masking. So masking is required specifically in the decoder part of the transformer neural network so that we don't look at a future word when trying to generate the context of the current word. This is because it would be considered cheating. In reality, in the real world, you don't know the words that are going to be generated next. So you can't really create your vectors based off of those words.
However, for the encoder, masking isn't really required because all of our inputs are passed into the transformer simultaneously. Still, let's actually walk through this process of what's going on. So first, I'm creating this triangular matrix with all of the values below the diagonal as 1 and above the diagonal as 0. And this will simulate the fact that I just mentioned where, for example, my in the sentence "My name is Ajay" can only look at itself and nothing else, name can only look at my name, is can only look at my name is, and Ajay can only look at my name is Ajay. Now I'm actually going to transform this such that every single one becomes a zero and every zero becomes negative infinity. I did this because if you apply this mask, you'll notice that it's the exact same values for the lower diagonal as it was without the mask over here. But the values that are above that mask are just going to be considered as negative infinity, which means that we're not really going to be getting any context from them. Specifically, why negative infinity and why zero? Well, first, it's because I'm adding the mask here. And second, it's also because of the softmax operation that we're going to be performing next. The softmax operation is used to convert a vector into a probability distribution so that their values add up to one and they're also very interpretable and stable. So I've written out an interpretation of exactly that math over here, and on applying this math, we kind of get the final attention values over here. Now I've applied the mask, but let's say hypothetically if I didn't apply the mask, let's just see how it looks. The numbers will look like this, where you'll see every row is going to add up to one because it's a probability distribution. But if you apply the mask right now, you'll notice that the attention vector actually doesn't incorporate any word that comes after it, because we don't want any context after it. This basically means that the word my is only going to focus on the first word, name can only focus on my and name and it's getting the weights as such, is can focus on the first three, and Ajay can focus on the first four. This is required for the decoder, but we don't need the mask for the encoder. Now if we multiply the attention matrix and the value matrix, we actually get this new set of vectors which should better encapsulate the context of a word. You can compare the before attention with the after attention here. And because this is masked, you'll notice that the first vectors like my are going to be almost exactly the same. Whereas as you go to the later words, you'll notice how different these vectors actually become. I've now put all of this logic into a function right here which takes in the query vector, the key vector, the value vector and an optional mask. And this can be used for both the encoder and the decoder. In the encoder, we don't really need to pass in this mask, and we're going to have the new vectors that look like this and the attention vectors that can actually pay attention to any word. But for the decoder, we can also go back and pass in the mask such that we make sure that no word is able to get context from the words that come after it. And so we have another type of value matrix that's generated. What we did right now was just for a single attention head.
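Here is a minimal NumPy sketch of that single attention head, assuming the same illustrative sizes as above (a sequence of 4 words with 8-dimensional query, key, and value vectors); the variable names are mine, not the exact notebook code.

```python
import numpy as np

L, d = 4, 8   # 4 words ("My name is Ajay"), 8 dimensions per vector

np.random.seed(0)
q = np.random.randn(L, d)   # what each word is looking for
k = np.random.randn(L, d)   # what each word can offer
v = np.random.randn(L, d)   # what each word actually offers

def softmax(x):
    # Row-wise softmax: turns each row of scores into a probability distribution.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    # Affinity of every word to every other word, scaled to keep the variance small.
    scaled = q @ k.T / np.sqrt(d_k)          # 4 x 4
    if mask is not None:
        scaled = scaled + mask               # -inf above the diagonal hides future words
    attention = softmax(scaled)
    return attention @ v, attention          # new, more context-aware value vectors

# Decoder-style mask: 0 on and below the diagonal, -inf above it.
mask = np.triu(np.full((L, L), -np.inf), 1)

new_v, attention = scaled_dot_product_attention(q, k, v, mask=mask)
print(attention)   # each row sums to 1 and puts no weight on the words after it
```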
And like that, we can have multiple attention heads and then stack their results on top of each other in order to get multi-headed attention. And according to the paper, this is actually what is being done in the actual production transformer neural network architecture. Now that we have an idea of attention and even how to code it out in Python, let's actually look at multi-headed attention. I'm just going to take an example word vector over here. This could be like one of the words of "My name is Ajay." Let's just say that it's the vector for name. This is going to be a 512 dimensional vector which we break down into three component vectors. And every word like this is going to have a query vector which represents what am I looking for, the key vector which is what I can offer, and the value vector which is what I actually offer. And each of them are their own 512 cross 1 vectors. Now in actuality, these vectors are actually each broken up into separate pieces. In this case, it's broken down into eight pieces and each piece is going to be a part of creating an attention head. So we have eight attention heads, so we break each vector down into eight pieces. And now each of these is then fed into some attention unit which we will code out very soon, along with all the other words too. This is just a breakdown of the word name, but we also have the my, the is, and the Ajay, and all of them will be broken down in a very similar way and passed to an attention unit. And we're going to generate for each head this attention matrix, which is going to be sequence length by sequence length for the sentence. In this case, it's a four cross four because there are four words in the sequence. And each of these rows will add up to one because it's a probability distribution. And there will be eight such attention matrices as we have eight attention heads in this multi-head attention system. This is then going to generate other output vectors that are concatenated in order to generate a vector that actually has very good contextual awareness. Making this input word vector more contextually aware is actually the goal of this attention unit and this multi-head attention mechanism. Let's now code all of that out. So I'm going to be importing a bunch of libraries and I'm going to be using PyTorch for this demonstration. I now set a few variables. The sequence length is the length of my input sentence. And typically you would set a maximum sequence length so that all of your vectors are going to be fixed size. In this case, it's going to be four. That's "My name is Ajay." The batch size is going to help in parallel processing. In this case, I'm setting it to one for demonstration purposes. We then have the input dimension, which is the vector dimension of every word that goes into the attention unit. Then d_model is the output dimension of the attention unit for every single word. And now this X over here is going to be some randomly sampled input since I'm not going to be creating the positional encoding and the input phase right now. I'm just generating some random data. The input is going to be of batch size cross sequence length cross input dimension, which is one cross four cross 512. Note that this X value is not the value over here, but it's the value that's input at this point just before we get into the multi-head attention phase. Now we're going to be mapping the input from this dimension of 512 to three times the model dimension, which is again three times 512.
And this is done to create the query vector, the key vector, and the value vector all concatenated. And all of them have all the eight attention heads, which we will split up later. We then pass the input to this layer to generate the QKV vector. And this is its dimension. It's one batch, four words, and each word vector is 1,536 in size. Now I wanted to get an idea of what are the kind of values that we see here. And since I'm sampling from a random normal distribution, you'll see values that look like this. But this distribution of all the values in this entire tensor is going to be very different, depending on how we generate the data, depending on the positional encodings and the inputs. Now we have eight attention heads that we're considering, and each is going to be 512 divided by eight, which is going to be 64. And so we will now reshape our QKV matrix to break down the last dimension into a product of the number of heads, and of course, three times the head dimension. And this three, just as a reminder, exists because it's a combination of the query vector, the key vector, and the value vector. And so we get a tensor that looks like this in shape. Now I'm going to be switching around just the second and the third dimensions so that the head is over here, and then the number of words in the sequence is over here. So it's easier to perform parallel operations on these last two dimensions. Now we obtain the query, key, and value vectors individually by basically breaking down this entire tensor by its last dimension, and hence the input is minus one. And so this 192 breaks down into three components of 64 dimensions each. And this is where you see the query, key, and value vectors broken down, as I mentioned before. Now we're actually going to perform the attention mechanism. A lot of this process is kind of what I mentioned in the last video, so you can look at more details there if you want to get a much deeper explanation, but I'll still go through some parts of it, since I'm doing it in PyTorch here, whereas there I did it in NumPy. So first of all, we're going to get the size of one of these vectors, which is dk. This should be 64 in our case. Now every word has a query vector, and it's going to compare its query vector to every other word's key vector, and that's represented by this matrix multiplication over here. And we can represent that by this code. Now notice that you have to use the transpose function, and you can't just do like a .T to transpose, because these are tensors that are four dimensional and not simply just two dimensional matrices. And so we specify the transpose along with the dimensions along which we want to transpose. In this case, we wanted to transpose the last two dimensions, which are the sequence length, as well as the dimension size, or the head dimension size, for every one of these words in every head. And so the last two dimensions of the query vector are like four cross 64, and the transposed key will be 64 cross four. And so we'll end up with a four cross four matrix, which is basically the sequence length by the sequence length. We're scaling here in order to make sure that the variance of these values is much smaller, so that these values just don't go out of control, especially since this is a trainable machine learning model. This here is just a dummy example just to show how important transpose is, and also that it takes in these two values specifying the dimensions along which you want to transpose your tensor.
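Pulling those steps together, here is a hedged PyTorch sketch of everything up to the scaled scores, assuming the illustrative shapes above (one batch, four words, a 512-dimensional model, eight heads); the variable names are mine and may differ slightly from the notebook.

```python
import math
import torch
import torch.nn as nn

batch_size, sequence_length = 1, 4          # "My name is Ajay"
input_dim = d_model = 512
num_heads, head_dim = 8, 512 // 8           # 64 dimensions per head

# Random stand-in for the positionally encoded input vectors.
x = torch.randn(batch_size, sequence_length, input_dim)        # 1 x 4 x 512

# One linear layer produces the concatenated query, key, and value vectors.
qkv_layer = nn.Linear(input_dim, 3 * d_model)
qkv = qkv_layer(x)                                             # 1 x 4 x 1536

# Break the last dimension into the 8 heads, then move heads before the sequence.
qkv = qkv.reshape(batch_size, sequence_length, num_heads, 3 * head_dim)
qkv = qkv.permute(0, 2, 1, 3)                                  # 1 x 8 x 4 x 192
q, k, v = qkv.chunk(3, dim=-1)                                 # each 1 x 8 x 4 x 64

# Compare every word's query with every word's key, scaled by sqrt(d_k).
d_k = q.size(-1)
scaled = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # 1 x 8 x 4 x 4
print(scaled.shape)
```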
And for that transpose, it doesn't matter about the ordering; as long as the same two dimensions are passed, it'll yield the same result. Now let's talk about masking. So we notice here that in the encoder, we don't really require any kind of masking, whereas in the decoder, we require masking for self-attention. And this is done to ensure that the decoder does not cheat. The goal of the attention mechanism is to gain context from words that are around it. During the encoding phase, we actually have all words which are passed in in parallel simultaneously, so we can generate vectors by taking the context of words that come before it, as well as words that come after. In the decoder, however, we generate words one at a time. So when generating context, we only want to look at words that come before it, because we don't even have the words that come after it. Coming back to the code, we're basically going to take our scaled tensor, which is the 1 cross 8 cross 4 cross 4 tensor that we generated just previously, and create a mask of the same size filled up with negative infinity values. And then we're going to basically take this mask and make it an upper triangular mask, where we leave the values above the diagonal the same as they are, and fill the diagonal and everything below it with just zeros. This diagonal argument just shifts where, relative to the main diagonal, the kept values start. And so if we just print out the mask for a single head, it'll look like this. And note that it's the same exact dimensions as our scaled tensor. So we can add them both together, and we'll get, well, this is the tensor for one head, which will be a 4 cross 4 matrix. It's going to look like this, where we have the lower diagonal elements the exact same as what scaled would have been, because we're just adding zero, and the upper diagonal elements would be whatever the negative infinity is. Now we're doing negative infinity over here specifically because we're going to be doing a softmax operation, which takes exponentials. And so e to the power of zero will become one, and e to the power of negative infinity will become zero, so that you cannot cheat and look forward. And so we can apply that softmax here because these tensor values are all over the place. And when you actually apply the softmax, you'll see that the sum of each of these rows will become one. And in this case, I just tried to do it for this example, where we're taking e to the power of this first term divided by the sum of e to the power of all of these terms, which I've taken here. And you'll see it's like 0.6269. And on applying the actual softmax, you'll see that that's exactly what you get here. Now we apply the softmax, though, using the built-in softmax function. We'll apply it to our new scaled tensor. And the dimension we apply it to is the last dimension, so that it's applied to each row itself, row by row. We now take this value vector, which, remember, is what is actually being offered by every single word, in order to generate the new value vectors. And the idea here is that these new value vectors are going to be much more context aware than the original value vectors and the original input vectors. And so for every batch, for every head, for every word in the sequence, we'll end up with a 64 dimensional vector.
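Something along these lines (a sketch, not the exact notebook code) is what the reusable function described next looks like, with the mask kept optional so the same code serves both the encoder and the decoder.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product(q, k, v, mask=None):
    # q, k, v: (batch, heads, sequence_length, head_dim)
    d_k = q.size(-1)
    scaled = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scaled = scaled + mask              # -inf above the diagonal blocks future words
    attention = F.softmax(scaled, dim=-1)   # every row becomes a probability distribution
    values = torch.matmul(attention, v)     # more context-aware value vectors
    return values, attention

# Toy tensors with the shapes from above: 1 batch, 8 heads, 4 words, 64 dims per head.
q, k, v = (torch.randn(1, 8, 4, 64) for _ in range(3))

# Upper triangular mask: zeros on and below the diagonal, -inf above it.
mask = torch.triu(torch.full((4, 4), float('-inf')), diagonal=1)

values, attention = scaled_dot_product(q, k, v)                           # encoder: no mask
masked_values, masked_attention = scaled_dot_product(q, k, v, mask=mask)  # decoder
print(attention[0, 0], masked_attention[0, 0], sep="\n")
```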
Now I've created a function that does exactly everything that I just described, and I also added in an optional mask argument, depending on whether we're dealing with the encoder or the decoder. So for example, if I just execute this, you'll see that we have an attention vector that kind of looks like this. But let's say that we actually pass in the mask that we just created. In this case, you'll see that the size is the exact same, but the attention vector is now going to be masked for the decoder. So while that was for the encoder, this is for the decoder. And then we can now have all of the value vectors for every single attention head, for every single word, which are 64 dimensional vectors. And what we do now is combine or concatenate all of those heads together. And for eight heads, we're going to now make them 512 dimensional vectors, which is exactly the input dimension. And then, so that these heads can also communicate with each other the information that they've learned, we're just going to pass it through a linear layer, which is just a feed forward layer, 512 cross 512. And this doesn't change the dimension. And so this output vector is now going to be much more context aware than the input vector was. Now I've combined all of this code together as well: this is the same function I wrote before, but I've also written a class called multi-headed attention, which has a constructor that will initialize some of the parameters that I talked about along the video, as well as a forward pass. So when you execute a forward pass of our multi-head attention, it's going to execute the same exact lines of code that we just talked about. Now I kind of executed this for some random inputs where the input dimension was much larger. And the batch size is also something that's more reasonable. I just put it as 30 instead of just one. And I execute the forward pass of this multi-head attention. And you'll see the same kind of results, where the input batch is 30, we have like five words in the sentence, and the input vectors are, let's say, 1024 dimensional each. Then the query, key, and value vectors of all heads are combined to generate a 1,536 dimensional vector. This is then broken down into eight attention heads. We swap the dimensions for the attention heads and the number of words in the sequence. And then we break this down into query, key, and value tensors in our case. And each of them will be like 64 dimensional vectors for every word. We then have the attention and value matrices for every single attention head, which we eventually concatenate. And then we also help them communicate with each other. All right, so now that we've taken a look at multi-headed attention and its code, let's now shift gears towards positional encoding. So let's actually walk through exactly how the initial part of the transformer neural network architecture works so that it kind of motivates positional encodings better. So first we have the sentence that we want to input in English, that is, "My name is Ajay." Now, typically the way that transformers and all these machine learning models work is that they understand numbers, they understand vectors, but they don't understand English words exactly. And so what we would do, in order to make sure that we always pass in a fixed length matrix, is pad the rest of the positions where words are not present with just a dummy character or a dummy padding token.
And this would just be the maximum number of words that is allowed into the transformer. Each of these words is then just one hot encoded. Vocab size is the number of words in our dictionary, that is, the number of possible words that can be used as an input. We now pass this into a feed forward layer where each of these vectors is going to be mapped to a 512 dimensional vector. And the parameters here are learnable via back propagation. And the number of parameters would be the vocabulary size times 512 parameters to learn. Now the output of this would just be a set of 512 dimensional vectors, one for each input in the sequence. And it's to this that we're going to add some positional encoding, which is of the same size. And on doing so, we're now going to get another set of word vectors of the same 512 dimensions. Now for each word vector, we want to generate a query, a key, and a value vector, all of 512 dimensions each. And so we would pass each vector into, well, this query, key, and value set of weights: this one will map an input vector to the query vector, this one will map it to the key vector, and this one will map it to the value vector. And we do this for every single word. And so the total number of vectors here would be three times the maximum sequence length, because it's three for every word. Note here that these green transformations are basically like learnable parameters. In this case, it'd be 512 times 512 learnable parameters for each of these. Now it's from this point that we could probably split each of these vectors into multiple heads and perform multi-headed attention. But I think I've explained some of this concept in another video called multi-head attention for transformers. So please do check that out for more information there. However, for now, I kind of want to just focus more on this positional encoding. And I hope that this flow overall kind of just illustrates exactly where positional encoding fits in and how it fits in. So this is the formulation for computing positional encoding in the transformer: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) for the even dimensions, and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)) for the odd ones. This is actually just computing it for every single element of this matrix that we mentioned before. In this case, we have pos, which is the position of the word in the sequence, i is the index of the dimension, and d_model is the dimension length, which we have taken to be 512. So a big question here is why exactly do we formulate positional embedding in this way? The first reason is periodicity. Sine and cosine functions are periodic functions. So they will repeat after some point of time. What this means is that let's say that we want to look at this positional encoding and we're looking at it for this specific word. That is the third word. Let's say that at some point we're going to compute the attention matrix and try to determine how much attention this word should pay to all other words. Now during this phase, because of periodicity, this word is going to be able to pay attention to, let's say, 5 words after it and then 10 words after it and then 15 words after it in a much more tractable way. Now the second reason is constrained values. Sine and cosine will constrain the values to be between positive one and negative one. But without that, these values, at least in the positive direction, are not bounded. And so what this would mean is that the positional encoding for this vector might be smaller than this next vector, which will be smaller than this next vector, and so on.
And during the time that we compute the actual attention matrices, you'll notice that the vector here is not going to be able to attend to vectors that are very far away from it, and so it will not be able to derive any context from them. The last reason is that it's easy to extrapolate for long sequences. So this here is just going to be a very deterministic formula that's very easy to compute. And even if we haven't seen certain sequence lengths in our training set, we'll still be able to compute encodings for them in our test set. And so because of easy extrapolation, it's used. Now that we got that theory out of the way, let's walk through some PyTorch code to create some positional encodings. So we'll start by importing torch and then we define a max sequence length. This is the maximum number of words that can be passed into our transformer simultaneously. So in my case, the sequence length was four because it was "My name is Ajay," but I'm defining the max length as 10. In reality, this would be in the thousands. d_model is going to be the dimension of the embeddings. It's typically 512, but for illustrative purposes, I just use six. Now these top two are exactly the same formulation that I used to show positional encodings. But honestly, I'm going to rewrite it as just these two formulations so that it becomes easier to see and also easier to program. But they remain the same thing. Let's now start by coding out this denominator for when i is even. So for this, I would just use a range to get a set of values between zero and d_model, that is, zero to six, skipping two. So that's zero, two and four. And so we can now compute the denominator over here by taking 10,000 to the power of all of these values that we get divided by d_model. And that's exactly what I do here. That's 10,000 to the power of everything that we just received here divided by d_model. And so we get a set of values here. We now do the same thing for the odd dimensions. So we compute the odd dimensions, where it's from one to six, skipping two. And so we get one, three and five. And then, performing the same operation right here, we get these numbers. Now, what you'll notice here is that the vector that we got for the even denominator, that is for this, and the vector that we got for this odd denominator are exactly the same. And this kind of makes sense. You'll notice that the odd indices are one more than the even indices. And in the formulation, we always subtract one from the odd indices. So they effectively just became the same thing. And so instead of using an odd denominator and an even denominator, I'm just going to use one, called denominator. And that will be used for both cases. Now let's just determine every single position for the sequence. So we can define every position by just taking all the values from zero up to 10, and then we'll reshape it to be a two dimensional matrix with the second dimension as one. And you'll get this two dimensional matrix here, one for every word. Now we want to divide every position by the denominator value that we computed earlier. And for even cases, we're going to take the sine. And for odd instances, we're going to take the cosine. And so this is going to lead to two 10 cross three matrices. So this is for even positions. And then we have the same exact thing for odd positions. Now what we want, though, is to interleave these two matrices. So for example, for the even position, we want this to be the first index. And then we want this to be the second index.
And then the third index. And then the fourth index. But starting at zero. So it'll be the zeroth index, first index, second index, third index, and so on. In order to do that, I basically stack them together along the second dimension so that the two that we need to stack on top of each other are right next to each other. This will give us a 10 cross three cross two tensor. And we just flatten it. And effectively, we're going to be getting that interleavement here too. And so for our first word, this will be the positional encoding. For the second word, it's this one. For the third word, it's here. And so on. Now I've put together everything that we just talked about into this little cute class here so that it's reusable. And it has the exact same pieces of everything that we just discussed. And we can now just make some calls to the positional encoding, passing the elements into the constructor. And then we just run the forward pass, which will give us the exact same positional encoding matrix. All right. So that was positional encoding and its code. Now that we have a good idea of that, let's now shift gears to layer normalization. I kind of want to focus on these add and norm parts over here. What are we adding and normalizing? So in order to understand that, let's actually blow this encoder up. What we get is this not so complicated architecture that I've drawn by hand, but we'll actually just pick apart what's important here. We have the input, "My name is Ajay." And let's say we want to translate this from English to French. We first pad this up with padding tokens. Each word is then represented by some one hot encoded vector. Now technically, it would not be the words that are represented like this, but typically word pieces called byte pair encodings. But for the sake of simplicity right now, I'm just considering them as one hot word vectors. Because each of these are one hot vectors, they're going to be the same size as the vocabulary itself. That is, all possible words that can possibly occur. We then transform these one hot word vectors into 512 dimensional word vectors. Because all the words are passed in parallel, there is no sense of ordering, but English sentences have words that are ordered specifically. And so we pass in some positional encoding to encode order. We then add the input to the positional encoding in order to get these positionally encoded vectors. Now it's from here that the multi-head attention unit is kicked off, where each vector is now split up into three vectors of query, key, and value. Each of these are 512 dimensional vectors. And so we're going to end up with three times the maximum sequence length, because it's basically three times the number of words that we saw here. We now split each of these query, key, and value vectors into eight parts. And each part is going to be a vector for one head. And there are eight heads according to the main paper. And each of these heads is then passed into one of eight attention blocks. And each of these attention blocks is basically going to just multiply the query and key vectors, and then apply some scaling as well as masking. Now, once we pass it through the attention unit, we will get attention matrices, which are going to be sequence length cross sequence length. That's the number of words cross the number of words, in order to see exactly how much attention each word should pay to the other. We then multiply it by every head's value matrix, which is essentially going to be the sequence length cross 64.
And so when we multiply these matrices together, we're going to get eight individual matrices, which are going to be the maximum sequence length cross 64 each. And then we just concatenate them, and since there are eight, it'll be maximum sequence length cross 64 times eight, which is 512. Now, if we compare this to the original transformer image, this matrix is essentially going to be the output right at this stage, just before the normalization. However, you'll notice that while this add and norm actually takes the input of this matrix, which is like right here, you can see there's a connection, but it also has this residual connection over here, where it not only takes this matrix, but also the matrix after the positional encoding. Now, these residual connections are actually done to ensure that there is a stronger information signal that flows through deep networks. And this is required because during back propagation, you'll notice that there will be vanishing gradients, which means that there is eventually a case where, as you keep back propagating, the gradient updates become zero and this model stops learning. So to prevent that, we kind of induce stronger signals from the input in different parts of the network. And so coming back to our figure, you'll see that we will add this out matrix along with x dash. And if you recall, this x dash is exactly what we saw before, where it is the output of adding the input with your positional encoding, that is, this matrix. And hence, you can see that we are adding two matrices and then normalizing: add and norm. So now that you know why we add and norm, let's actually get into the details of layer normalization itself. What and why layer normalization? Now to understand layer normalization, we break it apart into two parts: normalization, and then by layers. Normalization, what and why? Typically, the activations of these neurons will span a wide range of positive and negative values. Normalization encapsulates these values within a much smaller range, typically centered around zero. What this allows for is much more stable training, since during the back propagation phase, when we actually perform a gradient step, we are taking much more even steps, so it is now easier to learn. And hence, training is faster and more stable, getting to the optimal position, or these optimal parameter values, more consistently and quickly. Now layer normalization is the strategy in which we apply normalization to a neural network. In this case, we are going to ensure that the activation values of every neuron in every layer are normalized such that all the activation values in a layer will be centered around zero with maybe a standard deviation of one. To understand it in a little more detail, let's say that X, Y, Z and O are the activation vectors for every single one of these layers that we have here. Now, in typical neural network fashion, we would have some activation applied to the weights times that vector plus some bias. This is without any kind of normalization. But now, to add this normalization, we'll take exactly whatever the result is here, subtract the mean of the activation values of this layer, and divide by the standard deviation of the activation values of this layer. And we also have learnable parameters, gamma and beta. The same gamma one and beta one are used for this entire layer, and gamma two and beta two are for this layer.
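Written out, the per-layer computation just described looks roughly like this, with the mean and standard deviation taken over the activations of the layer (in practice a small epsilon is added to the denominator so we never divide by zero, which comes up again in the code shortly):

\[ \mu = \frac{1}{d}\sum_{i=1}^{d} a_i, \qquad \sigma = \sqrt{\frac{1}{d}\sum_{i=1}^{d}\left(a_i - \mu\right)^2}, \qquad \text{out} = \gamma \cdot \frac{a - \mu}{\sigma + \epsilon} + \beta \]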
Now gamma one and beta one and also gamma two and beta two are learnable parameters. So as we keep getting more and more inputs to this network over time and we perform back propagation steps, these values are going to be learned and changed in order to optimize the objective of the loss function. Now let's actually go into more detail by working out an example. Let's say that we have two input vectors. In this case, I've represented it as a matrix. So in this case, you can say it's like two words, and each word is represented by three dimensions. And let's say that we want to perform some normalization, specifically layer normalization, for this matrix. And to do this, we compute the mean and standard deviation across the layer. And so mu one one, which will be the first part of the mean matrix, is just going to be the mean of all the embedding values here. And you'll get 0.2. Similarly, for the mean of all of these values over here, you get 0.233. And now we can use these new values in order to compute the standard deviation. So we'll take the square root of the mean of the squared differences between every individual value of the embedding and the mean for that embedding. So in the first case, the mean was 0.2, hence I'm writing it here, along with each of the individual values 0.2, 0.1, and 0.3, which were the actual embedding values. Doing the math, we're going to get 0.0816 as the result. Similarly, for the next input word, we have the input 0.5, 0.1, 0.1. And we'll subtract the mean for that specific embedding. We'll square it, sum it, take an average, and then square root, and we'll get 0.1886. And so we have the matrices for the mean and also the standard deviation. Now what we can do is subtract the mean and divide by the standard deviation. Doing that math, we'll get this matrix over here. And the complete output is going to be the learnable parameter gamma multiplied by every single element in this y matrix, plus beta. Now what you'll notice, at least initially, is that if gamma is set to 1 and beta is set to 0, the output will be the same as y. And you'll notice that for every single one of these layers, the mean is 0 and the standard deviation is close to 1. Same with this: mean is 0, standard deviation close to 1. And so these values are just more tractable and it becomes much more stable during training. Now that we worked through an example, let's actually code layer normalization out. Now you'll notice here that I've added a batch dimension to the same exact input that we have. This is because during training, we typically would have a batch dimension so that it helps parallelize training, and training just becomes faster. And so we would reshape the input to be the number of words, which is 2, then the batch size, which I've taken as 1, and then the embedding size for each word, which is just 3. Because now we have this batch dimension, we're also going to use layer normalization not only just for the last layer, but for that last layer across the batch. In this case, it's going to be 1, so it's not going to make too much of a difference. But layer normalization is essentially going to be computed across the layer and also the batch, just for your reference. So in this case, that's kind of why we see 1 by 3 dimensional matrices. Otherwise, we would have just seen 3 dimensional vectors for gamma and beta. And we're going to initialize gamma to be just 1s, whereas the betas are just going to be a bunch of zeros.
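To double-check the arithmetic from that worked example, here's a quick sketch with the same toy matrix, leaving out the batch dimension that the actual walkthrough adds; this is my own verification, not the notebook's code.

```python
import torch

# Two toy "words", each represented by a 3-dimensional embedding.
inputs = torch.tensor([[0.2, 0.1, 0.3],
                       [0.5, 0.1, 0.1]])

# Mean and standard deviation across the embedding (layer) dimension.
mean = inputs.mean(dim=-1, keepdim=True)                        # [[0.2000], [0.2333]]
std = ((inputs - mean) ** 2).mean(dim=-1, keepdim=True).sqrt()  # [[0.0816], [0.1886]]

# Normalize, then scale and shift with gamma and beta; since gamma starts as ones
# and beta as zeros, the output is initially identical to the normalized values.
y = (inputs - mean) / std
gamma, beta = torch.ones(3), torch.zeros(3)
out = gamma * y + beta
print(mean, std, out, sep="\n")
```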
Now I'm basically computing the dimensions for which we want to compute layer normalization. That is the batch dimension as well as the embedding dimension. And those are the last two dimensions. And now we'll just take the mean across the batch dimension and the layer dimension, and we're going to end up with a 2 cross 1 cross 1 tensor. And specifically, we get the same values as before. We have 0.2, 0.233. We do the same kind of computation like we did before for computing the standard deviation. Now notice that we're adding some small epsilon value to this variance, and this is done to ensure that the standard deviation, which is going to be a denominator over here, doesn't become 0. And so when we actually do the inputs minus the mean divided by the standard deviation, we will get the same exact matrix that we kind of worked out by hand, which is great. And now we're going to multiply gamma, that is, a matrix of 1s, with all the values of y, and add zeros. You're kind of still going to get the same exact matrix. But this time, you'll notice this additional parameter over here, which means that it has learnable parameters. In this case, gamma and beta, which are going to be updated during the actual back propagation phase. Now everything that I've just discussed is going to be in this little class called layer normalization, where we have a constructor that takes a parameters shape. These are the dimensions along which we actually want to perform layer normalization. It could be just the last one, like we worked out by hand, or the last two that we just saw in the execution above. Or even more if we wanted to. It's just a generalized function. Now you can kind of execute this code by just initializing some inputs, where we see like some five cross three cross eight tensors. And then we'll perform some layer normalization here. Like I said before, we can either perform layer normalization for the last two dimensions, or just the last dimension itself. And you'll notice that we just get different values for each case. Now with this explanation of layer normalization, we've kind of explained a lot of these core components. And what I want to do now is take this architecture and blow it up and explain each and every single component in one detailed swoop. And to do that, I've drawn out this entire architecture; everything here that's going to be a part of the series is on GitHub. And so now we'll be walking through the encoder and the decoder part of the transformer with all of these wonderful details. On reading the Attention Is All You Need paper, which introduced the transformer architecture, it doesn't seem like it was written with the intent to be the foundation of BERT, GPT or language models moving forward. It kind of reads like a paper that focuses less on the architecture and more on accomplishing the specific task of language translation. And so we're going to also blow up and explain the same architecture using language translation from a language called English, which I'm assuming we all know, I guess. Or maybe not; no judgment if you don't. And then to a language called Kannada, which I'm assuming you don't know, because it's a very small South Indian language spoken in a very specific state in a very specific country in the world. But I'm from that region. So I thought I would just create a translator to do so. Let's now talk about the high level of the transformer architecture. The transformer architecture consists of two parts, an encoder and a decoder.
During training, the encoder will take in the English words of the sentence simultaneously and it will generate word vectors simultaneously. These word vectors are going to become context aware as we continue to train the transformer, and this is because of the attention mechanism, which we'll get to shortly. To the decoder, we're going to pass in our Kannada words simultaneously, but we'll also pass in a start token to indicate the start of the sentence and an end token to indicate the end of the sentence. We'll also pass in the English vectors that were generated from the encoder. For the labeled output of the decoder, we shift the translation to the left. This is because given the start token, we want to predict the first word of the sentence. Given the start token and the first word of the sentence, we want to predict the second word of the sentence. And in case you can see where this is going: given the start token and the first two words of the sentence, we want to predict the third word, and so on. To determine the model's loss, we compare the prediction made with the softmax against the labeled sentence here. A common loss function used for this kind of problem is the cross entropy loss. We compute the cross entropy loss for every predicted word and add these up to get a single loss. This loss is then back-propagated through the network to update the network's parameters. Now, during inference, let's say that we want to translate the English sentence "how are you" to Kannada. We don't have any contextual words, but we do have the English words. So we can generate the English word vectors from the encoder, but we only pass the start token of the Kannada translation to the input of the decoder. This will produce the first word of the Kannada translation, which is nivu — that's how you read this. We then take the start token and nivu to generate the next word of the sentence, which is heigididi. We pass in all of these generated words and continue generating words until we hit the end token, and this will complete the translation. Now that we took a look at how training and translation inference happen at a very high level, let's blow this architecture up and look at some details. Before starting this out, let's lay out some hyperparameters that we're going to be using throughout the conversation. So we're translating from English to Kannada. We'll assume that the batch size is 30. The batch size means that we are passing 30 sentences at once through the network in order to update the weights of the entire network once. The maximum number of words that a sentence can possibly have is 50, and this applies to both languages, to keep things simple and consistent in the diagram. I mentioned we have a batch size of 30, but why are we batching? This is for faster training. When training neural networks with gradient descent, we pass a single input, generate an output prediction, compare the prediction and the label, and quantify this as a loss. This loss is back-propagated to update the parameters of the network. And so for every training example passed into the network, the parameters are updated. This is fine for small networks since the updates are quick enough, but for networks like ours, which have millions of parameters, the updates can be pretty slow. So instead of updating the weights after every single sentence, we will update the weights after seeing a batch of sentence examples.
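As a quick aside on the loss mentioned above, before we continue with batching: here is a hedged sketch of computing cross entropy per predicted word and summing it into one loss. The vocabulary size, padding index, and shapes are made up purely for illustration:

import torch
from torch import nn

vocab_size, max_len, pad_id = 10, 5, 0
logits = torch.randn(max_len, vocab_size)                 # decoder predictions, one row per position
labels = torch.tensor([3, 7, 1, pad_id, pad_id])          # shifted-left target tokens
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id, reduction='sum')  # skip the padded positions
loss = loss_fn(logits, labels)                            # single scalar that gets back-propagated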
And hence, mini-batch gradient descent is common practice. In our specific example, we are passing 30 sentences at a time through the network before any update to our model weights happens. Now, why are we choosing 50 for the maximum number of words in a sentence? This is so we have a consistent shape for the tensors that pass through the model at any given time. There are many implementations that dynamically size the input, but we'll stick with the same sequence length for every single sentence in this video. Now, let's come back to the encoder. This is the start of the encoder, where we have every single word over here. However, computers do not understand words; they understand vectors and numbers. And so we'll be converting each of these words into an embedding. Specifically, embeddings are vectors, and they are 512 dimensional vectors in our case. So this cube structure over here is essentially a huge tensor that is going to be 30 cross 50 cross 512. That's 30 for the batch dimension, 50 for the number of words in the sentence, and 512 for each single word representation. We then add a positional encoding of the same shape. These positional encodings are generated from sine and cosine functions, and so they consist of numbers between negative one and positive one. They exist because the encoder takes in words simultaneously, but the ordering of these words actually matters: we have the first word followed by the second word followed by the third word and so on. This ordering, or positional encoding, is defined by a positional encoder. In the original paper, these are not learnable parameters, but they can be configured to be learnable if need be. There is no evidence at the time of making this video to suggest that using learnable parameters is better than just using the sine and cosine functions to generate positional embeddings. I have more information on this in another video on positional encodings with code. We then get this final tensor over here, which we pass through a feedforward network in order to get query vectors, key vectors, and value vectors. So every word is split up into three vectors: a query vector, a key vector, and a value vector. In past videos on this topic, I have discussed very specific differences between them, but this is honestly just for our understanding, and the difference isn't very explicit to the model itself. I've received a number of comments that dig a little too deep into the exact definitions of these three query, key, and value vectors, and we could construct some difference between them, but in practice they could quite literally be the exact same vectors for the encoder case. This is best seen in the encoder architecture, as the query, key, and values are merely the result of a feedforward transformation; they're really nothing special. What makes the query, key, and value vectors different are the operations that we will eventually perform on them. Since every word is represented by three vectors, every 512 dimensional vector is converted into 512 dimensions times three, which is 1,536. So every single one of these 1,536 dimensional vectors now constitutes a word. And the reason we break this up is to perform the multi-head attention. Zooming out a little bit over here, you'll see that there's this huge white stacked block. There are going to be eight of these blocks stacked on top of each other, and I do that for the eight heads in this multi-head attention.
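Before getting into that masked multi-head attention, here is a minimal sketch of the sine and cosine positional encodings mentioned above (the standard sinusoidal formulation; max_len and d_model are the hyperparameters we just laid out):

import torch

def positional_encoding(max_len=50, d_model=512):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # word positions 0..max_len-1
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even embedding indices
    denom = torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / denom)                          # values stay in [-1, 1]
    pe[:, 1::2] = torch.cos(pos / denom)
    return pe                                                     # added to the word embeddings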
This is masked multi-head self-attention. It is self-attention because we're trying to analyze the context and build context within the same sentence. So every word in the sentence is compared to every word in the same sentence, hence self-attention. It's multi-head because we have eight of these layers. And it's masked because of the padding mask we're going to be discussing over here: since there are many sentences that are not 50 words in length, we would be adding padding tokens, but these padding tokens should not be considered when computing a loss or performing back propagation, and so we want to effectively mask them out. The logic you see in this one grid is just for one attention head. And like this, you can imagine that there are eight parallel processes going on. So this 1,536-dimensional word vector for every single word is going to be broken up into eight pieces: the query part is broken up into eight pieces, the key part is broken up into eight pieces, and the value part is broken up into eight pieces. Each piece is essentially going to be a 64-dimensional vector if you do the math. So that's 64 dimensions of the query, 64 dimensions of the key, and 64 dimensions of the value, and we stack them in this way to get 64 times 3, which is a 192-dimensional vector representing every word for one head. What we do is take the query vectors and the key vectors and multiply them using these shapes. That's 30 cross 50 cross 64 and 30 cross 64 cross 50, and if you do a matrix multiplication, this will be 30 cross 50 cross 50. Specifically, we're doing the multiplication along these last two dimensions and not the batch dimension. The fact that we're comparing every word in the English sentence, which is the query vector, with every word in the same English sentence, which is the key vector, makes this self-attention. This is the beginnings of a first attention matrix. Typically after this, even before we add the padding mask, we might do some sort of scaling so that we prevent these values from the multiplication from exploding to very high or very low numbers, and thus also stabilize training. Scaling simply means dividing every value in our self-attention tensor by a constant value, and the value that is used in the main paper is the square root of the key dimension size for one head. Since our key is 64 dimensions for one head, we would divide each value by eight to scale. This will ensure the activation values are neither too large nor too small. Once we do that scaling, we're going to add a padding mask. A padding mask will prevent the padding tokens from propagating values. Technically, the mask entries should be zero for pass-through and negative infinity for masking. This is because we will eventually perform a softmax. The softmax operation uses exponents: the exponent of zero becomes one, which is a pass-through, and the exponent of negative infinity becomes zero, which will mask, or block, information from passing through. In practice, the way we accomplish this is that for every single padding value, we introduce a very low negative number, like negative 10 to the power of nine. It can't be negative infinity, because a future softmax operation might fail: if every value in a row is negative infinity and we apply a softmax, you'll get zero divided by zero, which leads to numerical instability.
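To make that masking-before-softmax idea concrete, here is a toy sketch; the scores and sentence length are made up:

import torch

scores = torch.tensor([[2.0, 1.0, 0.5, 0.3],
                       [1.0, 3.0, 0.2, 0.1]])          # scaled query-key scores for one head
pad_mask = torch.tensor([False, False, True, True])    # last two positions are padding tokens
masked = scores.masked_fill(pad_mask, -1e9)            # negative infinity in theory, -1e9 in practice
attn = torch.softmax(masked, dim=-1)                   # the padded columns end up with ~0 attention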
And hence, what we do instead, as in that sketch, is apply a very low negative number, so that there's going to be at least some form of pass-through that occurs even after doing all the math. But this pass-through is so negligible that it shouldn't really affect your training much. So once we apply the padding mask, we apply a softmax activation, and then we actually have an attention matrix. In this attention matrix, you'll see that it's 30 cross number of words cross number of words, and so you'll see a probability distribution in every single row over here. Each value in this matrix quantifies how much attention each word should pay to every other word. We then apply the value matrix that we computed way back in the beginning in order to get 30 cross 50 cross 64 dimensional value tensors. These value tensors will be very contextually aware. And this is the output of just one attention head. But recall that we have eight of these attention heads, so if we concatenate them one after the other, we're going to get 64 times eight, which is 512. This concatenated tensor is now a much more contextually aware tensor, at least compared to the input of this entire multi-head attention unit. You'll see here now that we're adding a residual tensor. When performing back propagation, the loss value propagates in the backward direction to change the weights. You'll notice the largest changes happen towards the end of the network, since it's closest to the loss signal. But as propagation goes on in the backward direction, the change in parameters continues to get smaller. This is okay for shallower networks, but as networks get very deep, we might run into situations where the loss is not able to propagate to all parts of the network, and so the parameters don't change as the gradients vanish, and no learning happens. Skip connections, or residual connections, help by enabling better propagation of the inputs in the forward direction and the loss in the backward direction. We see this a lot in research involving deep convolutional neural networks, and they're used here too. So once we add a residual tensor, we get this new tensor that will hopefully carry the activations forward and also make sure that the weights actually update. We then perform layer normalization. The goal of normalization is to ensure stable training, so that the activations during the forward phase and the gradient updates during the back propagation phase are not too large in magnitude. Mathematically, normalization involves subtracting the mean from the values and dividing by a standard deviation. In batch normalization, we normalize the values across the batch, that is, the 30 dimension. But in layer normalization, we normalize the values across the feature layer, that is, the 512 dimension. So for layer normalization, we subtract the layer mean from each value of the tensor and divide by the layer standard deviation. In practice, we add a very small epsilon term to the standard deviation, on the order of 10 to the negative 5, in order to prevent division by zero from occurring. Additionally, we also optionally use learnable parameters called gamma and beta to track changes across different training examples. I've explained the process with more details and with examples and code in a video on just layer normalization too.
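For reference, here is a self-contained toy sketch of that add-and-norm pattern; a plain linear layer stands in for the attention block, and PyTorch's built-in LayerNorm stands in for the custom class:

import torch
from torch import nn

d_model = 512
x = torch.randn(30, 50, d_model)             # batch x max words x embedding size
sublayer = nn.Linear(d_model, d_model)       # stand-in for multi-head attention or the FFN
norm = nn.LayerNorm(d_model)                 # normalizes across the 512 feature dimension
out = norm(sublayer(x) + x)                  # residual (skip) connection, then layer norm; shape is unchanged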
We now take the output tensor and pass it through a feed forward layer, which expands the dimension and then brings it back down, just to capture some more information. Then we'll perform the same addition of the residual tensor and layer normalization in order to finally get, for every single word, a 512 dimensional tensor, and each of these tensors is going to be very contextually aware. This here is the output of the entire encoder architecture. Now, once we have these very contextually aware vectors, we're going to pass them through a feed forward network, and I'll explain exactly why we have key and value vectors here. But before I do this, let's go back to the decoder architecture and explore the beginning of it. With the decoder, we want to generate the output translation. For this, we'll pass in the input sentences with a start token, followed by the sentence values, followed by the end token, and then a bunch of padding tokens until we hit 50 words. In much the same way as before, we add a positional encoding, and once we do, we pass it through a feed forward network to get query, key, and value vectors and perform masked self-attention as we did in the English case. So that's masked multi-head self-attention here, with the same dimensions. We'll take the query, key, and value, split them up into eight different heads, and multiply the query and the key tensors. We essentially have an attention matrix over here, which we will typically scale before adding a mask. However, in this case, we not only need to add a padding mask, but we also need to add a look-ahead mask. A look-ahead mask will ensure that the decoder is not cheating, because during the generation phase, that is, the inference phase, when we actually want to start translating sentences, we don't have access to future Kannada words, the future target-language words. And so during training, we need to apply a mask to ensure that it does not look ahead of its current position. That means the third word in the sentence cannot look at the fourth word to see what it can attend to, and so we cannot derive any contextual information from it during the training phase. So we add the look-ahead mask along with the padding mask. We then apply softmax to get attention values, probability-distribution-like values for every single word describing how much it needs to pay attention to every other word in that sentence. Then, multiplying with this value matrix, we'll get value tensors, which we concatenate across all of the eight heads to get the final concatenated tensor. We'll then add a residual tensor so that we can ensure that information is going to propagate, and we apply layer normalization. Now we get this batch over here, which I'll just call Q. This is going to be the set of query vectors that we're going to input to our masked multi-head cross-attention layer. In cross attention, instead of relating every word in a sentence to every word in the same sentence, we relate every word in this target Kannada sentence to every source word in the English sentence. The query represents, essentially, "what am I looking for?", and that's what we want to output. That's why the Kannada words will be the query tensors.
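Circling back to that look-ahead mask for a moment, here is a minimal sketch of it (a causal mask: position i may only attend to positions up to i); the sequence length is made up:

import torch

seq_len = 5
look_ahead = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
# row 0: [0, -inf, -inf, -inf, -inf]  -> the first word attends only to itself
# row 4: [0,    0,    0,    0,    0]  -> the last word attends to the whole prefix
scores = torch.randn(seq_len, seq_len) + look_ahead    # added to the scaled scores before the softmax
attn = torch.softmax(scores, dim=-1)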
Now you can see that there's an arrow coming in from the encoder piece. We take those concatenated encoder tensors, which are very high quality, and put them through a feed forward network that maps 512 dimensions to 1,024 dimensions, so for every word we get 512 dimensions for the key tensor and 512 for the value tensor. This will add the English information that we have encoded into the Kannada vectors, so that when we're generating the Kannada vectors, we generate an appropriate Kannada translation based on what the English sentence has to say. So we have the query from the Kannada sentence, and we have the key and value vectors from the English sentence, and then we perform multi-head cross attention, which looks a lot like the self-attention case, except that the source of the query is different from the source of the keys and values. We multiply the query and key to get an attention matrix, which we will scale, as I mentioned before, to help numerical stability. In this case, we only need a padding mask. This is because every single Kannada word is allowed to see the entire English sentence: during the translation phase, we already have all of the English words in the encoder, since that's what we need to translate. So we just need to add a padding mask to zero out any information from excess padding tokens. We then perform a softmax to get probability distribution values of how much attention each Kannada word should pay to the English context. Eventually we get similar value tensors, which we concatenate across the eight heads, and this will lead to a 512 dimensional vector for every single Kannada word. In this case, every Kannada word will now have some English context embedded in it as information. We'll add a residual tensor to ensure that we have extra propagated information throughout the network, because it's a very deep network. Then, after performing some layer normalization for stabilizing the values and gradients, and a feedforward layer, we will end up with a 512 dimensional vector that is very context aware. This entire decoder layer, and also the encoder layer, can be repeated; I put a times n up here and also a times n over here. So essentially, after getting these tensors, the control will basically pass back here, and since it has the same tensor shape, we can perform this decoding phase again and again, as many times as we need. Typically it's done a few times — even once is fine — but it depends on your problem and on how many patterns are required to actually learn this translation. But essentially, once you have your final vectors, that is, your final Kannada vectors that have English context embedded in them, you can pass them into a feedforward layer in order to expand them to the Kannada vocabulary size. Note that the Kannada vocabulary is the set of possible words that our model can see and predict. In this architecture, we made it predict words, so the dictionary might be large, on the order of tens of thousands. However, the sentence sizes are small: you'll see the English sentence "how are you" is just size three, and the Kannada translation for this is just of size two.
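Before moving on to tokenization, here is a toy sketch of that cross-attention wiring, using PyTorch's built-in attention purely as a stand-in for the notebook's own class: queries come from the Kannada (decoder) side, keys and values come from the English (encoder) side.

import torch
from torch import nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
kannada_q = torch.randn(30, 50, d_model)    # decoder tensor after masked self-attention
english_kv = torch.randn(30, 50, d_model)   # encoder output
out, weights = cross_attn(kannada_q, english_kv, english_kv)   # (query, key, value)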
Many implementations use character tokenization or even byte pair encoding to get a good balance of vocabulary size and maximum sequence length. So in this case, for every single batch and for every single word, we are going to have a vector that is the size of the Kannada dictionary. This is so that we can make it interpretable to us humans: once we apply a softmax on it, we'll get a probability distribution across all Kannada words, and we can just take the most probable word as the prediction. This is going to be compared to the labels. This here is the label, which is what it should be for every single line. So the first line should be encoded as the first word, which is nivu. The second line over here should be encoded as the second word, which is heigididi. And this third line should be encoded as the end token, to represent the end of the sentence. It's based on these three tokens, the first three ones, that we will be computing a cross entropy loss. And then we'll be back propagating throughout the network. And when I say throughout the network, we're basically going to be back propagating through the encoder and decoder as well, because the encoder also has a connection over here. So all of the weights of the entire network will be updated once it sees one batch of 30 sentences. And so that's the entire transformer neural network architecture, and I hope you got the entire context here. We can go through the inference part too. The English sentence is passed through the encoder as we would during training, to generate the context aware vectors. To the decoder, we pass the start token followed by the padding tokens, and we ensure the padding tokens are all masked in the multi-head self-attention and the multi-head cross-attention. We'll get a probability distribution for the next words, but we just need the first row's probability distribution in order to get the next word. From this probability list, we can take the word corresponding to the largest probability, or even perform some form of sampling to generate the next word. And fun fact, the ChatGPT model makes use of this sampling technique, so the same words aren't always generated verbatim; it adds a more human element to it. Once the first word of the Kannada translation is generated, we pass this word along with the previous start token as input. We update the padding and masks for the self-attention and cross-attention components and generate the second word. And we repeat this process until the end token is generated, to get the complete Kannada translation. Now that we've taken a look at this complete architecture in all of its glory, we're actually going to code out each and every single one of these components in Python. And to do so, let's start with the transformer encoder. To help me explain all of this alongside the code, I also have a runnable Google Colab notebook, which I've executed on some sample input, just to get an idea of which layers are run and how large the shapes are for every tensor as they pass through them. But don't worry, I'll be explaining these bit by bit. So in order to get started, let's just start with some basic parameters that we're seeing here. d_model is going to be the size of every single vector throughout the encoder architecture. And what I mean by that is, for example, in this transformer architecture, we have "my name is Ajay".
Let's say that we want to translate from English to some language, say French. We'll be passing all of these words simultaneously through the transformer encoder, and eventually we're going to get different vectors here for every single word. More technically, this would be a word piece, or it could be characters, but I'm just saying words as an example here. Now, the size, the number of numbers in every single one of these vectors, is going to be 512. Not only that, but if we blow up this architecture, we'll see that, say we have "my name is Ajay", each of these words is represented by a vector, which will inevitably become 512 dimensions. And we see this throughout: whatever we do, we'll eventually add something called positional encodings, which will also be 512 dimensions, so for a sentence of a few words this is going to be a few times 512. So every vector that you'll see throughout the encoder architecture will be somewhat related to 512 dimensions, and this is the reason why I've defined it as a parameter here. Next is the number of heads. The number of heads is specifically used in the concept of multi-headed attention. Now, in the transformer neural network architecture, we see these multi-headed attention units. Essentially, when we're performing the concept of attention, we are actually going to perform it eight times in parallel. This can be represented in this figure over here, where you'll see that we construct, let's say, query vectors, key vectors, and value vectors, and then there are going to be eight operations that we perform simultaneously, where we take the softmax and perform all of that good stuff with respect to attention. So consider this as the number of parallelized attention operations that we perform within the encoder. Next is drop_prob. Throughout this encoder, we're going to be performing something called dropout, where we randomly turn off certain neurons. What this does is force the neural network to learn along different paths, and thus make weight updates accordingly. This helps the neural network generalize better to data instead of accidentally memorizing specific data. Effectively, this acts as a regularizer, if you've heard of the concept of regularization for neural networks, and it's pretty useful when it comes to very deep networks with a lot of connections and parameters. Now, I've set this probability to 0.1, which basically means that there is a 10% chance that a given neuron will be turned off at a given stage. We can adjust this value to be anywhere between zero and one. Next we have batch size, which I've just set to 30. When we're dealing with neural networks and we want to perform some training, typically we would pass in multiple examples at the same time. Those multiple examples constitute a batch. And there are actually a couple of reasons why we do this: first, it's faster training, and second, it's also more stable training. Now let's take a look at this diagram over here, where the contours of a loss function are shown as these concentric circles, and we want to get down to the red point. You can imagine these outer contours are higher, and we want to get to the lower point.
Now, if we looked at one example at a time, passed it through the network, and performed back propagation with a weight update, then that would just be one update, and we would need to do this for every single example. That's where you get this purple curve, which is stochastic gradient descent: some examples may be good and lead you toward the minimum, but the next example may be bad and lead you away from it. It becomes a very noisy gradient update, and we need a lot of updates in order to converge to the eventual red point. On the other hand, if we had batched all of our data together, put it through the network, looked at all of the examples first, and only then performed back propagation, then we would have gotten this blue curve right here. But that could be a lot of data to process at once. So typically a good middle ground for many of these machine learning and deep learning problems is to use mini-batches: somewhere in the middle, where we batch some arbitrary number of examples together in order to learn effectively, quickly, and in a way that's more stable, so that as we see more examples, the loss on average keeps decreasing. So in this case, going back to our screen, we are saying that we are going to look at 30 examples, let's say 30 sentences in English, and it's only then that everything propagates through the entire network, that is, through the encoder and the decoder. Then we have a loss function, which is going to be computed, the gradients are going to be calculated in the reverse direction, and we'll have gradient updates. All the parameters are updated only after seeing 30 examples, and so it's mini-batch gradient descent that we're going to be performing. Next is max sequence length, which is the largest number of tokens, in our case words, that we can pass at a time through the encoder. Let's just say that we're translating from English to French. In reality, this is always going to be the number of words that we pass into the encoder, because if we look at the blown up architecture — if I just go all the way to the beginning over here — you'll notice that if we pass in "my name is Ajay" as the input English sentence to the transformer encoder, we're also going to be passing a lot of padding tokens. Let's say the maximum sequence length is 200 words and the sentence is only four words; then there will be 196 padding tokens that are just going to be passed in. This is always going to be the case: we'll have a sentence, and then we add padding tokens until the maximum sequence length is reached. So there's always a fixed length input for any sentence that we decide to input to our transformer encoder. Next is ffn_hidden. When we look at the architecture here, we have a feed forward network. While in most places throughout the entire encoder, and even through almost the entire decoder, we have 512 dimensional vectors like I mentioned before, it's only at this step over here that I'm going to be expanding the number of neurons to 2048 before eventually bringing it back down to 512.
And this is simply to learn additional information if and while we can, like any other feed forward layer is designed to do. Now, num_layers over here is the number of transformer encoder units that we want to include in our architecture. If you look here at our diagram, you'll see that there's this "n times" marker. This n is the number of transformer layers, because these are typically repeated. So we have inputs going in, we pass them through an encoder layer, then through another encoder layer, and then another, as many times as we want. And then the result is passed into the decoder, which is also repeated, in this case n times. So we will repeat the decoder, in this case five times, before eventually coming out with an output. In our case, I've defined n, the number of layers, as five. This can vary depending on complexity: you can change this to a higher number if you have a lot more data, and it can then pick up more complex patterns; otherwise you can also keep it pretty low. Now we'll take all of these values and create an encoder object by passing them into the class defined as Encoder, which I will scroll up to. So let me just take a look now at that same code over here, because I think it's just easier and cleaner. We have this class called Encoder, and in this case we have a constructor and a forward method. Now, every class that we consider here as part of a network typically derives from Module; Module is the superclass, the class from which it inherits. The reason why we inherit from this Module class is that it allows us to perform many operations behind the scenes that are required for learning. For example, it provides its own forward method, and it also helps us move our tensors around: we might need to put tensors on the CPU if we're doing inference, or put them on CUDA if we wanted to do some model training. It also helps us access parameters a lot better, where we can get our parameters and even modify them for initialization purposes. So Module basically provides a lot of that bootstrap code for us that we don't really need to worry about. It also helps if we want to save checkpoints: say we've trained for a hundred iterations and we want to save the model — defining modules, or extending classes from Module, makes it very easy to save that state. For those reasons, we want to use modules, and so we extend Module here. And because we extend Module, we want to override the forward method. Now, in our constructor, right here, I've defined this Sequential. This Sequential unit is essentially going to take all the comma separated values here and execute them in the forward pass one at a time, one after the other. The way I've defined this is: the encoder contains multiple encoder layers. In this case, it contains five encoder layers, because num_layers here is going to be five. And so I will call EncoderLayer, which I will explain very soon.
And I'm going to instantiate this five times. Here you'll see I have a star in front of a Python list, which will basically take the list and unpack it into its five component elements, and then we're passing those into Sequential. So now we have a sequence of five encoder layer objects. Next, we are overriding the forward method. When we override a module's forward method, what happens is that it will take an input and propagate that input during the forward propagation step, where we pass it into our layers, which we've defined here. So essentially it's passing the input through all of the five encoder layers, and then we'll have an output over here, which I've just assigned back to x, and we are returning that value. That will be the overall output of the entire transformer encoder. So I hope you now understand what's going on at a high level. We can now dig deeper and go into the lower level layers. For example, what is this transformer encoder layer that we see here? To see that, we'll scroll up. In our encoder layer, we again have a constructor and a forward method, which we use for the forward pass. In the constructor, we start out, as with every constructor, by calling the superclass's constructor, so that it sets up everything in Module; all the Module components are now set up. Now, multi-head attention is going to be performing the multi-head self attention of the encoder. We then have layer normalization, which I will explain shortly. We have dropout, then the feed forward layer, then one more layer normalization, which I have defined individually, and then another dropout. I've arranged this forward pass, and all of these components, in a way that is very similar to looking at this diagram, so you can just see this rectangle is the encoder layer. If this input is x, you see that first we take a residual value — we just save it, because we need it later. We then pass x into the multi-head attention, followed by add and normalization. So we take the residual x, then we pass x into the multi-head attention. In this case, I'm passing a value called mask to the attention layer and setting it to None. This is because in self attention for the encoder specifically, all the words in the English sentence pay attention to all of the other words. Since all of these words are passed into the encoder simultaneously, we don't need to pass in any masking. There's no need to say, oh, we can't look at some words in the future, because we already have all the words of our English sentence if we want to translate it to some other language. So I'll explain this in a bit, but for now the mask is just set to None, because we don't really need a mask here — it's optional. Then we perform dropout, which randomly turns off neurons. We then add the residual connection to this current value and perform layer normalization; this is the equivalent of the add and norm that we see over here. We'll then take this output, whatever this is, x again, pass it through a feed forward layer, and do another add and normalization. So we take the residual, pass it through the feed forward layer, do some dropout just to randomly turn off some neurons, and then we perform add and norm. Eventually we get the output of our encoder layer.
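Putting that together, here is a hedged, self-contained sketch of an encoder layer and the stacked encoder just described. PyTorch's built-in multi-head attention and layer norm are used purely as stand-ins for the custom classes in the notebook, so the details differ from the actual code:

import torch
from torch import nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, ffn_hidden=2048, num_heads=8, drop_prob=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(drop_prob)
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_hidden), nn.ReLU(),
                                 nn.Dropout(drop_prob), nn.Linear(ffn_hidden, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(drop_prob)

    def forward(self, x):
        residual = x
        attn_out, _ = self.attention(x, x, x)               # self-attention: query = key = value = x
        x = self.norm1(self.dropout1(attn_out) + residual)  # add & norm
        residual = x
        x = self.norm2(self.dropout2(self.ffn(x)) + residual)
        return x

class Encoder(nn.Module):
    def __init__(self, d_model=512, ffn_hidden=2048, num_heads=8, drop_prob=0.1, num_layers=5):
        super().__init__()
        self.layers = nn.Sequential(*[EncoderLayer(d_model, ffn_hidden, num_heads, drop_prob)
                                      for _ in range(num_layers)])

    def forward(self, x):
        return self.layers(x)      # shape in equals shape out: 30 cross 200 cross 512 stays 30 cross 200 cross 512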
And so I hope that looking at just this is a good representation of what this image here looks like. Now that we have a bit of higher level intuition about the encoder layer, let's actually pick these individual units apart and see what they're really doing, starting with the big multi-head attention class over here. In order to understand multi-head attention, I think it's actually best to start with the scaled dot-product operation, which is like single head attention, the core or crux of the attention mechanism. So let's get started with this. If you look at the original paper here, and I scroll down to this section, you can see that the core attention operation can be represented mathematically as Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V: we take the query matrix Q times the transposed key matrix K, scale it, perform a softmax, and then multiply by the value matrix V. When we talk about query, key, and value, essentially every single word is going to be broken down into three vectors: a query vector, a key vector, and a value vector. The query vector is "what am I looking for?". The key vector is "what do I have to offer?". And the value vector can be considered "what I actually offer" once attention is applied. I think the distinction between the key and the value seems a little fuzzy, but if we look at the code, it's going to be a lot more clear. Now, this here is the attention function, where it takes in a query, key, and value, and an optional mask, like I mentioned before. In this case, our query vector for one attention head is going to be 64 dimensions. I'm going to write this out in full when we do a pass just looking at the layer sizes themselves, but for now, just say that d_k is a constant with the value 64. We then perform the matrix multiplication of the query and the key. Now, this key vector is actually not just for one word; it's for all words, for this one head, so it's really a tensor: it's going to have a batch dimension, followed by the sequence length, and then some encoding dimension. And here we only need to perform the transpose by flipping the last two dimensions. To understand this operation, let's say that we have some tensor that's 30 cross 200 cross 512. In our case, that's the batch dimension cross the maximum sequence length cross the embedding dimension for every single word. If we just do a plain transpose of this, we'll notice that all of these shapes are completely reversed. But if we use transpose(-1, -2), it's only going to take the last two dimensions over here and flip them, and that's why we get a 30 cross 512 cross 200 instead of the original 30 cross 200 cross 512. That's exactly what's happening to the key vector over here. We now multiply both of these in order to get the attention matrix. So essentially, for your reference, this is going to be a max sequence length cross max sequence length tensor, along with an additional batch dimension. And we are then scaling this value by a constant, which is the square root of d_k. Now, the reason why we're doing this we can see in this Colab notebook. Let's just say we have a query, a key vector, and then the query times the key vector.
And for each, we're going to determine the variance of the values within them. For the query, the variance is something like 0.6, kind of close to one, and the key is also kind of close to one. The query times key, though, is well above one; that variance is pretty large. The mean is close to zero for all of them — just keep in mind this one is negative 0.04. Now if you scale these values, you'll notice that nothing changes for the query and key, but the query times key now also has a variance on the order of one, and its mean also moves closer to zero. So what scaling allows us to do over here is ensure that the values within these matrices have roughly mean zero and standard deviation one, and this allows for easier and more stable training. What I mean by this is: let's say that we perform back propagation. We did a forward propagation, we have some values, and we now do some back propagation. If there are extremely large values here, that can affect the size of our gradient step. So what we want to do is make sure these values are as nominal and normalized as possible, so that we take stable steps during training, and we won't have obscenely large or obscenely small values propagated throughout the network that affect these gradients negatively. So it's just stable learning. Next, we have an optional mask, which we add to the scaled tensor. Note that when we're passing a mask, we don't need it for the encoder. But in the case of a decoder, there are situations where we pass the input in simultaneously, but technically we don't know what the output for the next word is going to be, because we're generating them one at a time. In order to deal with this, we'll create a mask that looks like this. Let's say this is the self-attention for the output of a French sentence that only has four words. In training, we have access to all of the data, so we can see everything, but we shouldn't be using all of that data, because that's considered cheating: during inference, we don't know what the next word is going to be. When I'm generating the first word, I only know about that one; when I'm generating the second word, I only know about the first two. That's why you can see that for the first word, I can only perform attention over that first word; for the second word, I can pay attention to the first two words; then the first three words for the third word, and the first four words for the fourth word. This is going to be more relevant when I talk about the decoder in a separate video, but for this video, you don't need to worry about this mask at all; it's an optional mask if we wanted to pass it through the encoder. Next, we are going to perform attention. When I say attention here, we take all of the values of this scaled tensor and then perform a softmax, so that we get probability values of how much we should focus.
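Pulling those steps into one place, here is a hedged sketch of that single-head scaled dot-product attention function; the shapes are ... cross sequence length cross head dimension, and the mask is optional:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product(q, k, v, mask=None):
    d_k = q.size(-1)                                               # e.g. 64 for one head
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k) # scale to stabilize values
    if mask is not None:
        scaled = scaled + mask                                     # broadcasts the padding / look-ahead mask
    attention = F.softmax(scaled, dim=-1)                          # every row sums to 1
    values = torch.matmul(attention, v)                            # context-aware value vectors
    return values, attention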
And the way that looks here is that we'll initially have, let's say, this matrix right here. This is the scaled matrix with the mask applied, in this case. If we apply the softmax operation, you'll see that each of the rows will add up to one. So this is basically saying the first word should focus 100% on the first word; the second word should focus 51% on the first and 48% on the second; and so on for the third word and the fourth word. So now these are interpretable probabilities of how much we should be focusing on each word. Then we multiply all of these attention values, which are now probabilities, with the value matrices. This gives you a new set of tensors for every single word. For every input word we now have a new value vector, and that value vector will actually have all the information associated with context: it will know how much attention it needs to pay to all of the other words in that sentence. So you can consider these value vectors to just have more context awareness, and hence be of higher quality, than the inputs. Now, this here is the code for one attention head. But when we're actually executing the forward pass for multi-head attention, you can see that that only happens at one stage here, and we have to prepare the entire input in order to split it up into these multiple heads. So what are we doing, and how are we doing that? I'm just going to put some numbers here, because I think it's going to be useful. d_model is going to be 512 dimensions. The number of heads is going to be 8. The head dimension here is going to be 512 divided by 8, which is 64. Then we have the QKV layer, which is going to be a linear layer of d_model cross 3 times d_model. A linear layer is basically like a feed forward layer that is used for propagation. In theory — let's go back to our encoder architecture over here — we would take all of the input vectors and split each 512 dimensional vector up individually into query, key, and value vectors, and then perform their operations. However, in code, what we do is perform all these operations in parallel within the same tensor itself. So they're not three individual vectors; they're three stacked tensors. And so what's going to happen is that this layer is going to be 512 cross 3 times 512, which is 1,536. And then this other one is going to be just a feed forward layer, 512 cross 512. So when we have a forward pass, the input is going to be batch size cross sequence length cross d_model; in our case that's 30 cross 200 cross 512. We then pass it through the query key value layer, which is up here. So for every single word that we have, we're going to get three vectors of 512 dimensions each, and that'll be 30 cross 200 cross, instead of 3 times 512, 1,536. Then we're going to reshape it to be batch size, which is 30, cross sequence length, cross the number of heads, cross the remaining dimension per head. We want to break out this last dimension into these parts, because right now we have the query, key, and value stacked together, but we want to break them up into eight heads. The number of heads is eight, and per head the last dimension is three times the head dimension.
The head dimension is going to be 64, so it's three times 64 for each word's query, key, and value, which is 192. We've now broken up the query, key, and value into eight heads. Next, I'm going to switch around the dimensions over here, so it becomes 30 cross 8 cross 200 cross 192; that's how it's going to look, because we're swapping these two dimensions here. Now, what this chunk will do is break this entire tensor into three parts, and the way it breaks it up is along the last dimension. So we'll now have Q, K, and V tensors, where each is 30 cross 8 cross 200 cross, instead of 192, 192 divided by 3, which is 64. That's Q, K, and V; each of them has this shape. Now let's take these values, and if we pass them into our scaled dot-product attention, let's revisit this just to see what we're actually looking at. So Q, K, and V are each going to be 30 cross 8 cross 200 cross 64; let's just copy and paste that right over here. What we're doing here is getting the shape of Q and taking the last part of that shape, which is 64, so d_k is 64. When we scale, this is basically going to take the square root of 64, which is 8, and we perform a matrix multiplication of these two operands here. When we perform the matrix multiplication, it's going to be 30 cross 8 cross 200 cross 64, and then we're transposing just the last two dimensions of the other operand, so it'll be 64 cross 200, and so we are going to end up with a 30 cross 8 cross 200 cross 200 dimensional tensor. This is the batch size, the number of heads, and then our little precursor to the self attention matrix, because it's the max sequence length cross the max sequence length. So this is the precursor to our self attention matrix. Now, for this mask here, I've written "broadcasting" next to the add, because in PyTorch you actually don't need the exact same dimensions when you're adding tensors. For example, this mask is probably going to be just a 200 cross 200, but because we're adding in this way, PyTorch is pretty smart in that it's going to add this 200 cross 200 matrix to every batch and every head, so it'll apply it everywhere we need in parallel. Essentially, we're going to end up after this with the same shape of matrix: even if we were to do some masking, we'll still end up with the same shape.
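Here is a runnable toy version of those reshape, permute, and chunk steps with the running sizes (batch 30, sequence length 200, d_model 512, 8 heads, head dimension 64):

import torch
from torch import nn

batch_size, seq_len, d_model, num_heads = 30, 200, 512, 8
head_dim = d_model // num_heads                                   # 64
x = torch.randn(batch_size, seq_len, d_model)
qkv_layer = nn.Linear(d_model, 3 * d_model)                       # 512 -> 1536
qkv = qkv_layer(x)                                                # 30 x 200 x 1536
qkv = qkv.reshape(batch_size, seq_len, num_heads, 3 * head_dim)   # 30 x 200 x 8 x 192
qkv = qkv.permute(0, 2, 1, 3)                                     # 30 x 8 x 200 x 192
q, k, v = qkv.chunk(3, dim=-1)                                    # each 30 x 8 x 200 x 64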
Now, the attention here is just applying a softmax operation, and like we saw before, if we scroll to the self attention here, you see that this was the shape before, and after you apply attention, this is the shape after: nothing has changed, no shape has changed at least, only the values themselves. So we'll come back here, and it's essentially going to be the same shape. The values are going to be a matrix multiplication of the attention matrix with this value tensor. We know that the attention matrix is this shape and the value tensor is of this shape, so for every single batch and for every single head, we have an attention matrix that's 200 cross 200, and we multiply that with 200 cross 64, so we're going to get a 200 cross 64. What this means is that for every batch, for every head, for every word, we now have a 64 dimensional embedding that is the value tensor, and we're going to return both of these back. So let's now go back to the logic — let me copy this over — and now we have the attention and value matrices: attention is going to be this, and then the value is going to be essentially the same thing, but with 64, just for our reference. Now we're going to reshape the value tensor to just be this dimension, so let's see if the math works out here. The batch size is 30; the sequence length, which should actually say max sequence length over here, is going to be 200; and then the number of heads, which is 8, times the head dimension, which is 64, so 8 times 64, that's 512. If we compare to our old value tensor over here, that's 30 cross 8 cross 200 cross 64: we're keeping this 30, we're keeping this 200, but we're multiplying 8 times 64 to get 512. So a little bit of a rearrangement is going on, but essentially we now have the same shape as what we actually input to our multi-head attention function. Whatever shape goes in also comes out, and so this entire out vector over here is going to be just like this x, but way better in terms of contextual awareness. And it's because they have the same input and output shapes that we can cascade these layers one after the other as many times as we like without worrying about disrupting the code logic here. Multi-head attention is now going to return all the way back to the encoder layer right here, where the input — I should have mentioned this over here too — is going to be 30 cross 200 cross 512; and even after self-attention, we're going to have the same 30 cross 200 cross 512. For dropout, nothing changes: it's just going to randomly turn off neurons, and randomly turning off neurons doesn't affect the output shape. Then we're going to perform layer normalization, so let's actually get into layer normalization. As with any kind of module, layer normalization has a constructor as well as a forward method for forward propagation. Now, parameters shape will tell us along which dimensions we want to perform the layer normalization, and typically this is going to be your embedding dimension, so in this case it's going to be 512, and that will be the dimension itself. EPS is just a very small epsilon value, because we're going to perform some division over here: we have a standard deviation plus this epsilon, and this is going to be a denominator,
so in case the standard deviation becomes zero at some point, it prevents the infinite values that would otherwise occur because of division by zero. Gamma and beta: these are two learnable parameters that are going to be updated continuously as the network learns. In this case, they're both going to be 512 dimensional tensors, which we just define over here. This gamma is effectively going to represent something like the standard deviation of the values, whereas beta is going to represent the mean of the values, and both will be learned continuously as we feed more and more examples into the network. Now, when we take our forward pass, we'll provide some inputs. Recall that the inputs have a shape of batch size cross max sequence length cross 512, which is d_model. Now, this dims over here is just computed from the length of parameters shape; in this case it's going to be just negative one, which is basically saying this is the last dimension along which we want to perform layer normalization, and that is right: we want to perform layer normalization on the layer dimension. So what we're going to do is essentially take the mean of all the values: say we have a word vector of 512 dimensions, we'll take the mean of all of those values and get just one value, and that's exactly what we're doing in this step. Keep dimensions set to true means that instead of getting a 30 cross 200, we will get 30 cross 200 cross 1, because we want to keep all three dimensions here, and that's why we set it to true. Now we compute the variance, which is basically taking every input value, subtracting the mean for that layer, squaring it, and then taking the overall mean; that will be our variance here. This is also going to be 30 cross 200 cross 1. Our standard deviation is just going to be the square root of the variance, and hence is also going to be the same shape, because we're just scaling it. What we do next is take every single input in that 512 dimension, subtract the mean of that 512 dimensional vector, and divide by the standard deviation of that same vector. What this allows us to do is come up with an output that is again 30 cross 200 cross 512, but every single one of these word vectors is now layer normalized: we'll have values such that the mean is 0 and the variance is 1. This is the same concept as normalizing your data, and hence the name layer normalization, because we're normalizing data by the layer. Now, when we're applying layer normalization, this is only applied to a single sample, or at most a batch of samples, but we want to make sure that these numbers are applicable across the training set, and that's why we have the learnable parameters gamma and beta, which help us make sure that we are scaling these values y appropriately, so that the eventual output tensor that we get is comparable across every single example. And so when we apply, let's say, self dot gamma, and we basically multiply it with every single value of y, we are going to get the same 512 dimensional output. I can do that multiplication even though gamma is not the same shape as y (it's just 512), because it's effectively going to say that we have one value of
Because the shapes differ, this works through broadcasting, the same concept I mentioned way up earlier when discussing how we're able to add two vectors of different dimensions. So I hope this makes sense: whatever shape was input is the shape that's output, and we now end up with a layer-normalized result. Stepping back, we had added the residuals, which doesn't change the shape, and we performed layer normalization, which also doesn't change the shape. Now we take another residual connection. If we go back to the figure, we're at the stage where we take a residual connection, pass it through a feed-forward layer, and then add and normalize, and that's exactly what we're doing in this step. So we pass it through a feed-forward layer, the point-wise feed-forward, which takes in d_model and that hidden dimension of 2048. Let's look at it quickly: we have a constructor and a forward method. The first linear layer is 512 × 2048, the second linear layer is 2048 × 512, and we apply ReLU in between. ReLU is an activation function, and there are different kinds of activation functions; in general, activation functions help neural networks learn more complex patterns. Without one, the function would effectively look like a straight line passing through, and with straight lines you can't capture as much information. ReLU is an example of a piecewise linear function, meaning it's made up of multiple straight lines, so it's better able to capture information, and in its own way it also acts as a regularizer, since it turns off certain neurons' activations. Because this can lead to a problem, in some implementations you might see leaky ReLU adopted instead, which allows some information to pass through for negative inputs. We could also have used the tanh operation, which constrains values between negative one and positive one, or the sigmoid activation, which we typically use when interpreting probability values at the last layer of a neural network. Depending on your use case you'd pick one of these, and a commonly used one is ReLU, so I'm using it here. ReLU and dropout don't really change dimensions. So we have our input, which is again 30 × 200 × 512; we pass it through linear one, which expands the last dimension to 2048; we pass it through ReLU, which doesn't change the shape but does change the values since they're activated; then we perform dropout, which changes nothing; and then linear two brings it back to its original 512 dimensions.
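Here's a minimal sketch of that point-wise feed-forward block, assuming d_model = 512, a hidden size of 2048, and a dropout probability of 0.1 as discussed:

```python
from torch import nn

class PositionwiseFeedForward(nn.Module):
    """512 -> 2048 -> 512, with ReLU and dropout in between."""
    def __init__(self, d_model=512, hidden=2048, drop_prob=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, hidden)   # 512 -> 2048
        self.linear2 = nn.Linear(hidden, d_model)   # 2048 -> 512
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=drop_prob)

    def forward(self, x):            # x: 30 x 200 x 512
        x = self.linear1(x)          # 30 x 200 x 2048
        x = self.relu(x)             # shape unchanged, values activated
        x = self.dropout(x)          # shape unchanged
        x = self.linear2(x)          # back to 30 x 200 x 512
        return x
```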
So the input shape is the same as the output shape, and that's what we return back; that's our point-wise feed-forward layer. If we go back here, you'll see that what comes in also comes out: this FFN, or feed-forward network, returns the same dimensions, dropout (as I mentioned) doesn't change anything, and the layer normalization we did previously doesn't change anything either. So even for this entire encoder layer, whatever we input is what we output, at least in terms of shape. Because of all of these operations, though, like I mentioned before, the new output x is much more context aware than what was the input over here. Control then returns to self.layers, where we execute this same exact forward pass five times, because we have five layers; in each case the input shape is the output shape, so we can just keep passing the result back through. Once we finish the sequence, we end up with final vectors x that happen to be really good at encapsulating context — and when I say really good at encapsulating context, that's only after some training has happened; it just gets better and better as we look at more examples. All of the shapes that I've mentioned here are actually written out in a flow when I execute this on dummy data, so you can see control going from attention to dropout to layer normalization: we start out with the same 30 × 200 × 512 tensor, and even after that sea of operations we still end up with an output of 30 × 200 × 512. All right, so I hope we now have a good idea of how to code out the transformer encoder. As a quick refresher, let's say we're trying to build a translator from English to another language from the Indian subcontinent called Kannada. This is the encoder, this is the decoder; we pass in the English words of the sentence simultaneously and get word vectors simultaneously over here, and then the encoder phase is done. We take all of those vectors and pass them into the decoder's multi-head attention. We also have a start token that we pass into the decoder, and we generate the next word; that word is then passed back in at the beginning, and we generate one word of the decoder output at a time. So how does the decoder actually do this?
Well, we're going to talk about exactly that by blowing up this entire decoder architecture in code. First we define d_model, which is essentially the size of all the vectors internally within the entire transformer neural network, and I set it to 512, meaning each word is represented by a 512-dimensional vector. num_heads is the number of attention heads we want when performing multi-head self-attention as well as cross-attention. drop_prob is the parameter we use for dropout, which is the random turning off of neurons; this random switching off helps the network learn along different paths and generalize better. batch_size allows the computer to process multiple sentences together all at once, which enables faster training. With plain stochastic gradient descent updates we would pass one input to the network, generate an output, generate a loss, perform back propagation, and update millions of parameters — and that's after seeing only one example, which we'd have to keep repeating for every single example. With mini-batch gradient descent we only update per batch; I set it to 30, meaning only when we see 30 sentences, which we pass in all at once, do we perform an update. The updates therefore don't happen as frequently, and they're also less jagged, much more smooth. Next is max_sequence_length, the maximum number of tokens — in this case tokens are words — that a sentence can have. I've set this maximum to 200, so the maximum number of words in the English sentence we can pass in is 200, and the same goes for the output Kannada sentence; I've set it to the same value, but they could be different. ffn_hidden: throughout the entire network we're going to have linear layers, and whenever we do, I want the number of output neurons in the hidden layer of that specific feed-forward block to be 2048. That's just a hyperparameter I found in the paper "Attention Is All You Need", but you can essentially set it to whatever value you want; it's just for the propagation of information. Next is num_layers, the number of encoder layers and decoder layers that we want repeated. If we go to this diagram, this is the n, and I'm setting it to five for both. So essentially what's happening is that we pass some inputs to the encoder, but there's a cascade of these encoders one after the other — imagine five of them stacked on top of each other — so inputs go in, pass through all five encoders, and then the output goes to the decoder. Meanwhile we also have the Kannada-language output — in this case it'll be an input to the network during training — which goes through the decoder stack, passing through five decoder layers before it eventually generates some output probability and hence predicts the next word. Why do we have a number of cascaded layers here? To deal with the complexity that language has: the more complex the patterns in the language could be, the more intricate you'd want your architecture to be so that it can capture those patterns well, and that's why we use a multi-encoder-layer, multi-decoder-layer transformer neural network architecture here.
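Collected in one place, the hyperparameters just described might look like this (the values are the ones given in the walkthrough; drop_prob = 0.1 is the typical value mentioned later):

```python
# Hyperparameters for the decoder walkthrough
d_model = 512              # size of every vector inside the transformer
num_heads = 8              # attention heads for self- and cross-attention
drop_prob = 0.1            # dropout probability
batch_size = 30            # sentences processed together per parameter update
max_sequence_length = 200  # maximum number of tokens per sentence
ffn_hidden = 2048          # hidden units in the feed-forward layers
num_layers = 5             # number of stacked encoder and decoder layers
```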
Now, x here is one input that we feed to the decoder, but it comes from the English sentence: imagine that after we pass in the inputs and get positional embeddings, they go through the encoder, so every single vector is now context aware, and those vectors are what x constitutes; we'll be passing that in here later. I've also generated something similar called y, which is batch size × max sequence length × d_model. This likewise means every single word has been encoded into a 512-dimensional vector, but it's merely the input on the target side — y is simply this stage right over here. At this point I've just generated some random values, but they would actually come from some dataset that we'd have transformed; that's a little too much code for this video, but I will show it in subsequent videos for sure. For now, just to get used to the entire flow, I've introduced it here. The mask I'm creating over here is actually going to be a look-ahead mask. This is required because during training we pass all the words of the target sentence at the same time, but at generation time we don't have information about every word in that sentence, so we need to mask some inputs to prevent ourselves from cheating by looking ahead. If you're curious about how that mask looks, it's a 200 × 200 tensor where zeros mean "don't mask" and negative infinities mean "mask": the first word is allowed to see only itself, the second word is allowed to see itself and the word before it, the third word is only allowed to see what comes before it, and so on until the very end. This prevents cheating from happening, and hence prevents a situation where the decoder performs well during training but terribly at test time because of data leakage. We then pass all of these values in, calling the constructor of the Decoder class, which we'll take a look at, and then we call the model's forward pass. For the forward pass I've added a lot of print statements so you can see exactly what the dimension of every tensor is — I'll provide all this code in the description down below, but in this video I'm also going to explain why these dimensions are the way they are. I've pasted all of the decoder code right here in the Colab notebook, so this code over here and this code are the same, and I'll just be using whatever is in the decoder Colab notebook. Let's start with the Decoder class. I've defined it to have a constructor, which takes in all the parameters we just mentioned, as well as a forward pass. I'm extending torch.nn.Module, because Module is the PyTorch base class that provides a lot of the underlying boilerplate required for creating neural networks: it supports back propagation, performs memory management, manages whether tensors are supposed to be on a GPU during training versus on a normal CPU machine during testing, and it also facilitates model forward passes. Because of all of these nice little conveniences that torch provides, I use torch.nn.Module here.
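As a minimal sketch of the look-ahead mask just described (true negative infinity is used here for illustration; later in the video a very large negative number is substituted for numerical stability):

```python
import torch

max_sequence_length = 200

# 200 x 200: 0 means "allowed to attend", -inf means "masked"
mask = torch.full((max_sequence_length, max_sequence_length), float('-inf'))
mask = torch.triu(mask, diagonal=1)   # zeros on and below the diagonal, -inf above it

print(mask[:3, :3])
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]])
```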
Because we use nn.Module, we can also provide a forward pass and call the module directly as if it were a function; doing so invokes the forward method. When we call the forward pass we have the English-side input, the Kannada-side input, and the mask, so let's write out the shapes of these values so we can walk through the execution. x, which comes from the English sentence, is going to be 30 × 200 × 512 — batch size × max sequence length × d_model — and we can say the same for y: it's also 30 sentences × 200 × 512. The mask itself, at least for this simple case, is going to be 200 × 200, which is max sequence length × max sequence length. We're going to pass these into layers, and self.layers here is a sequence of decoder layers: if you look at this figure, this is one decoder layer, but it's repeated n times, so if this block lives under a class called DecoderLayer, repeating it n times is essentially what self.layers is. Typically you would use torch.nn.Sequential, which looks like this, but here you can see I'm using a SequentialDecoder. That's mostly because, when I want to call the forward pass, plain nn.Sequential won't let you pass in more than one parameter, and I have more than one parameter to pass — I need to pass in a mask, for example — so I've implemented my own SequentialDecoder, which extends Sequential. What it does is take the input, call every single decoder layer with all of these parameters, and get the new value of the Kannada outputs — the target-side representation — every time, which I keep feeding back into the next layer, while the same value of x, the encoder output for the English sentence, is passed in regardless of which decoder layer we're looking at. The SequentialDecoder thus effectively becomes an array of these decoder layers.
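As a rough sketch of that SequentialDecoder idea (the class name follows the walkthrough, but treat this as an illustration under those assumptions rather than the exact notebook code):

```python
from torch import nn

class SequentialDecoder(nn.Sequential):
    """nn.Sequential only forwards one argument, so this override passes
    x (encoder output), y (target so far) and the mask through every
    decoder layer, feeding the updated y back in each time."""
    def forward(self, *inputs):
        x, y, mask = inputs
        for module in self._modules.values():
            y = module(x, y, mask)   # x stays fixed; y is refined layer by layer
        return y

# assumed usage:
# self.layers = SequentialDecoder(*[DecoderLayer(...) for _ in range(num_layers)])
```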
Now let's actually take a look at what each decoder layer looks like. If you scroll up here, you can see DecoderLayer; it also extends torch.nn.Module, and hence it too has a constructor and a forward method that we can call directly. Let's walk through the forward pass, and I'll describe each of these pieces as and when they appear. First of all we assign the input to _y, and this is going to be used for residual addition. You can see in the paper that there are these arrows over here; they are essentially skip connections, or residual connections, and you typically see them in very deep neural networks. The deeper the network is, the smaller the gradients become as you back-propagate them, and eventually they might vanish toward the beginning of your network; if gradients vanish, your neural network will no longer learn. So in order to propagate a stronger signal, we use skip connections: they propagate the input signal much better in the forward direction, and they propagate the derivative of the loss with respect to the inputs very well in the backward direction too. So there's always going to be some gradient update that occurs, even for the earlier layers of the network, and that's why this architecture uses those skip connections. What we're going to do now is perform the masked self-attention, where we take the Kannada-side input along with the decoder mask, which is the same 200 × 200 mask that I showed before. This self-attention, as you can see over here, is multi-head attention, so let's go over to that module. In multi-head attention, the input — the Kannada-side tensor in this case — is going to be 30 × 200 × 512; that's batch size, max sequence length, and d_model. Now we're going to need query, key, and value vectors. The qkv_layer, as you can see here, is a linear layer that maps the 512 dimensions to three times 512 dimensions, which is 1536. Looking at the decoder architecture, it's like this: we have batch size × max sequence length × 512, and we create query, key, and value vectors for each and every single word, and since one word now becomes three vectors, that's why we see 1536. The query vector for a word asks "what am I looking for?", while the key and value vectors can be considered a form of memory, or "what do I have to offer?" — the information the word already carries. Coming back to the code, qkv is going to be 30 × 200 × 1536, since we've created query, key, and value tensors to be used for attention. We're also reshaping this because we want to perform multi-headed self-attention, distributing the computation across eight heads. Because of this reshape, I'm quite literally going to write it out as 30 × 200 × 8 (the number of heads we passed in) × 192, since 192 times eight is 1536. So we've created eight attention heads, and if we look at our decoder, the reason we do this is that these are essentially going to be eight parallel processes — there are eight rectangles over here, and each one of them performs its own attention — and in the end we're going to concatenate them together. Coming back again to the code, we then permute this 30 × 200 × 8 × 192 tensor, swapping the sequence and heads dimensions, so it becomes 30 × 8 × 200 × 192. Now we chunk this along the last dimension into three parts, so that's 30 × 8 × 200 × 64 each (that's 192 divided by 3): this is Q, and you can also say the key and value tensors will have the same shape. Now we perform scaled dot-product attention — this is the crux of the attention mechanism — and we pass each of these query, key, and value tensors along with the decoder mask.
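Here's a sketch of that multi-head self-attention, with the shapes from the walkthrough annotated (batch 30, sequence length 200, d_model 512, 8 heads of size 64). It relies on a scaled_dot_product helper, which is sketched in the next code block further down; treat both as illustrations of the idea rather than the exact notebook code.

```python
from torch import nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads              # 64
        self.qkv_layer = nn.Linear(d_model, 3 * d_model)  # 512 -> 1536
        self.linear_layer = nn.Linear(d_model, d_model)   # 512 -> 512

    def forward(self, x, mask=None):                      # x: 30 x 200 x 512
        batch_size, seq_len, d_model = x.size()
        qkv = self.qkv_layer(x)                                                     # 30 x 200 x 1536
        qkv = qkv.reshape(batch_size, seq_len, self.num_heads, 3 * self.head_dim)   # 30 x 200 x 8 x 192
        qkv = qkv.permute(0, 2, 1, 3)                                               # 30 x 8 x 200 x 192
        q, k, v = qkv.chunk(3, dim=-1)                                              # each 30 x 8 x 200 x 64
        values, attention = scaled_dot_product(q, k, v, mask)                       # values: 30 x 8 x 200 x 64
        values = values.permute(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)   # concat heads: 30 x 200 x 512
        return self.linear_layer(values)                                            # 30 x 200 x 512
```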
Let's go over to scaled dot-product attention and copy over these sizes, since we probably want to know what they are for everything over here — and keep in mind the mask is 200 × 200. d_k here is just a constant that we use in a division, and this is required to scale the matrix multiplication of Q times K. We require it because we want all of these values to have roughly mean zero and standard deviation one, and this scaling term helps us accomplish that. You can actually see this in action: I'm in my GitHub repository right now, where you can check out all of the code for this video as well as the past videos I've created on transformers. Say we want to perform this self-attention operation: we have a query tensor and a key tensor, which we matrix-multiply. The variance of the query and the key is close to one, but the variance of their product goes off the charts a little — it's five to six times greater. However, if we scale these values, the variance becomes much more tractable. When we perform the multiplication, we transpose the last and the penultimate dimensions of the key. That means if this is the key tensor, we transpose just those two dimensions; we can't simply do k.T, because that would transpose the entire thing into a 64 × 200 × 8 × 30 tensor, which is not what we need — we just want to swap the last two dimensions. So the matrix multiplication of query and key is 30 × 8 × (200 × 64 times 64 × 200), which gives a 30 × 8 × 200 × 200 tensor. It is then scaled, but that's just a division, so it doesn't change any shape. This is the initial form of how the attention matrix is going to look. Next, if the mask is not None — and in this case it is not None — we add the mask: we're adding a 200 × 200 mask to this tensor. Even though it's a different shape, PyTorch supports something called broadcasting; because the trailing dimensions match, it's still a valid addition, and it applies the same exact mask to every one of the 30 × 8 = 240 batch-and-head slices. So after the masking operation is done, this tensor is still 30 × 8 × 200 × 200. Just note that in this case it's a look-ahead mask, but we could also have added padding to the same mask, and we'd still have ended up with the same dimensional tensor once the entire operation was said and done. Next we perform softmax. Softmax is just taking exponentials and dividing by the sum of exponentials, which doesn't change the dimension of anything here; the attention matrix keeps the same shape.
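Here's a sketch of that scaled dot-product attention step, with the shapes from the walkthrough in the comments:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product(q, k, v, mask=None):
    """q, k, v: 30 x 8 x 200 x 64; mask (if given): 200 x 200, broadcast over batch and heads."""
    d_k = q.size(-1)                                                     # 64
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)       # 30 x 8 x 200 x 200
    if mask is not None:
        scaled = scaled + mask                                           # same mask added to every batch/head slice
    attention = F.softmax(scaled, dim=-1)                                # rows sum to 1; shape unchanged
    values = torch.matmul(attention, v)                                  # 30 x 8 x 200 x 64
    return values, attention
```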
The softmax is applied along the last dimension, so every single row now adds up to one. The new values are then a matrix multiplication of the attention matrix with the value tensor, which ends up with a shape of 30 × 8 × 200 × 64 — a 64-dimensional result per attention head — and we return both of these. In the self-attention case, the attention matrix tells us how much attention every single word pays to every other word, whereas the values are the actual context-aware output tensors: for every batch, for every head, for every word, we have a 64-dimensional vector that represents the context of that word. When this logic returns, we come back all the way to multi-head attention. We take the values tensor and reshape it so that the heads dimension and that 64 are flattened together, giving 30 × 200 (max sequence length) × 8 times 64, which is 512; effectively we're concatenating whatever we found from each and every one of the eight heads. This output then passes through a linear layer, which doesn't even change the dimensions — it's a feed-forward layer that maps a 512-dimensional vector to another 512-dimensional vector — so we have the same shape over here, and it's this that is passed out. Interestingly, because this multi-head attention block has the same input shape and output shape, it can also be a repeated layer without disturbing anything; you won't be getting jarring "shapes don't match" errors. So now that multi-head attention is complete, control transfers back: self-attention is done and the output has this shape. Next we perform dropout. Dropout, as I mentioned before, is the random turning off of neurons so that the neural network learns more general paths and hence doesn't memorize; it keeps the same shape, it's just an operation. Next is add and layer normalization: we add what we stored earlier in _y, which was the same shape, and then we normalize the sum with layer normalization. Let's go to norm1: norm1 is layer normalization, and we pass it a parameter shape of 512. Let's look at layer normalization again for a little bit. We have this dims value, which specifies along which dimensions we actually want to perform layer normalization. The entire goal of layer normalization here is to constrain the values of every single layer to roughly mean zero and standard deviation one, and this helps stabilize training, so the jumps that happen during every training and gradient-update step stay pretty stable and not too erratic. So this dims, I believe, is going to take on the value of just the last dimension.
So dims is just a list containing negative one; that's what dims is. Next we compute the mean along this last dimension. The inputs, just for clarity, are going to be of the shape 30 × 200 × 512 — that's exactly what we're passing in — so we're taking the mean of all of the elements of the 512-dimensional vector that represents a single word. We set keepdim=True because if you just take the mean and run this operation, it returns a 30 × 200 tensor, but we want a 30 × 200 × 1 tensor so that the tensor shapes match for the operations that come next. One step at a time: we take the inputs, subtract the single mean we computed for every word, square, and average to get the variance, which is again the same 30 × 200 × 1 shape; the standard deviation is simply the square root of that variance, which also doesn't change the shape. The output y doesn't really change shape at all either — the only difference is that y now holds much more normalized values, where the mean is zero and the standard deviation is one, which allows more stable training. But what you also see here are some gammas and betas. Gammas and betas are a set of learnable parameters — in this case there are going to be 512 of each, that's the parameter shape. All of the gammas are initialized to one and all the betas to zero, so there are essentially 512 gammas and 512 betas that we'll be learning over time. The goal of these learnable parameters is similar to batch normalization, where gammas and betas let the network learn a scale and shift that the mean-and-standard-deviation computation on its own would otherwise strip away. So even though we do this normalization along the 512 dimension, the learnable scale and shift help make all of these outputs comparable to each other; there is one gamma value and one beta value per feature dimension, shared across the whole batch and every word position. What we do is multiply gamma times y — that is, the value of gamma times every single element in y — and then add beta. Gamma essentially learns to behave like a standard deviation across multiple examples, and beta learns to behave like a mean across multiple examples, so that this output is better comparable across those examples. After performing this operation we still end up with the same shape: same input shape, same output shape. When we return control from layer normalization, we go back to the decoder layer, and the output of layer normalization is the same shape. We now perform the same kind of operation here, but instead of masked self-attention we perform cross-attention, which is slightly different. In this cross-attention case I'm not passing in any mask, and this is going to be the difference between self-attention and cross-attention.
Let me actually go over here so it becomes more apparent why we're not passing a mask for now. This is multi-headed cross-attention, and you'll see the code is very similar to what we saw for self-attention. The main difference is that instead of a single QKV layer, we have a Q layer that's separate from a KV layer. The Kannada words on the decoder side are going to form the query vectors, while every English word vector we get as the output of the encoder is converted into a key vector and a value vector. So for self-attention we created query, key, and value vectors all from the same Kannada words, but for cross-attention, if we go all the way here, the words coming from the English encoder — the set of English word vectors for the sentences in the encoder — are each encoded into a key vector and a value vector, while on the Kannada side we encode everything just into a query vector. In other words, while in self-attention every word is encoded into a key, a query, and a value, in cross-attention the query comes from the decoder and the key and the value come from the encoder, and then we perform very similar operations, just with no mask in this case. So coming back here, the input x is going to be the same dimension, 30 × 200 × 512, and y will also be that dimension. After passing x through the KV layer — every word is converted into a key and a value, but essentially they're stacked on top of each other in one linear layer, so we just convert 512 to 1024 dimensions — we get 30 × 200 × 1024. For the query side it's going to be 30 × 200 × 512. We now reshape so that we can perform multi-headed cross-attention: the key-and-value tensor becomes 30 × 200 × 8 × 128 (eight heads times two times the head dimension of 64), and we also reshape our query tensor so it's multi-headed too, 30 × 200 × 8 × 64 — times eight, but with just the head dimension itself, which is 64, rather than two times it. Now we permute the key-and-value tensor, switching the second and third positions, so that's 30 × 8 × 200 × 128, and the same goes for the query itself, which becomes 30 × 8 × 200 × 64. Then we chunk the key-and-value tensor into two parts along the last dimension, so this will be 64 for the key and the same for the value. So we have query, key, and value tensors of the same shape, and this is where we pass them all through scaled dot-product attention again. If we go back up here, we actually end up with the same shapes, 30 × 8 × 200 × 64 — they all have essentially the same shapes — except that in this case we don't have a mask.
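Here's a sketch of that cross-attention module under the same shape assumptions, with queries from the decoder side (y) and keys/values from the encoder output (x); it reuses the scaled_dot_product helper sketched earlier.

```python
from torch import nn

class MultiHeadCrossAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads             # 64
        self.kv_layer = nn.Linear(d_model, 2 * d_model)  # 512 -> 1024 (key and value stacked)
        self.q_layer = nn.Linear(d_model, d_model)       # 512 -> 512 (query)
        self.linear_layer = nn.Linear(d_model, d_model)

    def forward(self, x, y, mask=None):                  # x: encoder output, y: decoder side; both 30 x 200 x 512
        batch_size, seq_len, d_model = x.size()
        kv = self.kv_layer(x)                            # 30 x 200 x 1024
        q = self.q_layer(y)                              # 30 x 200 x 512
        kv = kv.reshape(batch_size, seq_len, self.num_heads, 2 * self.head_dim).permute(0, 2, 1, 3)  # 30 x 8 x 200 x 128
        q = q.reshape(batch_size, seq_len, self.num_heads, self.head_dim).permute(0, 2, 1, 3)        # 30 x 8 x 200 x 64
        k, v = kv.chunk(2, dim=-1)                                                                    # each 30 x 8 x 200 x 64
        values, attention = scaled_dot_product(q, k, v, mask)                                         # 30 x 8 x 200 x 64
        values = values.permute(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)                     # 30 x 200 x 512
        return self.linear_layer(values)
```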
But all the operation shapes remain exactly the same, and even when we don't execute the masking code, the scaled tensor still keeps its original shape anyway, so it wouldn't matter. What I hope is clear here is that for the cross-attention part of the decoder we don't actually need to execute the masking step at all — unless you also pass in a padding mask, so you can ignore the padding tokens, which is needed because some sentences won't be as long as the maximum sequence length of 200. If you don't do that, this section simply isn't executed, it gets skipped over, and the shapes are preserved, which is exactly what we want. So what you're going to get is this output values tensor, and we now concatenate all the heads together as we did previously: 8 times 64 is 512, so we just reshape to that, and this now passes through a linear layer, which doesn't resize anything — it's just there for additional mixing of information. So we now have an out tensor that is 30 × 200 × 512, which is exactly what we put in; that's multi-head cross-attention. Going back to the decoder layer, the output y here is still the same shape. Next we perform pretty much the same procedures as before: dropout, which doesn't change the shape and just randomly turns off neurons, and then addition with layer normalization, which again does not change the shape. Next we pass it through a feed-forward block. This feed-forward network, ffn, I've defined as the position-wise feed-forward, so let's go to it quickly — it's quite a simple module, so it shouldn't take too long. We take the input x, which is going to be this same shape, and pass it through a linear layer that maps it to the hidden dimension, which is 2048. Then we pass it through ReLU, which doesn't change the shape; it's an activation function that allows the network to better understand and model complexity in your data and in its patterns. ReLU does have some issues, like the dying ReLU problem: for any activation value lower than zero it will just completely shut the neuron off — that's the nature of the function — and so some other functions like ELU, GELU, and leaky ReLU are also used, which you can experiment with too; but for now we have ReLU, and it doesn't change the shape. Dropout doesn't change the shape either, and then linear two is just the inverse of the first linear layer, bringing us back to d_model, so let's write that out: 512 dimensions, and this is returned. So at the end of the day, when we run our position-wise feed-forward network, we end up with a 512-dimensional output here, the same as everything we've seen before. Next we perform dropout, which as we said changes no shape, then layer normalization, which changes no shape, and we return the value we get as the output of one decoder layer — and this output is exactly the same dimensions that we input to the layer itself.
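Putting the pieces of this section together, here's a sketch of one decoder layer's forward pass, reusing the MultiHeadAttention, MultiHeadCrossAttention, LayerNormalization, and PositionwiseFeedForward modules sketched earlier; the residual variables and call order follow the walkthrough, but the exact names are assumptions.

```python
from torch import nn

class DecoderLayer(nn.Module):
    """Masked self-attention, cross-attention and feed-forward,
    each followed by dropout and an add-and-norm step."""
    def __init__(self, d_model=512, ffn_hidden=2048, num_heads=8, drop_prob=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        self.cross_attention = MultiHeadCrossAttention(d_model, num_heads)
        self.ffn = PositionwiseFeedForward(d_model, ffn_hidden, drop_prob)
        self.norm1, self.norm2, self.norm3 = (LayerNormalization([d_model]) for _ in range(3))
        self.dropout1, self.dropout2, self.dropout3 = (nn.Dropout(drop_prob) for _ in range(3))

    def forward(self, x, y, decoder_mask):        # x: encoder output, y: decoder input, both 30 x 200 x 512
        _y = y                                    # residual for the first add-and-norm
        y = self.self_attention(y, mask=decoder_mask)
        y = self.norm1(self.dropout1(y) + _y)
        _y = y                                    # residual for the second add-and-norm
        y = self.cross_attention(x, y, mask=None) # no look-ahead mask needed for cross-attention
        y = self.norm2(self.dropout2(y) + _y)
        _y = y                                    # residual for the third add-and-norm
        y = self.ffn(y)
        y = self.norm3(self.dropout3(y) + _y)
        return y                                  # same shape as the input: 30 x 200 x 512
```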
We take this value — this is the decoder layer we went over here — and, to put it in a much more readable way, this is the output of one decoder layer. The module executes once, and we now have this y, which has the shape we just described. We then feed that same y back in, and since num_layers was defined as 5, this executes five times. Each time, this new tensor, now slightly more context aware of the Kannada output language, is passed back into the module, and we keep updating y, the target tensors for the Kannada words. Once all of these are complete, control passes back out here, and the output is the same shape, which is good: the input now matches the output. From this output — I haven't coded it out here — it's essentially going to be mapped through a final linear layer and a softmax, and from that we can actually compute what the next words could be. Everything I've discussed over here I've instrumented with so many print statements that you can see it all printed out; in the end you'll see that the output shape is exactly 30 × 200 × 512. In terms of the conceptual architecture, we're essentially at this stage over here: we have batch size × max sequence length × 512, which is exactly 30 × 200 × 512, and from that we can now map each position to a Kannada word that is human interpretable and perform a softmax to determine what word is supposed to come next. All right, so now that we have a good idea of how to code each of these components of the encoder and the decoder, I want to focus on the beginning over here: how are we going to take language and encode it into a format that computers can understand? Let's talk about that now. In this Google Colab notebook I have a couple of files, one consisting of the English sentences, which are the source, and one with the target Kannada sentences, which are the sentences we want to translate to. If you want to know how that information looks, it's like this: this is the Kannada file, with about four million records, I believe — that's four million sentences — and then we have a train file over here with the English translation for every single corresponding Kannada sentence. So this is "he is a scientist", and this is the Kannada translation of "he is a scientist", and we need to clean these files, process them, create a dataset, and then feed it to our transformer neural network. The source of this dataset is Samanantar; if you want, you can read the paper about what it contains, but essentially, for every English file there will be 11 parallel files for 11 Indian languages, and those languages are listed over here with their short codes. I'm using the KN file, which is the Kannada file, and you can see more details on the number of records, tokens, and so on; they also ran benchmark analyses, so I'd recommend reading the paper for more information. If you actually want to download this dataset, you can hop over to this link — all of these links will be in the description below — and you can just click on the downloads and then download everything.
I will give a big warning, though: there is no good way to download just a single part of these datasets. Like I mentioned, these files are pretty large — gigabytes of data; in fact I think you need about 20 gigabytes to download everything — and unfortunately you only have the option to download everything at once, not just Kannada–English or Kannada–Hindi, so I'd recommend making sure you have some space on your computer before you try to download these. Coming back to our file, we want to define a start token, a padding token, and an end token. The start token is going to begin a sentence, the end token ends a sentence, and a padding token is typically used because, even though sentences might have different numbers of words or characters, we want to ensure they are converted into numbers, vectors, and matrices of the same length, and hence we introduce padding tokens. Next I'm going to introduce a Kannada vocabulary, which is a list of all possible symbols that we can input to and output from the model, and the same is true for the English case: it's the set of characters that can be input and output to and from the model. Let's talk a little about these individual languages so we get a better understanding of what to expect moving forward for translation. For English we have a character set that consists of consonants and vowels, and we call this entire set an alphabet, as each character can represent an individual sound. Kannada is much the same way in that we have consonants and vowels: this first row is a set of vowels, and so is this third row, and then we also have a set of consonants. But even though we might call Kannada an alphabet when just speaking normally, it is technically more of an alpha-syllabary, where each character here represents a syllable; we can also combine them together — we can combine consonants and vowels to create a single unit of sound, for example combining a consonant with one of these vowels to form a unit like "ka". So Kannada is an alpha-syllabary, and like we did for "ka", we can do that for every other consonant-and-vowel pairing to get a set of these consonant-vowel units, which in Kannada is called the kagunita. Since English is more of a phonetic alphabet language while Kannada is more of an alpha-syllabary, you can see there could be some complications when translating from English to Kannada — they are very different styles of writing — which is fun to keep in mind. Now, for each of these lists of Kannada and English vocabulary we want to create an index: a dictionary that maps an integer to a character that you see up here, and another that maps a character back to an integer, and I'm just creating that index over here. Next we're going to read these files entirely from our Google Drive, and we're only going to pick out the top 100,000 sentences for now, so that training is faster and easier; we probably don't need four million, but if we do need more we can always pull more by increasing this total-sentences value. I'm also just going to get rid of the newline characters that are appended at the end of each sentence and then print them out for you.
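As a rough sketch of the special tokens and index dictionaries just described — the token spellings and the heavily truncated character lists below are placeholders for illustration, not the actual vocabularies used in the notebook:

```python
START_TOKEN = '<START>'
END_TOKEN = '<END>'
PADDING_TOKEN = '<PAD>'

# Truncated example vocabularies (the real lists contain every allowed character)
english_vocabulary = [START_TOKEN, ' ', 'a', 'b', 'c', 'd', 'e', PADDING_TOKEN, END_TOKEN]
kannada_vocabulary = [START_TOKEN, 'ಅ', 'ಆ', 'ಇ', 'ಕ', 'ಖ', 'ಗ', PADDING_TOKEN, END_TOKEN]

# Index dictionaries: integer -> character and character -> integer
index_to_kannada = {i: ch for i, ch in enumerate(kannada_vocabulary)}
kannada_to_index = {ch: i for i, ch in enumerate(kannada_vocabulary)}
index_to_english = {i: ch for i, ch in enumerate(english_vocabulary)}
english_to_index = {ch: i for i, ch in enumerate(english_vocabulary)}
```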
You'll see that the English sentences look like this, and their Kannada translations correspondingly look like this. Now, as input to this transformer — for both the inputs and the outputs — we're actually going to convert every single character into an embedding, rather than every single word as we've probably discussed in many videos before. When we encode every single character into an embedding, the vocabulary should ideally be a little smaller, so that there aren't too many parameters to learn and inference becomes faster. But over here you can see that the longest Kannada sentence has 639 characters, while the longest English sentence has 722 characters. You could go and plot a distribution, and I'm fairly sure you'd see a very long-tailed curve where only a couple of sentences are very long. I don't really want to accommodate all of those sentences that are super long, especially since there are only a few of them anyway; I'd rather accommodate the majority and try to decrease my dimension, so that there are fewer parameters to learn throughout the network and it becomes easier to train. So what I'm doing is looking at the 97th percentile of sentence length: this is basically saying that 97% of the Kannada sentences in my dataset have fewer than 172 characters and only about 3% are longer, and with English we have a very similar figure. What I'm going to do now is define a maximum sequence length — the maximum number of characters in a sentence — to be only 200; anything more than that we just get rid of from the dataset. I've written little helper functions that check these conditions: first, whether a line has valid tokens, that is, all of the tokens present in the line are actually tokens of the vocabulary we described up here, and second, whether it has a valid sentence length, that is, fewer than 200 characters. Only the pairs where both hold are the ones we actually use in training the transformer, and so we reduced from 100,000 to just 81,900, which is completely fine; the reason it's such a big reduction is mostly because both the English sentence and its Kannada translation have to satisfy these two conditions. Next we're going to create a dataset. PyTorch has a predefined class called Dataset, which is required in order to feed data and train a PyTorch model; it takes care of a lot of the boilerplate code under the hood so that there's consistency in how we fetch, batch, and do everything else with data. In this case, since we're working with text, we create a TextDataset class, and when you're creating a custom dataset you have to override __getitem__ and __len__, plus a constructor if you require one. In my case, __len__ is used internally by the Dataset machinery to report how many examples the dataset holds, whereas __getitem__ takes in an index and returns the corresponding English and Kannada sentences, which is what gets retrieved when we're training: as we iterate over a batch, we'll be getting an English sentence and a Kannada sentence, and this function is called to fetch that data. I'm creating a custom dataset because the built-in dataset classes don't really satisfy my needs, but you can check PyTorch's repository for datasets just to see if what you need is already built in; otherwise you can build a custom dataset like we are doing here.
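Here's a minimal sketch of such a text dataset; the class and variable names follow the walkthrough, but treat this as an illustration.

```python
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Pairs each English sentence with its Kannada translation."""
    def __init__(self, english_sentences, kannada_sentences):
        self.english_sentences = english_sentences
        self.kannada_sentences = kannada_sentences

    def __len__(self):
        # number of sentence pairs in the dataset
        return len(self.english_sentences)

    def __getitem__(self, idx):
        # returns an (english, kannada) tuple for this index
        return self.english_sentences[idx], self.kannada_sentences[idx]

# e.g. dataset = TextDataset(english_sentences, kannada_sentences); dataset[2]
```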
So when you execute the dataset and grab, say, the second element, you'll see it returns a tuple of the English sentence and the corresponding Kannada translation. For the sake of this entire setup, let's just say that the batch size is three. To explain why we're batching in the first place: say we just take one English sentence and one Kannada translation as the batch, so batch size one, which is essentially no batching whatsoever. If we pass in one Kannada sentence and one English sentence during training, we get some loss, then perform back propagation and update all of these millions of parameters to get a new state, and then we repeat this with another English and Kannada sentence, updating all of those parameters again. Updating each and every parameter for every single example can take a very long time, and your loss steps will also be very jagged. So, in order to speed up training, we parallelize how we pass information into the network: in this case I said three, so we can put three input sentences in Kannada and English through the network, and only after all of these have been read do we generate a single loss and back-propagate it, so the parameters are updated only once for every three inputs. We can increase the batch size to decrease the number of times the entire network gets updated, and this speeds up training, which is why, as is typical for many machine learning cases, we use mini-batch gradient descent. Let's say we set the batch size to three: what you'll see is two tuples of data. In the first you'll see three comma-separated English sentences — one, two, three, up till here — and then you'll see the corresponding Kannada translations in another tuple over here, and that's how the data is going to be given and processed during training. Next is tokenization. We have these sentences, but we need to convert them into numbers, because computers don't understand text, they understand numbers. So I've created this tokenize function, which takes a sentence, takes whichever language's character-to-number mapping you want (English or Kannada), and takes an optional start token and end token depending on whether we want to include them. If we have a start token, we prepend it to the beginning of the sentence; if we have an end token, we append it to the end of the sentence; and for the remainder of the positions we introduce the padding token that we discussed previously. Then we create a torch tensor, so that instead of a Python list we pass around a tensor, as everything in PyTorch is typically processed with tensors. To look at an example, let's say we have a batch; since the batch size is 3, it has 3 English and Kannada sentences over here. I then set up some empty lists: one for the tokenized English sentences and one for the tokenized Kannada sentences.
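A sketch of that tokenize function, assuming the special tokens and character-to-index dictionaries defined earlier; the exact signature is an assumption for illustration.

```python
import torch

def tokenize(sentence, language_to_index, start_token=False, end_token=False,
             max_sequence_length=200):
    """Map each character to its index, optionally add start/end tokens,
    then pad the sentence out to max_sequence_length."""
    indices = [language_to_index[ch] for ch in sentence]
    if start_token:
        indices.insert(0, language_to_index[START_TOKEN])
    if end_token:
        indices.append(language_to_index[END_TOKEN])
    # pad the remainder so every sentence ends up the same length
    while len(indices) < max_sequence_length:
        indices.append(language_to_index[PADDING_TOKEN])
    return torch.tensor(indices)

# e.g. tokenize("he is a scientist", english_to_index)                       # no start/end tokens
#      tokenize(kn_sentence, kannada_to_index, start_token=True, end_token=True)
```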
For every single sentence, we call tokenize for the English part, but in the English case I don't want to use start tokens and end tokens: we're passing them all in simultaneously anyway and we will always have the entire English sentence, so there's no need for a start or end token. For the Kannada case, I do need to pass in a start token, because during the generation phase you're not going to have any Kannada character to start with, so you have to inject something into the model, and that will be your start token; I'm also going to pass in an end token, just to indicate that this is where the sentence ends, and after that it's just padding tokens. So if you look at the Kannada tokenization, we have these three Kannada sentences, and their corresponding numeric representations look like this: this is the first sentence, this is the second, and this is the third. They've all been mapped from characters to integer indices using the character-to-integer dictionary we just created. We can see that these 123s are padding tokens, this 124 over here is an end token (end of sentence), and this zero is a start token. We can see something very similar for English too, if we wanted to try it out, where 95 in this case is the padding token; there's no start or end token, so it's just the padding tokens, and everything else is just the characters of the sentences you see. Now, in the last part of this video I want to very quickly touch on masking. Coming to this transformer architecture, you'll see that you don't really need the typical kind of masking for the encoder part; the only kind of masking you need is a padding mask, and that's just to say, "Hey, do not look at the padding when you're computing this loss function and updating the weights — the padding tokens mean nothing." So we might need a padding mask injected within the encoder, in its multi-head attention part. But with the decoder you need masked multi-head attention: the decoder handles the generation phase, and although during training you have all of your Kannada translation data, during inference you don't have any of it, so you shouldn't be looking forward to tokens you haven't generated yet — that's a form of data leakage, and we cannot have that. So instead we mask all those tokens and say, "Hey, any character that comes after the current character, we don't want to look at it; we have no context for it, we only have context from the characters that come before it." On top of this we also have a padding mask, just like the one I mentioned for the encoder, so padding isn't used when computing back propagation and the weight updates. Everything I've just described can be converted into look-ahead masks and padding masks for both the encoder and the decoder, and I've printed all of these out here. Instead of ones and zeros, I'm using zero to say "hey, this is not masked" and a large negative number — technically it stands in for negative infinity — to say "hey, this is masked", because eventually, if you look at the code, we're going to be passing this through a softmax function right over here, and softmax is essentially an exponential: whatever is zero becomes e to the power of zero, which is one — that's okay, that position is passed through — while negative infinity becomes e to the power of negative infinity, which is zero, meaning don't pass it through.
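Here is one possible way to build these masks for a batch of sentence pairs; the function name, the per-sentence mask layout, and the NEG_INFTY stand-in for negative infinity are assumptions sketched from the description above.

```python
import torch

NEG_INFTY = -1e9   # very large negative number standing in for -infinity (see the note below)

def create_masks(eng_batch, kn_batch, max_sequence_length=200):
    """Returns encoder self-attention, decoder self-attention, and decoder
    cross-attention masks of shape (num_sentences, 200, 200); 0 = attend, NEG_INFTY = masked."""
    num_sentences = len(eng_batch)
    # look-ahead part: True above the diagonal, i.e. not-yet-generated positions
    look_ahead = torch.triu(torch.ones(max_sequence_length, max_sequence_length, dtype=torch.bool), diagonal=1)

    encoder_mask = torch.zeros(num_sentences, max_sequence_length, max_sequence_length)
    decoder_self_mask = torch.zeros(num_sentences, max_sequence_length, max_sequence_length)
    decoder_cross_mask = torch.zeros(num_sentences, max_sequence_length, max_sequence_length)

    for idx in range(num_sentences):
        eng_pad = torch.arange(max_sequence_length) >= len(eng_batch[idx])  # True at padding positions
        kn_pad = torch.arange(max_sequence_length) >= len(kn_batch[idx])
        encoder_mask[idx, :, eng_pad] = NEG_INFTY
        encoder_mask[idx, eng_pad, :] = NEG_INFTY
        decoder_self_mask[idx][look_ahead] = NEG_INFTY      # can't look at future Kannada characters
        decoder_self_mask[idx, :, kn_pad] = NEG_INFTY
        decoder_self_mask[idx, kn_pad, :] = NEG_INFTY
        decoder_cross_mask[idx, :, eng_pad] = NEG_INFTY     # queries are Kannada, keys/values are English
        decoder_cross_mask[idx, kn_pad, :] = NEG_INFTY
    return encoder_mask, decoder_self_mask, decoder_cross_mask
```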
That's why we fill the masks with zeros and negative infinity rather than zeros and ones. I specifically do not use true negative infinity, though, and use a very low negative number instead, because there will eventually be cases where an entire row is masked. If a whole row is negative infinity, the softmax ends up dividing zero by zero, which leads to NaNs, or numerical instability, and if you get a NaN there, the output loss is also going to be NaN — not a number — and that's just not trainable. So instead I inject a tiny bit of information, which effectively doesn't change your model too much, and hence I do it here. The output of this — and you can try it out — is that you get an encoder self-attention mask where positions are zero (pass through) up to the last real character you see, and negative infinities after that. The decoder self-attention mask, you'll see, is a look-ahead mask: only the first entry in the first row is zero and everything else is negative infinity, in the second row only the first two are zero, in the third row only the first three are zero, and so on. Then we have a decoder cross-attention mask, which is more similar to the encoder self-attention mask in that it's a padding mask. I've put all of this together in a neat little class called SentenceEmbedding, which has this wonderful set of operations that I'm going to be performing, and I also have batch tokenization: we have a tokenization function, and everything I've written out is encapsulated in a function that handles batches of data. All right, so now that we've taken a look at a good portion of the different components — how to code out the encoder, the decoder, and even the parts that come before and after — let's look at the entire transformer code end to end, encoder and decoder together. In this case you'll see there are about 300 lines of code, and I'm going to explain all of it; although we've already done deep dives into different parts of this code separately, this section is just going to marry the two, so it's not going to be a super long section, but it will give you better context into how everything is connected in code. So let's get to it. As a reminder, we are currently in the process of building a translator from a language called English to a language called Kannada, which is spoken in a specific state in India, and the way we're going to generate the translation for an English sentence is by generating one Kannada character at a time — not one word at a time, or one byte-pair encoding at a time, as you'd probably see some other implementations do. So you pass in English sentences over here, and for each character we get a vector that encodes some contextual meaning. We take these vectors and pass them into the decoder architecture along with the Kannada vectors, which are offset by one position: there's going to be an additional start token preceding the sentence on the input side, while during training the target will be the entire Kannada sentence but without the start token. In the end, this is a way to train the model in parallel so that training is much faster. During inference, though, we would just use the start token, pad the rest, and pass that into the entire decoder architecture to generate the first Kannada character.
All right, so now that we've taken a look at a good portion of the different components, how to code the encoder, the decoder, and even the parts that come before and after them, let's look at the entire transformer code end to end for the encoder and decoder. You'll see it's around 300 lines of code, and I'm going to explain all of it; we've already done deep dives into the different parts separately, so this section is just going to marry them together. It won't be a super long section, but it will give you better context into how everything is connected in code. So let's get to it.

As a reminder, we are currently building a translator from English to Kannada, a language spoken in a specific state in India. We generate the translation for an English sentence one Kannada character at a time, not one word at a time or one byte pair encoding at a time, as you'd probably see in other implementations. You pass English sentences in over here, and for each character we get a vector that encodes some contextual meaning. We take these vectors and pass them into the decoder architecture along with the Kannada vectors, which are shifted over by one position: there's an additional start token preceding the sentence on the decoder input side, and during training the target is the entire Kannada sentence without the start token. In the end this is a way to train the model in parallel, so training is much faster. During inference, though, we just use the start token, pad the rest, and pass that into the decoder architecture to generate the first Kannada character. That character is then taken and appended to the decoder input in order to generate the second character, and we keep doing this until we get the end-of-sentence token, generating the full Kannada sentence one character at a time.

Now with that context out of the way, we can actually look at the code. This here is the Transformer class, which extends torch's nn.Module. We have d_model, which is the dimension of every single character vector; typically, and in this case, we take it as 512. ffn_hidden is the hidden size of the feed-forward layers we see throughout the network, which is 2048. num_heads is the number of heads in the multi-headed self-attention and also the multi-headed cross-attention. drop_prob is used for dropout: it's the probability that we switch off neurons to promote generalization through the network, typically 0.1. num_layers is the number of encoder layers as well as the number of decoder layers, and we can set this to 5 or so. max_sequence_length is the maximum number of characters in a given input sentence, which in our case we set to about 200. kn_vocab_size is the number of all possible characters that can occur in the Kannada sentence, that is, the translation. Then we have english_to_index, which is a Python dictionary that maps a character to a unique number, the same thing for Kannada, and finally a start token, end token, and padding token.

During the transformer's forward pass we take in x, which is the batch of English sentences, and y, which is the batch of target Kannada sentences. Then we have a couple of masks: the encoder self-attention mask, which incorporates the padding mask; the decoder self-attention mask, which incorporates both the padding mask and the look-ahead mask; and the decoder cross-attention mask, which also incorporates the padding mask. Four other parameters determine whether we include a start token and an end token in the encoder and decoder inputs and outputs; false means we do not include it and true means we do. We first take the batch of sentences x and pass it through the encoder to get the list of character vectors that will eventually be context aware. We then take those vectors into the decoder, along with the batch of Kannada sentences y, to finally get the output of the decoder, and we'll eventually pass that into a softmax activation when computing the loss function.

If we take a look at the encoder: the input x here is the batch of English sentences. We construct the sentence embeddings, which takes every single sentence and maps every character to integer values for every sentence in the batch, and then we pass it through the encoder. The self-attention mask is just the padding mask, because in the English sentence we are allowed to look both forward and backward, so we don't need a look-ahead mask here. The result is a list of contextually aware characters in the form of vectors.
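Condensed down, the wrapper described above has roughly the following shape. This is only a sketch: Encoder and Decoder stand for the stacks built earlier in the series, and their exact call signatures here are assumptions rather than the code from the repository.

```python
import torch.nn as nn

class Transformer(nn.Module):
    """End-to-end English -> Kannada transformer wrapper (illustrative sketch)."""

    def __init__(self, d_model, ffn_hidden, num_heads, drop_prob, num_layers,
                 max_sequence_length, kn_vocab_size,
                 english_to_index, kannada_to_index,
                 START_TOKEN, END_TOKEN, PADDING_TOKEN):
        super().__init__()
        # Encoder and Decoder are the modules built earlier in the series (assumed signatures).
        self.encoder = Encoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers,
                               max_sequence_length, english_to_index,
                               START_TOKEN, END_TOKEN, PADDING_TOKEN)
        self.decoder = Decoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers,
                               max_sequence_length, kannada_to_index,
                               START_TOKEN, END_TOKEN, PADDING_TOKEN)
        # Project decoder outputs onto the Kannada character vocabulary.
        self.linear = nn.Linear(d_model, kn_vocab_size)

    def forward(self, x, y,
                encoder_self_attention_mask=None,
                decoder_self_attention_mask=None,
                decoder_cross_attention_mask=None,
                enc_start_token=False, enc_end_token=False,
                dec_start_token=True, dec_end_token=False):
        # x: batch of English sentences, y: batch of target Kannada sentences.
        x = self.encoder(x, encoder_self_attention_mask,
                         start_token=enc_start_token, end_token=enc_end_token)
        out = self.decoder(x, y, decoder_self_attention_mask, decoder_cross_attention_mask,
                           start_token=dec_start_token, end_token=dec_end_token)
        out = self.linear(out)  # logits over Kannada characters; softmax happens in the loss
        return out
```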
The decoder, on the other hand, takes this list of encoder vectors x, but also the batch of target sentences y. We perform the sentence embedding, that is, the tokenization of characters, and we also decide whether to include a start token or an end token; this depends on whether it's the input to the decoder or the expected output of the decoder. The input to the decoder includes only a start token, while the output does not include the start token but does include the end token. We then pass it through the decoder, passing in the self-attention mask and the cross-attention mask, and we get the output, which is a set of Kannada character vectors.

So that's the entire transformer neural network code, and now we want to use it in some way. I've put together a trainer notebook, which you can use to train the network, and I've already gone ahead and trained it on the task of English to Kannada. We'll go through that trainer notebook code step by step, and we'll also go through some fun inference translations, just to see how well the transformer neural network did. So let's get started.

Before getting started, I wanted to give a quick shout-out to the user account Tiger07, who helped point out a specific issue: in my last training videos for building out the transformer, I made some errors, and they pointed out the two lines of code I needed to change throughout the network. This is the line itself; it was just about reshaping tensors. In PyTorch, I was doing a plain reshape on the values tensor, which would have completely scrambled my tensors, when I actually needed to permute the dimensions. It's a very minor technical issue, but a major one in the sense that it stalled me for a very long time, and I'm really grateful that when I reached out to the community, you all responded so well. I also want to give a shout-out to the account PingNG (sorry if I'm mispronouncing that), who recommended the exact same solution but also detailed why it was the case, so thank you so much as well. It just shows how well this community responds, because neither Slack nor ChatGPT could really help me out, but you all did. Thank you so much.

For now, let's go to the part where we're actually instantiating the transformer. The only thing I really want to point out here is that we use only one encoder layer and one decoder layer, so this is going to be the simplest of simple transformer neural networks; I did that only so it would train faster and still show some reasonable results. I ran this training for about one and a half hours, for 10 epochs, on a data set of around 200,000 English sentences to translate into Kannada, and I've printed the training progress out as it goes, 100 steps at a time.

Now let's get to the good part: transformer inference. This is a character-level language model, so we generate one character at a time. You'll see that everything is inside a for loop where I generate one character, append it to the sentence, and keep doing this until I generate an end token, which signifies the end of the sentence.
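That loop boils down to something like the sketch below, assuming the Transformer wrapper and the create_masks helper sketched earlier, plus a hypothetical index_to_kannada dictionary that maps indices back to characters; the notebook's actual code may differ in the details.

```python
import torch

def translate(eng_sentence, transformer, index_to_kannada,
              max_sequence_length, END_TOKEN):
    """Greedy character-by-character decoding (illustrative sketch)."""
    kn_sentence = ""  # decoder input starts empty; the start token is prepended internally
    for _ in range(max_sequence_length - 1):
        masks = create_masks([len(eng_sentence)], [len(kn_sentence) + 1], max_sequence_length)
        predictions = transformer([eng_sentence], [kn_sentence],
                                  encoder_self_attention_mask=masks[0],
                                  decoder_self_attention_mask=masks[1],
                                  decoder_cross_attention_mask=masks[2],
                                  dec_start_token=True, dec_end_token=False)
        # Read the logits at the position of the most recently generated character.
        next_token_logits = predictions[0][len(kn_sentence)]
        next_char = index_to_kannada[torch.argmax(next_token_logits).item()]
        if next_char == END_TOKEN:
            break                    # end-of-sentence: stop generating
        kn_sentence += next_char     # append and feed back in on the next step
    return kn_sentence
```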
Having done that, you get a few examples, so let's look at a few of them. When I ask it to translate "what should we do when the day starts", the translation it gives is this sentence over here, which says "what do we do about this". Although it doesn't translate the sentence exactly, we can at least see that some commonalities are retained: these last two words over here and these last two words over here are preserved in some way. Those two words mean "what should we do", and they correspond to this part of the sentence, so it gets part of the sentence right, but it very clearly distorts the overall meaning. It's not getting everything correct, and you can attribute that to a few things. The first is that the model is just too simple: it has only one encoder layer and one decoder layer, and if you increase the number of encoder and decoder components you would probably be able to pick up more idiosyncrasies of the language. That's probably the biggest reason, but you can also try increasing the training set size or the number of epochs you train for.

The second sentence is "how is this the truth". The actual translation is "idu hege satya", whereas the generated translation says "hege idu hege hege". That's not really a meaningful sentence, but you can see some commonalities again: this "idu hege" shows up in the generated output as well. "Idu" means "this", "hege" is "how", and "satya" is "truth", so the reference translates to "how is this the truth". The model does generate part of it, and just by looking at these two examples you can see it is definitely learning something, even if it isn't complex enough to pick up everything about these sentences.

The next example is "my name is AJ", which should translate to this, but it translates to this instead. You can again see some commonalities: this word means "name", and this one is very close to the word for "my" but actually means "me", so it didn't quite pick up on "my name" at all. Although the overall translation is off, once again we see some words that are actually common and correct.

Now this one's interesting: "why do we care about this". The translation it gave, read with punctuation, would be something like "why, what's the reason", which is actually very close to the meaning of this initial sentence over here, so not bad; it did pretty well there. The next is "this is the best thing ever". Here it generates this sentence, whereas the actual translation is this sentence; what it generated translates back to roughly "this is very unusual". So although the meanings are somewhat off, you can once again see some commonalities between them.

This is probably the most interesting example of the lot. I wanted to translate "I am here". The actual translation is this, and the translation produced by the model was this. They have different meanings: this one means "I am here", whereas this one means "I've heard something". But even with the different meanings, you can see that for a character-by-character generation this translator is actually doing extremely well. In the translator's eyes, the only things it got wrong were this "e" and this "k"; those two letters are the only differences in the entire translation, barring one other small character that is shared between the two anyway. Either way, just one or two characters are wrong, so from the model's perspective this translation is actually a pretty good one. But it made me think more about the fact that this is a character-level translation, and in general word-level translations might actually perform better.
The caveat of using word-level translations, though, is that your transformer will need a much larger vocabulary than it has now. If I scroll up to see the length of the set of possible characters, the tokens that can possibly be generated, you can see it's only a small set of values, maybe somewhere around 50 to 100 tokens. If it were words, and all possible words were in this list, the list would explode into the tens of thousands, because there are so many words that could need to be generated. So there's always this trade-off: a larger vocabulary gives you more interpretable units, but you need a more complex system, probably a more complex translator with many more parameters, to account for the words themselves. With words, though, the sentence length effectively decreases: here I set the maximum sentence length to 200 characters, but the number of words in a sentence doesn't have to be anywhere near 200; it could be just a dozen or so, and we could cap it there.

Here's an interesting one too. It says "click this", and the reference translation is this, something ending in "click maadi", but what the model actually generated was "click click click click click click maadi". So although it does get the last part right, "click maadi", it just loves repeating "click", which is funny, but it is once again understanding what the task is, at least to an extent. The same thing happens here with "where is the mall": the translation it generated was "yelli yelli yelli", which is "where where where", so it at least got the "where" part, but it didn't generate anything else. Now, "what should we do": the reference translation is "enu maadbeku", and it generated this correctly, but it absolutely fumbles on this one here, "today what should we do". I have no idea why it generates this; it says something like "adannu maadidara mele maadida maadidare", so it just loves "maadi". "Maadu" is "to do", and I guess that's a very common phrase you see everywhere in both English and Kannada, which is why you see all kinds of forms of it: whenever it sees something like "do", it probably tries to produce "maadi" everywhere. Again, very interesting, but it completely fumbles, despite the fact that in English this is pretty much the same sentence as the one before, just with an extra word. Interestingly, when I phrase it as "why did they do this", the sentence is generated almost perfectly; but again, this is about doing something, so it's a very common sentence, or at least a common phrase, and that's probably why it did so well. This last one is also very interesting: "what's the word on the street". The generated translation is this, which means something like "what is the topic of this" or "what is this about", and that does semantically relate to what this little idiom actually means.

Let's now go through some insights, where I'll give you some information and tips for building out a transformer on your own, with any language. First of all, I'd say: create your translator with a language that you understand, ideally, because it's just so much easier to see where the transformer is doing things correctly and where it's doing things incorrectly.
Generating that insight for yourself, I think, is very important, and you can do it much better if you understand the language itself. In this case people were asking why I used Kannada: it's because I can understand it and properly evaluate it; otherwise I wouldn't have been able to come up with the insights that I did.

Piggybacking off of that, here's an insight I think is pretty important and that I haven't really seen anywhere else. The English character set is what we call an alphabet, where every character has roughly a phonetic representation. A language like Kannada uses an alphasyllabary instead, where the individual units are complete syllables. That means that, for example, this written character here, "ma", would be written in English as "m" plus "a" with a long-vowel mark; it's multiple characters in English, but a single written character in the Kannada script. However, the way I'm tokenizing the data, I'm also treating it as multiple tokens: it becomes "ma" plus the vowel sign, even though in the Kannada script it's really one character. What would make more sense semantically is a tokenizer that doesn't divide a Kannada word all the way down into these sub-characters, but instead divides it into actual Kannada characters, each of which may be a combination of two or more of these code points. Alphasyllabaries are also not confined to Kannada; there are many alphasyllabaries out there, so just understanding the writing system may give you a translator that is more meaningful, and I highly recommend you try this out.
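One way to try this is to split text into extended grapheme clusters rather than individual Unicode code points, for example with the third-party regex module's \X pattern. This is just a sketch of the idea, not what the tokenizer in this series does, and the exact outputs shown in the comments are approximate.

```python
import regex  # third-party module: pip install regex

def to_grapheme_clusters(text):
    """Split text into extended grapheme clusters, so a consonant plus its
    vowel sign stays together as one token instead of being split apart."""
    return regex.findall(r"\X", text)

kannada_word = "ಮಾಡಬೇಕು"  # "have to do"
print(list(kannada_word))             # code points, roughly: ['ಮ', 'ಾ', 'ಡ', 'ಬ', 'ೇ', 'ಕ', 'ು']
print(to_grapheme_clusters(kannada_word))  # clusters, roughly: ['ಮಾ', 'ಡ', 'ಬೇ', 'ಕು']
```

Tokenizing on clusters like these keeps each written Kannada character as a single unit, which matches how a reader of the script would actually segment the word.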
Another insight is similar to what I described before. Here we're generating one character at a time, so we have a smaller vocabulary but a longer sentence length; you could instead play with generating one word at a time, where you'd have a much larger vocabulary but a smaller sentence length. A good mix of both worlds is to use something called byte pair encoding, which gives you sub-word tokens. The issue is that it's very hard to find or create a byte pair encoding for certain languages if they don't have a great research or online presence; it was hard for me to find one for Kannada, and hence I went with character tokenization for now, to illustrate the concepts and ideas. But if you're able to create byte pair encodings for your input and output languages, I think that would be a good starting point; in fact, I believe this is essentially what happens in the original paper and in a lot of the other research associated with generative models these days.

Another one is to make sure that your training set has a large variety of words in general. You could see above, when I was illustrating the examples, that a lot of the sentences revolve around "to do": "maadi", "maadbeku", and so on. In fact, if you look at this data set, there are millions of records, and at the very least 10,000 cases of this one word "maadbeku", which is "I have to do"; it's a very common phrase. So I'd suggest you plot every single word and its frequency count, just to get an idea of what kind of data set you're dealing with: whether it's catered to things like news, government articles, and politics, or to general, random sentences, which is ideally what you'd want for a general-purpose translator.

The last one is more technical: increase the number of encoder and decoder units. I only used one of each to keep things simple, but you can try more encoder and decoder layers to pick up more of the complexities and intricacies of your languages. Overall, yeah, this model has definitely learned something, and you can use the same setup for other languages instead of Kannada as well.

And that's going to do it for this series. I hope you now have a good idea of how the transformer works, its code, and its architecture. All of the code for the entire series, along with the diagrams, is going to be in this GitHub repository, which I'll link in the description below. Once again, thank you all so much for 100k; onwards and upwards, as the corporations say, and I will see you in another video. Bye bye.