Okay, so today we're going to be sharing the screen. Share screen. Boom. And here we go. All right. So, foundations of deep learning — that's me, follow me on Twitter, yay. Okay. So today we're going to be talking about attention. And specifically, there are two types of attention: there is self-attention versus cross-attention, and there is also hard attention versus soft attention. But in general, when we talk about attention, we're going to be talking about dealing with sets. To give you a small preview: transformers are made of attention modules, and transformers are something that maps sets to sets. They don't really deal with sequences. They can, right? A sequence can be thought of as an ordered set, but they don't necessarily need ordered sequences. That's the super cool part. All right, so let's get started. Let's start first with self-attention. So we have a set of x's. These are my pink x's, which are x subscript i, with i going from 1 to t, right? So you have t different elements in this set: x1, x2, and so on up to xt. Now we can think about each of these xi's as belonging to R^n, right? That's the classical representation we have seen so far: an n-dimensional vector. So all there is to this self-attention is that my h, my hidden representation, is going to be a linear combination of these vectors. And if you remember from one of the classes early on — I think the fourth class — I showed you a quick notation for writing such a linear combination of vectors, and that's simply a matrix multiplication. So this set of x's, which are t vectors of dimension n, can be thought of as a matrix, capital X in pink, that has n rows — because the height of these guys is n — and t columns, right? You can think about that as the horizontal stack of these vectors. So this is my set, to which the stacking has now given an apparent order. And so these hidden representations can be written as what? The matrix X times the vector of those alphas, which I'm going to call a, in bold type. So in this case, my h is the X matrix times a. It's a bit funky, right? Usually when we think about a hidden representation, we have a rotation of the input by a weight matrix. But in this case it's like, huh, what is this? A rotation of the attention? I can't really think of it that way. So I prefer to think about this as the linear combination of the columns of X — the linear combination of those x's. Cool. And we can write it there, top right, just to keep it in memory. Right, now we can think about hard attention. If we impose that the zero-norm of this a vector — the number of non-zero entries — is equal to one, and that the non-zero entry is equal to one as well, that means a is a one-hot encoded vector. And what happens if you multiply this X times a one-hot vector? You're going to select the specific column where the one is, right? So if the second element of a is equal to one and all the others are zero, when you multiply capital X times a, you're going to retrieve exactly the second column. So you can see how this works out: this attention just pays attention to one element of your set, right?
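Just to make the hard-attention case concrete, here is a tiny PyTorch sketch with toy sizes (my own example, not the notebook's code): a one-hot a really does pick out a single column of X.

```python
import torch

n, t = 4, 3                          # toy sizes: feature dimension n, set size t
X = torch.randn(n, t)                # the set of t vectors, stacked as columns
a = torch.tensor([0., 1., 0.])       # one-hot a: "pay attention to element 2 only"
h = X @ a                            # hard attention: h is exactly the second column of X
assert torch.allclose(h, X[:, 1])
```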
And it pays attention to one element only — that's the hard-attention case. But we can also have a different attention, which is soft attention. In soft attention, instead, the constraint is that the elements of a, the alphas, have to sum to one, okay? And that's the difference. In this case, h is going to be just a linear combination of the columns of the matrix capital X, right? So far so good, right? There's no weird stuff. Okay, so second part, attention number two. We're going to figure out where these a's come from. And in this case, a, as you can figure out, is going to be the argmax or the soft argmax of X transpose times x. Transposing X means every row is one sample, and then I compute the scalar product between each row and the vector, so that in the final product every element is the scalar product of each and every vector xi, with i from 1 to t, against this specific x, okay? Is it clear so far? Question: so what's beta? Beta is the parameter for the soft argmax, right? The inverse of the temperature. Whenever you have the soft argmax — the exponential of the argument divided by the sum of the exponentials — inside the exponential we have beta. Where does beta come from? Whenever you have the soft argmax, or what people call softmax, there is always a beta parameter; usually it's set to one, so you don't see it. So, before — okay, I didn't talk about this yet. The big X, right? Big X is this set of columns, and then h is going to be a linear combination of these columns. But where are these alphas coming from? So a given alpha — my first vector a — is going to be: each and every column of X, which means every row of X transpose, multiplied by x, right? And you get the scalar product of these two guys. And so you get the value of the scalar product of my x against every other x in my set, okay? Question: is x a column vector? Yeah, x is one vector of size n, okay? Like it was written before; x is a generic x. The square bracket represents an optional argument: you can have either the argmax, and therefore you get a one-hot a, or the soft argmax — which is the one we've been using all course so far — and you get the exponential divided by the summation of all the exponentials. That's the classical one, the one that people usually call softmax. All right, but then we said we have a set of x's, right? And if you have a set of x's, that implies you have a set of a's, because for every x you're going to have an a. And if you have these many a's, which are vectors, you can stack them one after the other and you get capital A. Capital A is going to be of height t — the lowercase a has size t, just because you have t rows in X transpose — and then you stack t of them, because you have t x's. I hope it's clear so far. All right, cool, no? What's missing right now? What's missing is the following: given that now we have a set of a's, you're going to have a set of h's, right?
If you look at the top right side of the slide, h was this matrix multiplication between capital X and a, right? h was the linear combination of the columns of my X matrix. But given that we have many a's, we are going to end up with many h's. How many h's? The same number as the a's. And so you're going to have this capital H matrix, which is going to have many columns. Question: sorry, so the small x in the soft argmax equation, the one that belongs to R^n — that's one of the columns of X? Yeah, yeah. You can call that xi, and then a would be ai, right? But I removed the index so it becomes a bit less cluttered. Question: okay, but X is t by n, right? So X transpose should be n by t? No — X is going to be n by t, right? X is the stack of all these columns, one after the other. Okay, I got the dimensions the other way around, thanks. Yeah, so there are two options: you can think about X as being a stack of rows, and that's usually what is done in code. But I think it's much easier to write it down this way when you write the math — I mean, I prefer it. Okay. And so, again, given that we have many a's, you're going to have many columns in the H matrix. And we can simply write that my capital H, which is the set of these h's, is going to be the linear combination of the elements of X using the coefficients in the first column of A, the second column of A, the third column, and so on. So that was pretty much it about attention. You basically mix the components of the set of x's — which we can represent as a matrix — using these coefficients, which are computed with the argmax or the soft argmax, where each component inside (we can call the stuff inside here a score) is simply the scalar product of one given x against all the x's in my set. Okay, so this is the first part of the lecture. It should be clear up to here before we move forward. So, is it clear so far or not? Question: could you please explain the first equation again? What is the soft argmax, and why are you multiplying big X transpose with x? Right. So in the previous slide, we simply said — there is only one line here, basically — that the hidden representation is going to be a linear combination of these x's, right? Yes. And this linear combination uses these alphas, which are contained in this vector a. Yes. And so I wrote this one here: h is going to be the linear combination of the columns of X, where the columns of X are the elements of my set. Yes. Okay, so if you got that, we go to the second slide, and I tell you how to compute one a, okay? To compute one a here: this one a is going to be, for example, the soft argmax — which, again, people call softmax — of what? You have a specific x here, right? And you compute the product between X transpose and this x. X transpose has all those x's as rows, right? So let me draw this: X transpose has the first sample, the second guy, the third guy as rows. And then I multiply this against my guy here. If you do a matrix–vector multiplication, the first item is going to be the scalar product of the first guy against myself, then the second guy against myself, and the third guy against myself, right? Okay, so then you get, for example, three scores. You're going to see how close, how aligned your vector is with respect to the three items in my set.
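Let me pause here and put the whole first part together in one minimal PyTorch sketch (toy sizes, beta fixed to one — my own example, not the notebook's code): the scores X^T X, the soft argmax we are about to apply down each column, and then H = X A.

```python
import torch

n, t, beta = 4, 3, 1.0
X = torch.randn(n, t)                      # set of t vectors in R^n, stacked as columns
scores = beta * X.T @ X                    # entry (i, j): scalar product of x_i with x_j
A = torch.softmax(scores, dim=0)           # soft argmax down each column: every a_j sums to one
H = X @ A                                  # each h_j is a convex combination of the columns of X
print(A.shape, H.shape)                    # torch.Size([3, 3]) torch.Size([4, 3])
```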
So then I run the soft argmax, right? I take these three values, and at the end I get a first value, a second value, a third value, and they sum to one, right? Yes. Okay, got it, thanks a lot. Sure. And you can have either one: the soft argmax, with the exponentials and so on, or the argmax, which is basically sending beta to very large values, right? So you can just write soft argmax and use a large beta. Question: I'm still confused — why is this a vector in R^t? Which one, the line defining a? This one, right? Okay. So X transpose contains my x's in this direction, right? This guy here is my X transpose, and this length here is n, and this is t. So if you multiply this t-by-n matrix by this guy, which also has length n, you're going to get a vector of size t, right? Does it make sense? Question: so is a a one-hot vector? a can be one-hot if you use the argmax, or it's going to be a softer version if you use the soft argmax, which is the exponential divided by the sum of the exponentials, right? Question: well, when you multiply a matrix by a vector you get a vector, so if you take an argmax over that, you should get a scalar, no? No, the argmax is going to give you the index — the one-hot vector corresponding to the maximum, right? You can think about argmax as giving you a one where the maximum is and zeros everywhere else. Okay, sure. Does it make sense? Yeah. So if you have three, seven, nine, two — a vector of four elements — the argmax gives you zero, zero, one, zero, no? It gives you a one at the position where the maximum is. Okay. Question: is beta a scalar or a vector? Right now you can think of it as being equal to one; there is no need for it yet, okay? Question: for a little bit of additional clarification — the xi terms, are those one-hot vectors that represent our input? The xi's are just my input; they don't have to be one-hot. But they could be, if they represent words, no? I think they are usually embeddings, so they are actually dense. I don't think they are one-hot. Oh, okay. Then X transpose times x is kind of determining the similarity between — okay. So this one is determining how similar each element in my set of x's is to my x, right? So this a tells me how all of these guys here compare, in similarity, with respect to this guy over here, okay? Got it, thanks. Cool, cool, cool. All right, so this was the first part. Let's move on and see how this can be improved and expanded. So here we have a definition: the key-value store. This is a data structure, just to give you a little bit of background. It's a paradigm for storing, retrieving, querying, and managing an associative array — a dictionary, a hash table. What does it mean? For example, say you want to find a video about how to make lasagna on YouTube, okay? So you go on YouTube, you write lasagne — or lasagna, whatever — and you press enter. So you have a query, and the query is checked against all possible keys in the dataset, where the keys could be the titles of the videos, or the descriptions, right? And so you check how aligned your query is with all those titles available in the YouTube dataset. When you find the maximum matching score, you retrieve that one, right?
So if you do the argmax, you just retrieve one video. Otherwise, if you do the soft argmax, you basically get a probability distribution, and then you can retrieve in order: you retrieve the most aligned video first, and then a sequence of less and less relevant videos, right? So is it clear so far what this key-value store paradigm is? You have a query, which is your question; given one query, you check all the keys, you find how well each key matches your query, and then you retrieve those videos, those values — that content. And we're going to do exactly the same here: we're going to specialize a little bit what we have seen so far, which was pretty trivial. Question: the key here would be the title of the video? Yeah, that's correct. The keys are the titles of all the videos on YouTube. You have one query — "how to cook lasagna". You check this question against all the keys, and when you find the argmax, the index of the highest score, you retrieve that one, right? Or again, if you do the soft argmax, you get probabilities, and now you can sort by probability, for example. So we now have queries, keys, and values. What are these? Well, they are simply rotations of my specific input x. My q is my x rotated by W_Q. My key is, again, my x rotated by W_K. And my value v is the x rotated by W_V. Okay, so so far we just added three more matrices, and that's it. And this is how we can add so much more: well, we finally added some trainable parameters, right? So far we didn't have any trainable parameters at all. Question: what would x be in the lasagna metaphor? x is me, very hungry. And so me, very hungry, goes and tries to write some question about how to get the food done. Okay. But then, given that I also know how to cook, I can also check against — there you go, right? — the book about cooking lasagna. So I'm hungry, and that would be my x. My query would be: what is the best recipe I can find? And then I can check in my memory, in my own head, all the possible lasagna recipes I have — or in my mother's cookbook, right? So I check, check, check, check, and then say: oh, my grandmother's lasagna. And I retrieve the recipe from my granny, which is, you know, amazing. Makes sense, right? I'm hungry. Question: is there a reason why we don't add nonlinearities here? Yeah. This attention thing is completely based on orientation: you just check what the orientation of these vectors is. That's how attention works; we don't use nonlinearities here. The only nonlinearity is when you compute the probability distribution, the soft argmax. Makes sense? Okay, cool. All right, so you're on board. So, again, we first introduce these learnable parameters such that we can train something, right? Machine learning, yay. Okay, so q and k have to have the same length, the same dimensionality, because you're going to check one query — one question, how to make lasagna — against all the possible representations of the titles, right? They have to have the same length, because otherwise you can't compare orientations; they have to live in the same space.
v — that's the content of my recipe, right? I don't care about its length. It can be five pages of recipe; that's the whole recipe, right? So v in my case can be huge. And the key is the representation of the title, which matches the size of the representation of the question. Cool. Let's make things simple and dumb everything down: we just set d′ equal to d and d″ equal to d, so everything is just d. All right. So we said we have a sequence — no, not a sequence, I'm wrong — we have a set of x's, right? And given that we have a set of x's, you're going to get a set of queries, a set of keys, and a set of values. And yes, as you can imagine, you get matrices by stacking all those columns: all the q's, all the k's, and all the v's. How many columns do you have? t columns, because you stack t vectors. What is the height of the vectors? Well, d, because that's what we just said. Okay, cool. So what next? Before, we said that my a was the soft argmax or the argmax of — well, now you check one query against all these keys, right? So we first transpose K, so that the keys become rows, and then I multiply my first key times my query, my second key times the query, my third key, and so on. How many rows do I have? Lowercase t, right? So at the end you have t scores, and then you compute the soft argmax, and you get the probabilities. I think it makes sense — I mean, it makes sense to me, I've spent a bit of time with it — but does it make sense to you? Question: what's the difference between q and k? k represents the key, which is the title in the recipe book, right? q is my question: I want to make lasagna. And then I check all the titles of my recipe book — how to make pizza, how to make pasta, how to make ravioli, how to make tortellini, how to make polpettone, how to make lasagna — hey, there you go, you get a high score, and then you retrieve that one. Okay, cool. Question: hey, Alf. I understand why the query is derived by a transformation of x, but I don't understand why k and v are also derived from x. For instance the value: in your analogy, v would be a video, so why do we derive it from x? Yes, yes, you're completely right, and that's going to be the next slide. But this one is called self-attention. You are actually doing retrospective work: you're thinking in your head, I want to make lasagna, so let me think even harder about which recipes I can make — and you found it, you have your recipe. Everything is in my head, and given that my head is x, all three things come from my head. This is called self-attention. Otherwise, if you're a bit more normal than me and you just go and get a recipe book, the question still comes from your x, from your brain, but the keys and the values come from the book — keys and values come from somewhere else. And that's going to be cross-attention: you check your query against all those keys, and then you retrieve the final thing. All right, so what's next? Next is the fact that, yeah, we said the hidden representation is going to be a linear combination of these v's — the columns of the V matrix — weighted by those coefficients, the alphas, that are inside my a.
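Here is how that looks as a tiny PyTorch sketch (toy sizes, beta still one; the projection matrices are random placeholders rather than trained weights — this is my own illustration, not the notebook's code): project the set into queries, keys, and values, score one query against all keys, then mix the values.

```python
import torch

n, d, t = 6, 4, 5
X = torch.randn(n, t)                       # set of t inputs in R^n, as columns
W_q = torch.randn(d, n)                     # placeholder "rotations" (in practice these are learned)
W_k = torch.randn(d, n)
W_v = torch.randn(d, n)

Q, K, V = W_q @ X, W_k @ X, W_v @ X         # queries, keys, values: each is d x t
q = Q[:, 0]                                 # one query, e.g. "how do I make lasagna?"
a = torch.softmax(K.T @ q, dim=0)           # score the query against every key, then soft argmax
h = V @ a                                   # retrieved content: a mixture of the value columns
```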
So this is exactly the same as what we have seen before, except I specialized it: instead of using X and X transpose all the time, I now have queries, keys, and values — queries on one side, keys and values on the other side. Question: Alfredo, in yesterday's lecture you and Professor LeCun said there's only one query, but multiple keys and values. So here "one" means there's only one query per xi — one qi corresponds to one xi — but it has to interact with all the other keys? Right, right — exactly, thank you for reminding me of that. So the point here, in the last line I showed you: you have one q, one question — how to make lasagna — and you check how it matches all the titles in the book, right? One question goes through all the titles in order to find where the correct title is. So: one question, how to make lasagna, and you check how to make pizza, how to make pasta, how to make tortellini, how to make polpettone, and then you retrieve the one that actually matches. You have one question, you check many keys, and you retrieve the value of the one that matches. Or, if you actually get two high scores, you can make a mixture of recipes, right? And I don't know how well that interpolates. Does it make sense? Wait, someone was talking before. Question: so I'm ordering by two recipes? Right. So if I look for "how to make lasagna" and in my book I have "lasagne" with an "e" at the end, plus some other thing that sounds very similar — actually, let me think of another example. Say I want to make pizza, but there is another recipe called chicken pizzaiola. Pizzaiola is not pizza, but it sounds similar. So if I look for pizza, pizzaiola is also going to get a high matching score, right? And so if I take the argmax, it's going to work fine; if I take the soft argmax, I'm going to get a combination of pizza and pizzaiola, because they are similar — some probability mass goes to the other guy as well. And then when you retrieve the values — the values are, again, the linear combination of those columns multiplied by these coefficients — if you have a one-hot, you get just one value. But if there are multiple non-zero coefficients, like with a soft argmax, you may have several values that get mixed together. Question: you say we get two very close candidates with a high value — say pizza and pizzaiola — but here my query, the q, is only one, right? So q is going to be "pizza". But then among the keys there are two recipes, pizza and pizzaiola, and these two are very similar, so both of their scores will be somewhat similar, right? And so when you do the soft argmax, you don't get only one very high score; you get a high score and then another high score. So h, which is the linear combination of these recipes, is going to be something like the average of the pizza and the pizzaiola. Got it. Okay, yeah, it totally makes sense, thanks. Of course. There was another question — no? Okay, maybe we should move on, because we have 15 minutes left. Okay. Again, how many queries do we have? We have t queries, right? We have many queries. And therefore — oh, hold on. So what's beta?
Beta: we like to set it to one over the square root of d. Why is that? Because if you have a vector of all ones: in one dimension, the length of the vector of ones is one; in two dimensions, the vector with all coordinates equal to one has length square root of two; a vector with all components equal to one in three dimensions has length square root of three; with four components it's square root of four, and so on. So for a vector in d dimensions, the magnitude grows with the square root of the number of dimensions. And in order to keep the temperature of this soft argmax constant, we want to divide by the square root of the number of dimensions. Again, it's a technicality — it doesn't matter if you don't get it. Again, how many queries do we have? t. And therefore we can get t a's, right? And therefore we get a matrix, capital A. So finally you get a big H, which is simply the multiplication of the values matrix by this matrix whose columns are the mixing coefficients. And that was pretty much it about cross-attention. Question: could you say exactly why q and k are of the same dimension, but v is different? I would expect k and v to be of the same dimension, but q to be something else. So v is going to be my pizza recipe, right? It's going to be ten pages long — okay, one page long. Instead, q and k are the question and the title, and they have to match because I'm going to compare them — I'm going to compute the matching degree, how aligned these two are. This comes from this thing over here: whenever I check my query against all keys, I do a row-times-vector product, and they have to have the same size, the same length, right? Otherwise I can't multiply. So you check your question against all these keys and you get a score, right? Yeah, I think that makes sense. And then you get the recipe, which can be the whole YouTube video of me cooking, or whatever, right? Right, okay. You can come over, but — okay. Thank you. Sure, anytime. I'm hungry. Okay, okay. So, one implementation detail, no? For example, to make things faster, we can stack all those W's into one tall W, right? And then, boom, you compute all of Q, K, and V in one go, in one matrix multiplication. Nothing fancy, right? We have done the same before for the RNN — remember, we were stacking the x and the h and computing things with the stacked version of the W, on the top right-hand side. Okay, so nothing fancy. Oh, then there is something else called heads. What I showed so far can be one head, but we may have multiple heads, for example h heads. And if I have h heads, I'm going to have h Q's, h K's, and h V's. So you end up with something that is h times taller, right? But you can still bring it back to whatever dimension you want: you take this big guy at the end — this final big vector, which is the stack of all the v's — and multiply it by a matrix to bring it back to size d. This is one possible way of implementing this stuff, right? But again, details, not too important. So let's finally get to this transformer. What the heck is this transformer? Originally the transformer is made of two blocks: an encoder and a decoder. Okay. So where have we seen this encoder–decoder architecture before? Right, in the chat: autoencoders. There you go. So, a quick recap of autoencoders, right?
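One small code aside on that stacking trick before we get back to autoencoders — a minimal sketch with toy sizes and placeholder weights (my own illustration): stack the three projection matrices so Q, K, and V come out of a single multiplication. With h heads you would make W taller still (3·h·d rows) and chunk it further.

```python
import torch

d, n, t = 4, 6, 5
X = torch.randn(n, t)                  # the set, as columns
W = torch.randn(3 * d, n)              # W_q, W_k, W_v stacked vertically into one tall matrix
Q, K, V = (W @ X).chunk(3, dim=0)      # one matrix multiplication, then split back into Q, K, V
```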
So, autoencoders: we have the diagram on the left; today we are going to focus on the diagram on the right-hand side. You have two blocks: an encoder and a decoder. The encoder maps the x to the hidden representation, and the decoder maps the hidden representation back to the input again. So we have these two major components in the autoencoder, and in this case we're going to have something similar, more or less. So this is the transformer encoder, the purple block. Inside this guy we have the self-attention — cool, we already know what that is. And in the upper part, we're going to run basically one linear layer per component: if you think about a convolution with kernel size one, you basically apply the same linear layer to every element in the set, right? Sometimes they call this "feed-forward", but it's a feed-forward applied to every element in the set — so it's actually a convolution where the kernel size is equal to one. Then we apply a module, which we can call "add & norm", after both of these guys. What is this module? This guy is basically a box with two components: an addition component, and then a layer normalization. And so if we connect this guy on the right-hand side, you can see that the self-attention basically has a residual connection bypassing it, followed by the layer normalization. And the same happens for the other guy on top: the convolutional part also has this residual connection and the layer normalization. So how does it work overall? You put the set of inputs at the bottom, you let it bubble up, and you get the hidden representation at the output of the encoder — this h-encoder. Nothing fancy, right? I just put together two blocks: we saw the self-attention just before, and the 1D convolution you know exactly how it works — it's just applying a single multi-layer perceptron to every component in the set. The normalization helps the gradients flow back later on, and the residual connection makes everything smoother. All right, so this was the encoder. What is the decoder in this case? Let me clean up a little: let me remove the encoder box, remove that horizontal line, and actually even remove the x at the bottom and the final output, right? So this was the encoder, but now I'm going to delete the connection there in the center. Okay, and now we're going to talk about the decoder. The decoder is exactly like the encoder, but I'm going to have a cross-attention — like someone of you was mentioning before, of course; you were asking me: why the heck are you getting these keys from yourself? This cross-attention gets connected right after this normalization module, and of course the cross-attention gets the hidden representations from the last layer of the encoder. And then what else? You have the same stuff — the addition and normalization — you connect everything, and finally you plug it back. So we have one extra module, and this is going to be my decoder: the decoder is like the encoder, but it has this additional module sandwiched between the other two, right?
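Here is a minimal sketch of that encoder block in PyTorch. I'm using the built-in nn.MultiheadAttention instead of the class from our notebook, and the names and sizes are my own assumptions, but the structure — self-attention, add & norm, per-element "kernel-size-one convolution", add & norm — is the one on the slide.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    def __init__(self, d, n_heads=4, d_hidden=128):   # d must be divisible by n_heads
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        # The "1D convolution with kernel size one": the same MLP applied to every element of the set.
        self.conv = nn.Sequential(nn.Linear(d, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                   # x: (batch, t, d), one row per element of the set
        a, _ = self.self_attn(x, x, x)      # self-attention: queries, keys, values all come from x
        x = self.norm1(x + a)               # residual connection, then layer normalization
        x = self.norm2(x + self.conv(x))    # same again around the per-element MLP
        return x
```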
Question: could you say more about the cross-attention? Cross-attention is exactly the self-attention, except that my keys — this x here and this guy here — are no longer x; these are now my H from the encoder, right? That's it. And this is still the set of h's, all the same. And — whatever, I can't really draw with the mouse. So it's exactly the same thing, where I just replaced the x's with the final hidden representations from the encoder. This guy here provides the values and the keys. Question: so we still use the original x's to compute the query, and then the h's to compute the rest? So here, you use the x to compute the queries — here I compute the queries — and this one lets me compute the keys and the values, okay? So how does it work? How do we train this stuff? What goes in there at the bottom? From the bottom, we get the output, this y-hat, from the previous iteration. So you get an output at the output of the system — maybe there is some additional layer missing there — and then, in an autoregressive fashion, you take this output and feed it back. There is some additional layer on top; it doesn't matter, it's not important. And then you put it back, and you autoregressively output a sequence of outputs, right? Every time you have a new input, this new input asks a different query, and the different query asks for different values from the encoder. The encoder basically summarizes the content of my input set. So, as we saw before: here we have an input set, the output of this guy is a set of hidden representations, and the decoder queries, through this q, whatever it needs from this set of representations from the encoder, okay? And we really have to go to the notebook, because otherwise we have no time left. So: import stuff, whatever. Here we have the multi-head attention. How does this multi-head attention work? In the init part we have these three matrices — wq, wk, and wv — that allow me to rotate my current input, and then we have this one that allows me to merge the heads together at the end. And how does the forward work? In the forward, you get an input x for the queries, an input x for the keys, and an input for the values. And the q, the k, and the v are simply the multiplication of the input for the specific item by these matrices, wq and so on — so this is the rotation of my x, and here you have the q, k, and v. Then you compute the scaled dot product. We can go and look at this scaled dot product, which is basically the dot product of one query against all the keys. So if we go up here — I can't even scroll, sorry, let me zoom a little, one second. Okay, so here you get the scores. First, we divide by the square root of the dimension, because, as we said before, otherwise stuff starts exploding. Then we have a matrix multiplication between one query and all the keys, right? And at the end we apply the soft argmax, such that we can compute the scores — sorry, the mixing coefficients, right?
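A minimal sketch of that scaled dot-product step (the shapes and names here are my own assumptions, not the notebook's exact code): divide by the square root of the dimension, soft argmax to get the mixing coefficients, then mix the values — which is exactly where the next part picks up.

```python
import torch

def scaled_dot_product(q, k, v):
    # q, k, v: (..., t, d) — one row per element of the set
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # each query against all keys, scaled by 1/sqrt(d)
    coeffs = torch.softmax(scores, dim=-1)        # soft argmax: rows of mixing coefficients summing to one
    return coeffs @ v, coeffs                     # linear combination of the values, plus the coefficients
```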
And then, finally, you multiply these mixing coefficients with the V matrix, and you get the final output, right? And that was pretty much it — this was the self-attention. And since you have multiple heads, we squash everything back together by using this final wh. So that was the first part, the attention. Are there questions on this attention? I mean, if you follow the slides, this is exactly the same. Okay, just let me know if you need me to go slower on this part. Then, on the bottom part: what do we need? We have the attention part — this is the self-attention, and this is multi-headed attention. So what are we trying to do here? We are going to use just the encoder to classify some sentences — reviews of movies — as being a positive review or a negative review. So I'm going to use only the encoder and train it to perform a classification task. What do we need for the encoder? The encoder, if we check the slides — if we check here — has two components, right? There is the self-attention, for which we just saw the code, and then there is this convolution: this MLP, this multi-layer perceptron applied to every element in the set. So let's figure out where this convolutional layer is. Some sanity check — hold on. This stuff is going to be online by the end of today. So there is this encoder layer, and the encoder layer has the multi-head attention plus the convolutional net. And the convolutional net, you pretty much know how it works, right? You can actually figure out, if you check the PyTorch documentation, that nn.Linear acts as a one-dimensional convolution here, so you can use nn.Linear. So this is a convolutional net: you have a convolution, the ReLU, and then the final convolution. Okay, I'm not running through it, because I think you already know how to make a convolutional net. And then we have the two layer normalizations, right? So first you have the multi-head attention — it's self-attention, so we provide x, x, and x for all the inputs. Question: I have a stupid question, but why call it a convolution at all, if it's just a linear layer? Okay, it's not stupid, it's very important. The linear layer maps some representation into some other representation, and it's applied to every component in the set, right? I have a set here — a set of inputs — and over here I have a set of representations. I have a set, and I apply the linear layer to every element separately, okay? And if you apply the same linear layer to every element in a sequence, that's a convolution, right? Okay, so each hidden representation goes through the MLP — through the kernel-size-one convolution — separately. Yeah. So in the original paper they call it a linear layer, but that's not the right name, because it's actually a convolution. Every implementation you'll see uses a linear layer, and all of them call it feed-forward, but this is a convolution — it does some broadcasting, but, again, it's a convolution. The same way with soft argmax: I don't call it softmax because, you know, that name is wrong, right? Okay, but it's a very good question. I'm almost done, I'm almost done. Okay. So this is my convolutional net, and then you have the multi-head attention and this CNN, right?
And again, this is simply a one-dimensional convolution where the kernel size is one, right? So it can be implemented with a linear layer, but I would write it here as a 1D convolution with a kernel of size one. They are actually implemented in exactly the same way, if you check the code in PyTorch, but this would be the correct name — it is a convolution, right? All right. So you have the multi-head attention, and you get the first output. Then you take this output, you sum it to the input — because we have the residual connection — and you send it through the layer normalization. Then you have an output, o1. You send this output into the convolution, you get this guy, which is again bypassed by a residual connection, and you feed it to the layer normalization. So that's the encoder — the encoder is finished. Something I didn't mention: let me go back here. I'm using this encoder to do sentence classification, right? And there is actually an order to the words. As it is, this basically acts on a bag of words. But if you actually want the order to make sense, you'd like to also send an index: the first item in the set should also carry the information that it was the first item. So you should also send information about what position each item takes, okay? So far, this encoder — this transformer, this attention — is completely permutation-equivariant, because we don't have any information about order. But if I want to classify sentences, maybe it makes sense to take into account the order of the words, because order might matter. And so we can add some kind of positional information. But again, it's not too important. All right. So I have my encoder, which just has the embeddings for the input, and then several encoder layers. Over here I just show you — oh, where did it go? Okay, okay. So this is just one encoder block. But since we are doing deep networks, you can stack multiple of these encoders, one after the other — each of these is an encoder block — to make your network more powerful. And so here you have a list over the number of layers: you append several encoders together. Then we train this stuff on the IMDB dataset, which basically gives me reviews of movies, and we have to figure out whether it was a good or a bad movie. And pretty much that's it. So we train this big guy — and I just skip the training loop, because it's exactly the same training loop we have seen so many times. You get some accuracy at the beginning of 50%, because it doesn't know any better, and then as you keep training it goes up to something like 90, 92, and then maybe we start overfitting a little bit. And this is the test accuracy, which is 83%. Something you really want to pay attention to: when you do sentence classification with an RNN, you have to feed things in multiple times, right? You send the first word in, then the second word — it's a sequential set of operations. Instead, with the attention mechanism I just showed you, there is no sequential operation: everything is computed in one go, right?
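A quick sketch of that stacked encoder before we continue, reusing the EncoderBlock sketch from before. The learned positional embedding, the vocabulary size, and the maximum length here are my own placeholder choices, not necessarily what the notebook does.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, d, n_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)        # word embeddings for the input tokens
        self.pos = nn.Embedding(max_len, d)             # positional information (one possible choice)
        self.layers = nn.ModuleList([EncoderBlock(d) for _ in range(n_layers)])

    def forward(self, tokens):                          # tokens: (batch, t) integer ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)    # adding positions breaks permutation equivariance
        for layer in self.layers:                       # stack several encoder blocks, one after the other
            x = layer(x)
        return x                                        # (batch, t, d): one hidden vector per element
```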
So in this case, you get this final H matrix — the representation of all the elements in my sentence — computed in one go, right? There is no more temporal loop, no more waiting time. It's like boom, immediately done. So this allows you to parallelize a lot, because it's just matrix multiplication — it's stupidly easy to parallelize. One more thing to pay attention to — where is it? — there is a matrix, the A matrix, this guy here, and this one is very dangerous. t is the number of titles, right? Okay, there you go: my recipe book has a thousand recipes, because it's the 1000 Alfredo recipes book. What's the size of this matrix? It's a thousand by a thousand — that's a million entries, it's huge. And so you can see clearly that if you have many, many keys, this stuff starts blowing up pretty quickly, right? So you have to pay attention to that, and there are different ways to handle it as well — for example, you can split things in half and do something — but again, implementation details. So, right now we are 10 minutes after class. I am here for you to answer every kind of question, but I think we managed to go through the notebook in an okay manner. Questions? Two questions. Yes. Question: so in one attention head, you would only use one weight matrix each for the query, the key, and the value, yes? Yeah. And then you stack all the final v's together, and you can squash them back. So at the end I may have h times d, the dimension of v, and then I can use this matrix here to squash everything back down to dimension d. This is one way of doing it, okay? Okay. Question: so multi-headed attention essentially is just using multiple weight matrices that don't share weights? Yeah — something got lost in between, but yes: multi-headed attention means you have multiple queries for the same input, right? It allows you to ask multiple questions about the same input — like, you're hungry, so one question would be: how can I make lasagna? But you know that you don't have ground beef at home, so a second question would be: hmm, can I make a vegetarian dish? Given that you're still hungry, you may have different questions in your mind. Okay. Question: and can you go to the slide with the encoder–decoder structure that used cross-attention? Yeah, sure — here. Okay. So the input to the first self-attention layer — the way you'd calculate the query in that case would be wq times xi, right? So, say, qi would be equal to that. In the self-attention, you mean, right? Yeah. So in the self-attention, all of those — hold on — the q, k, and v are coming from the y-hat, right? Instead of x, you want to replace these x's with y-hat, right? Sorry, what was that again? What is the y-hat supposed to be? The y-hat is going to be — so, whenever you train this system here, you're predicting the first word. Okay, I didn't even tell you — you're right. This system is trained to do translation. You put a sentence as input in one language: "I'm hungry." And then you have the other language, Italian for example: "ho fame." You feed "I'm hungry" in here, and this is going to be the representation of "I'm hungry" in English. And then you put "I'm hungry" here, and the first word you're going to output is "ho" in Italian.
And if you put "ho" down here, you enforce the system to put "fame" as output — "ho fame", which is "I'm hungry" in Italian. So, let me write it down. Question: okay, so it's attending over its own output of the translated version of "the cat"? Yeah. So let's say you have "a cat" in English, and then you have "un gatto" in Italian. Okay. So first you have "a cat" that goes inside the encoder, and the encoder spits out this guy here — there is one h associated with each of these inputs. Then you put this stuff inside here. At the beginning, you put a big zero here, and this thing is going to spit out "un", which is the "a". Then you put "un" down here, and this guy is going to spit out "gatto". Then you put "gatto" inside, and this guy is going to say: finished. That's it, you got to the end, right? Yes. Okay — so every time you get a different input here at the bottom, the decoder can decide to look at different components of this h-encoder. Does that make sense? Yes. Question: and what goes into the cross-attention module in that case? The cross-attention module gets the output of this wire here, which is the output of this add & norm. So the output of this add & norm goes in here, for the query of the cross-attention, and then it gets the values and the keys from the encoder — from this guy here. Okay, thanks. Okay. There were two questions, you said? Those were the two; the first was about how many weight matrices you'd use for a single head. Okay, okay. Thank you. Sure, of course. I hope it was a bit clearer than yesterday. But again, I noticed today's class was pretty dense. Question: I was just confused about what the "self" referred to and what the "cross" referred to. Yeah, yeah. Question: more questions on this example of the cat and "gatto". So you said that out of the decoder you get some representation, and you pass that back in as a y. To me, that looks like some kind of recurrence — is it not? This is called autoregressive. This is for generating text: to generate text, you generate the first output, then you feed that output back inside and you get the second guy. The encoder doesn't have any autoregressive part — the encoder just generates this h-encoder. Then the decoder generates one word at a time, in an autoregressive fashion. So the right-hand guy is a generative model, right? Question: but do you train the encoder and decoder all at once when you're training this model? Yeah. Question: so how do you train it, if it's autoregressive and one step depends on the previous step? During training you have the entire sentence; it's only during inference that it's autoregressive. Oh, so inference is autoregressive, but training is not — yeah, because you have the whole sentence. Okay. You just mask the future time steps so that you don't show them: if you're doing it for the first word, it receives only the first word, not everything else, and so on. It's called a look-ahead mask. Okay, interesting, thanks. Question: just going with your example from earlier with the cat — we were saying the q in this case would be like an Italian word, right? The second time around, on the top right? Right. And then it comes down, we feed it through, and the q goes into the cross-attention. At that point, k is going to be the key for the encoded English words. But then the value is also going to come from the English representation, right?
So how does that end up spitting out an Italian word? Good question. You stack multiple of these modules and somehow the magic happens — I don't know. The values and the keys are coming from the English side, but — okay — whenever you train this system, you end up with representations that are basically language-agnostic, I would assume. So you have English on one side and Italian on the other, but in the middle, whenever you have these kinds of embeddings, we can assume them to be language-agnostic, right? And so the query is just going to figure out: hey, this Italian word is looking for something that looks like this — which of the embeddings here match my specific question right now? Okay. So I think that could be an interpretation, I guess. You have Italian down here and English down here, and as you bubble up the encoder you remove the language specificity, and then you kind of reuse these encodings, I guess. I mean, this is similar to how it works in the encoder–decoder recurrent neural network: you have an encoder which encodes one whole sentence, and then you have a representation of that sentence that doesn't depend on the language anymore, right? And then, after the recurrent network, you used to have a decoder, which either just uses that final representation, or you can also have an attention which looks at specific time steps in the past. It's part of that neural machine translation — NMT — kind of stuff. Question: sorry, just a last question from what you just said — well, did I answer your question before? I mean, this is my guess, right: the embeddings, these h's from the encoder, strip off the language-specific information; they are just concepts, the representation of the concept without the language attached. There you go — that would be my opinion. Question: in a sense, that q is going to be an embedding in itself, and then that's going to be compared with the k? That's correct. And the q in this case comes from your target language, right? Right, okay, makes sense, thank you. I would really recommend having a look at the blog post from my friend, which is called The Illustrated Transformer. It's very, very nicely written, and it maybe has a bit more context about the language part. I tried not to have language inside this presentation, because you can use this transformer for any kind of data, right? Basically, these map sets to sets. But again, maybe this example here was very tailored to the translation part. It doesn't have to be translation: you can also have transformers for making generative models pixel by pixel, so you can actually draw things with this architecture. Question: The Illustrated Transformer, by Jay Alammar? Yeah, yeah. I really like the way he sees things, but he has all my matrices transposed, so that's bugging me. I think I'm the one who transposed the matrices — everyone else writes them horizontally. I think the math was nicer with the vertical ones. More questions. Question: you said that the encoder representations would be language-agnostic.
Doesn't that assume some sort of similarity between the languages you're translating to and from? Like, for the representations to be language-agnostic, you'd have to say there's more similarity between, say, English and French than between English and Chinese — not just with respect to the kind of data that's available, but also the grammatical structure of the two languages. So does this work as well for languages that aren't as similar? Or how much worse does it perform? Answer (Aishwarya): low-resource languages, or languages which are not very similar, are a problem for any model; transformers aren't something special that can solve it. So I don't think this attacks that problem — all classes of models have issues when the target and source languages are very different. Question: has there been work trying to bridge that gap, just out of curiosity? Yes, but it's an open problem for sure. Okay. More questions for me or for Aishwarya? Language questions for her, and content questions maybe for me. Question: can I ask again about the encoder and decoder — could we go back to that page? Yeah — thank you. So I understand the cross-attention: we have attention with the encoder hidden states. I'm a bit confused about the self-attention part. Here, self-attention only happens over the y's, because we can't see the words in the future, right? So when I'm at time t, what does the self-attention do? Does it attend over all the y's that come before time step t? So — actually, I think I've done a poor job explaining this. There are two parts to it. The first is training, and in training you have the whole sequence. But then, of course, you can't look at the future outputs, right? So the first y here at the bottom cannot look at the second y, and the second y cannot look at the third — the first one cannot look at any of the future y's, but the later y's can look at the previous y's, because you always know what you have already output, but you cannot know what you will output in the future. So whenever you train the system, you also need to mask the future information that you would otherwise be providing to the system. And so here you have the whole set going in: this first module of the decoder generates the questions, the questions come down here and retrieve the information from the encoded sentence, and this encoded sentence somehow gets converted into the other language — and this is all done in one pass, boom. When you actually do inference, instead, you basically start with the representation from the encoder and no initial value — so you get, like, a zero, maybe. And then you ask a question: what is the first word I should start with? And this one is going to tell you: oh, you should start with the translation of "a", maybe, right? So you end up with "un". And then you place this "un" down at the input. And given that the network now knows "I already output un", the next question is going to be: hold on, what's going to be my next word after I have output "un"? And this is my second question, right?
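Quick aside before the rest of the walkthrough: here is a tiny sketch of that look-ahead mask (toy size, random scores just for illustration — my own example, not the notebook's code). Positions above the diagonal are "the future", so they get minus infinity before the soft argmax and therefore zero probability mass.

```python
import torch

t = 5
scores = torch.randn(t, t)                             # query-key scores among the t output positions
future = torch.ones(t, t).triu(diagonal=1).bool()      # True above the diagonal = future positions
scores = scores.masked_fill(future, float('-inf'))     # hide the future ...
coeffs = torch.softmax(scores, dim=-1)                 # ... so each row only attends to itself and the past
```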
And the second question is going to retrieve: oh, you should talk about the cat. And the cat is this representation, and this cat representation gets converted here into the corresponding "gatto", which is cat in Italian. Then you put "gatto" back here, and the same process is going to say: oh, you reached the end of the sentence, you have the period at the end. Question: will "gatto" have attention over "un", since "gatto" is the second word? You can see everything at all the time steps before — yes, attention is over all time steps up to the current one. Okay, got it, thank you. And I would recommend watching those animations from distill.pub — they have an article about attention where they show how each word looks at specific other words. Like, if you say "I couldn't fit my trophy in my suitcase because it was too large" — I guess the trophy was too large, so it wasn't fitting. But if you say "my trophy couldn't fit in my luggage because it was too small", then "small" is now attending to the luggage, right? Because if the luggage was too small, that's why it doesn't fit. So the sentence is basically the same, and "small" and "large" are both adjectives, but one adjective will look at the trophy and the other adjective will look at the luggage, the suitcase. So again, I'd recommend checking that article, where they give you some nice visuals about how this attention looks at different parts of a sentence. And these are called something fancy — maybe Aishwarya knows these sentences, I forgot. The Winograd Schema Challenge? Yeah, there you go, Winograd schemas. Question: do you mind sending that link, or at least directing us to it? What is it called, distill.pub? Yeah, distill.pub. That's from my friend Chris, Christopher Olah. He used to be at Google Brain, now he's at OpenAI. He's basically funding this website himself; it's an online journal with very cute visualizations, right? So, I make videos and presentations; he makes interactive articles. I really recommend reading everything from there, I really like it. Yeah, I have nice friends on the internet. More questions? Comment: I think this should have been spread across two lessons, to spread out the density. Question: Aishwarya, do you have any idea about what the Reformer network does? Yeah, you can check the Reformer network in other blog posts — I forget which. It was basically made to deal with longer sequences, because the current ones are not able to deal with more than 512 tokens, for example; they do some fancy LSH attention. I don't know it off the top of my head. And the problem with those long sequences is this one, right? This one blows up. Again, there is a blog post from — I forgot her name — Lilian. Lilian has a nice blog post from two or three days ago which is called "The Transformer Family". There are some errors in the blog post, but I think it's good. More questions, or I'm going to go cook dinner. Question: sorry, could you just say one more time the title of the article from distill.pub? So, distill.pub — let me check, I actually don't know exactly. Okay, this is Jay, a super cool guy, and the article — this one, right? So this should be coming from distill.pub, if I'm not mistaken.
So the thing I was referring to is this one — these pictures from The Illustrated Transformer — and I believed they came from distill.pub, from Christopher Olah; the other website is distill.pub. Yeah, there you go. And here you have attention for sequence modeling — attention, there you go, I think it comes from here. Here it talks about hard attention and soft attention, I think. Okay, maybe I lied: I was talking about these pictures here, and I thought they were coming from distill.pub — maybe I was mistaken. These are the pictures I was talking about, where the words attend to different words. Okay: "the animal didn't cross the street because it was too tired" — and you see here that "it" attends to "animal", right? But then, "the animal didn't cross the street because it was too wide" — in this case, "wide", instead of "tired", should be attending to "street", right? It can attend to both "street" and "animal", but you have higher scores here: the scalar product has a higher score in this region, right? Do you need anything more from me? No? Nope. Okay. All right. Bye-bye. Nice seeing you, nice talking to you. So we are done now. It was quite a substantial lesson, right? So, how can you get more out of these lessons? Comprehension: if something was not clear, if I've done a poor job, just ask me anything in the comment section below. News: you can find everything about what I'm doing and what I'm teaching on Twitter, at the AlfCNZ handle. Updates: if you subscribe to this YouTube channel, you're going to get the latest videos as soon as I upload them. If you like my work and this video in particular, just press the like button. This video has an English transcript you can find on the course website, where all the titles are linked to the sections of this video. Do you speak Italian, Spanish, Mandarin, Korean, or Turkish? We have all these translations available now on the course website, so go there and check them out. If you'd like to have your own language available as well, feel free to contact me so that we can get started with the translation. And you should try to go over the PyTorch notebook we covered in this class and make yourself familiar with all the methods, classes, and all the little details: try to train this notebook, change parameters, so that you get a good understanding of what it was all about. It was quite a lot this time, so you'd better check out this notebook. And finally, if you find errors, typos, or anything you think I can do better, we can improve the content with your help if you contribute to the GitHub repository where the website is hosted. And that was pretty much it. Again, thank you so much for sticking around with us, and bye-bye.