All right, welcome back to class today. Thursday, I think, yes, 9:30 AM, New York City, live. So before starting the lesson, still some shout-outs for interesting things. OK, as you can see from the terminal, this is a very interesting utility that you can use for listening to music, perhaps. Maybe you shouldn't. Maybe you should. I don't know. Anyway, on Twitter I found another person who is following suit after YobiBite, whom we saw a couple of weeks ago. Andreas is also creating summaries of papers using Notion, right? So you have, if this thing loads up, maybe it doesn't load, there we go: the what, the why, the how, and then the end, right? So this is very relevant and important, I think, in order to get some sort of sense out of these many papers. What else? We saw something else, I think. And also, yeah, YobiBite has kept going with these papers. If we go on YobiBite's profile, we see the latest one, which is "The Role of Planning in Model-Based Deep Reinforcement Learning". Cool. Still this kind of template, which is very convenient. I also found a website which helps you read papers collaboratively, right? Not sure. Hold on, where is it? Oh, this one. In this case, you can ask questions to the author or to other peers who are reading the same paper, like "does anyone know what this means?" and so on. So this is, I think, a very nice, convenient way of reading papers together, like reading groups. And if this thing actually gets used by many people, it is going to be helpful, right? So you should check it out. I think it doesn't have enough likes right now. I think people should use this extensively in order to be able to digest more papers, because otherwise, how do you learn new things? I cannot always be next to you, by your side, to explain things. We should always be helping each other and providing our knowledge to others, because we are kind people. All right, enough talking about websites and stuff, so we can start today's lesson. So what do we talk about today? Let's go full screen. All right, so, Foundations of Deep Learning, me, OK. So today's lesson is going to be slightly more mathematical, perhaps, as in there is quite a bit of notation, and you should really try to stay up to speed, stay with me, right? If something gets lost, then you might get lost and not recover. So try to pay attention carefully, and if you don't get something, ask me to repeat. I can repeat forever, I don't mind, OK? All right, so today, finally, we talk about attention. Why do we talk about attention? Because, as we have seen over the last two years in natural language processing, attention has had such a major effect, and we got so many good results. Moreover, recently we also saw many results from applying attention, and something we're going to be learning about later called the transformer, to image data, OK? But what are these models about? What actually differentiates attention from the other techniques we have seen so far? Well, attention works on sets, sets of elements, whereas we have seen before that both convolutional and recurrent networks operate on lattices, right? Some sort of regular grid in one dimension, two dimensions, three, and so on. Regular grids.
These architectures here operate just on a bunch of vectors, which don't necessarily have to be on a grid, OK? So this is a first relaxation we are going to be seeing today. Next week, we're going to see how to learn from information and data on graphs, where a given structure is given to you, right? So these things are becoming a little bit less common knowledge, let's say; it's a little bit more niche, and it might take some effort to actually understand what's going on, OK? So just stay with me, I'll take you by the hand from the beginning to the end, OK? All right, so attention. There are two different types of attention: we have self or cross attention, and then we're also going to be seeing the distinction between hard and soft attention, OK? Anyway, we are going to be dealing with sets. What are some examples? Let's start from there. These were the slides from the Recurrent Neural Network lecture, where we saw four different mappings: sequence to vector, vector to sequence, sequence to vector to sequence, and then sequence to sequence. If you don't remember, watch the recording. So today we're going to be introducing an additional term, which is going to be "set". So we have all possible combinations, right? And I just wrote a few of them, the ones that I could actually think of; there are more, and maybe I'm just not capable of finding a specific example. Anyway, use cases. The first one is going to be image to set: for example, we're going to be mapping an image to bounding boxes, right? An example of this is the DETR paper we talked about during class. So the image is the input, and the output is this set of bounding boxes. You don't necessarily know which output corresponds to which bounding box; you can permute the output and it's still going to be the same set of bounding boxes, right? I don't believe there is necessarily an order in bounding boxes. So that's why it's an image as input mapped to a set of possible bounding boxes as output. That's the first case. The second one is going to be set to set: for example, the input is point clouds and the output is bounding boxes. Point clouds are the x, y, and z coordinates of, perhaps, a LiDAR scan. Now you want to group different points belonging to this set of points in the input into specific regions in the output, which, again, are not ordered, right? So again, sets. Sets are dealing with the fact that there is no order, right? Or, we can have sequence to sequence. So why am I talking about sequences if we are talking about sets? Well, if you have a set and you add a counter, which tells you which item comes before which other, the set becomes a sequence, right? So if you have a set of five elements, say the natural numbers from one to five: one, two, three, four, five. A set means you can swap the order and it's still going to be the same five elements, but given that those items are sortable, a set can also be considered a sequence, right? If things are sortable. So these networks that operate on sets can also operate on sequences if you provide a sorting mechanism, and these are called positional embeddings: you can add some knowledge about the location where these items are supposed to appear, in order to be able to sort them. So an example would be translation, right?
You have a sequence, which is going to be a sequence of symbols, which we're going to represent as a set: a set of symbols, but each symbol will also carry information regarding its own position, right? And so one set is going to be an ordered sequence of items for the input language, and then the other one is going to be the target language, right? Or the other example, which I already talked to you about a couple of lessons ago, three lessons ago, was the DALL·E architecture. DALL·E was that network that was generating those very pretty pictures given a textual description, right? That's crazy. So the input is going to be, again, this ordered set, which is, again, a sequence, and then the output is going to be somehow this image, right? An image has order, right? So you need to actually provide the order with which these items are generated. Perhaps I'm going to make a lesson on this DALL·E as well; the architecture is interesting, I think. It's using a discrete variational autoencoder as an image feature compressor, an image encoder, like an image compressor, I guess, yes, and then it's using this transformer we're going to be learning today to generate these images. Sorry, to generate, yeah, to generate the images we use the transformer, and the input is going to be the text, right? So the input is the text, and then we try to generate some sort of compact representation of the image, there you go, which is given to you by the discrete variational autoencoder. Then we have, for example, sequence to set. For example, you have a sequence, like a signal, a one-dimensional electrocardiogram, and then you want to find the location of some, let's say, dangerous or anomalous parts of the sequence, right? So you're going to find the extent and the location of these regions; it's like bounding boxes in one dimension, more or less. We can have image to vector, for example the Vision Transformer, and you can have sequence to vector, which is going to be this ordered set, a movie review for example, mapped to a vector. Anyway, enough motivation. I think we can start with the actual lesson, OK? Cool. So, self-attention. We're going to be talking now again about sets, right? And this is going to be the notation I use for sets: you have the vector, in this case a bold x with index i, and then you have curly brackets, right? And then you have i equal one to t. So in this case, I'm going to have these vectors, x1, x2, and so on until xt, inside curly brackets, right? So again, you can swap them, nothing changes, it's still the same set, OK? The set stays the same set when you permute the items within it. Cool. So each item here, my x_i, is going to be my input vector in n dimensions, OK? The i represents the i-th vector, not the i-th element of a vector, right? It's just a little bit of notation in order to identify the i-th vector in the set, OK? So the only equation we have in attention is the following, and it's quite easy, I think. So h is going to be my linear combination of the vectors in my set, using these coefficients alpha, OK? So each vector in the set will have a mixing coefficient: alpha one, alpha two, up to alpha t, right? Which are used to scale the amplitude of these vectors, and then you sum them all up, all of them, right? One restriction here: we actually have that these alphas are non-negative.
So we actually just scale from zero, like a scaling factor going from zero up to whatever value you want; we don't flip the vectors, right? There is no negative alpha. And we can also use some notation here and convert this set, with an arbitrary order of these vectors, into a matrix, OK? So those x's are the columns, right? You had several x's, bold lowercase x's, and I just put them together into a box, right? And so I get the capital X. So pay attention here: the capital X's height is n, OK? You have many columns; I just put them together in a box; the height is going to be n. And how many vectors do I have? Lowercase t. So the width of this big box is going to be lowercase t, whereas the height is going to be n, OK? So keep in mind these numbers, these letters. We have to figure out now... so, actually, I'm going to draw, yeah, I'll answer the question in a sec. I'm going to be drawing this box over here, OK? So this big box, repeating once again, is the collection of all these vertical vectors. Each vector x1, x2, up to xt has height n, OK? Are you with me? Yes? Put a thumb up, right? So the height of each vector is n, so the height of the box is going to be n, because those things are of height n. And then how many of them do you have? One, two, blah, blah, blah, t, right? So the width of this big box has to be t. Cool. So here you have that this linear combination of vectors is simply the matrix capital X times a, OK? Because we know, even though this semester I didn't go through it with you by hand, that a matrix-vector multiplication can be thought of as the linear combination of the columns of the matrix X, OK? So we've already seen matrix-vector multiplication. Where have we seen matrix-vector multiplication? Remind me, type in the chat. Where have we seen matrix-vector multiplication so far? Yeah, dense layer, OK. And so how do we think about dense layers? How did we say we think about dense layers? What do I call them? Yeah, linear transformation, sure, I know; what do I call them, right? We just kept repeating these two words all over that lesson, the second lesson, right? Yeah, no, squashing is the nonlinear function, right? And rotation is the other one, OK, thank you. So I've been telling you all the time, right? A neural net is simply rotation, squashing, rotation, squashing, right? Here, again, we have a matrix-vector multiplication. In this case, I won't call this a rotation, because what would be rotating? What would we be rotating in this case? The vector a? Doesn't make sense. Also, in the rotation-squashing, rotation-squashing part, which matrix was doing the rotation? The weights, right? The rotation matrix was the weights matrix, OK? And so the network weights, which we are learning, are doing the rotation of my input vector that comes inside the network, right? In this case, it looks very odd: we have this X, which is the stack of all the inputs, rotating this a vector, which we don't even know what it is, right? So we don't think about this matrix-vector multiplication in terms of rotation, because it doesn't make any sense in this case. Instead, we're going to be thinking about it in this way, right?
So whenever you have a matrix-vector multiplication, the vector has as many components as the columns of the matrix, right? Because whenever you do matrix times vector, it's going to be the first column times the first item, plus the second column times the second item, plus the third column times the third item, and so on up to the last column times the last item, right? So this is matrix-vector multiplication. If it's not familiar, just review some Gilbert Strang, the first chapter of Introduction to Linear Algebra. I hope it's fine. All right. So I'm just going to write here on the top right the compact version, right? So my hidden layer is going to be the linear combination of the columns of X, which are the items in the set, right? So h, the hidden layer, is going to be the linear combination of the items in the set, weighted by the coefficients in this a vector, which I call alphas, OK? Once again: my hidden representation is the linear combination of the vectors in my set. I have my set, vector one, two, and so on; I just do the linear combination of these vectors, scaled by the coefficients that are in this a. So regardless of the number of vectors, h is going to be what size? What is the size of h? If h is the linear combination of vectors of size n, h is going to be of size n. OK, very good. All right, cool. OK, I hope it's clear. So in soft attention, we simply enforce that the items in a sum to one, right? And so basically it's like a probability: if all items are greater than or equal to zero and the sum has to be one, then they are basically a probability, an actual probability distribution, right? In hard attention, instead, you just have that one item is equal to one. So it's like one-hot encoding, right? So in hard attention, the a is like a one-hot, a deterministic probability mass, right? And in soft attention, instead, it's going to be a probability mass spread across the t elements, right? So the size of a is t, because there are t vectors you need to scale, right? I hope it's fine. Moving forward, otherwise we don't even finish this lesson. OK, this was the first slide; we have two more slides, that's it. But you have to be very confident about this n, the size of the vectors, and t, the number of items in the set, such that the size of a also has to be t, because you have one coefficient per vector. I hope it's fine. Self-attention, slide number two. What is this a? a is going to be the soft argmax, or just the argmax, of this thing here. What is this thing here? So we have capital X, which was this box of vertical vectors, transposed, right? So capital X transpose is going to be these horizontal vectors, right? Multiplied by my x. What is x? So: capital X versus lowercase x, right? Capital X is this collection of columns; if you flip it, it's going to be a collection of rows. Lowercase bold x is a given item in the set, OK? The same as before, when I was calling h this capital X times a, for a given a. So this is going to be the generic x, OK? One of the items in the set; I call it x, and I don't put the index i because I don't want to specify which one. It's just one, I don't care which one. OK. So what is the dimension of this a? How many rows does capital X transpose have? Capital X had t columns, right? If I flip it, you're going to have t rows, right?
Then you multiply: first row times this column, and so on; you have t rows times one column, so whatever comes out is going to have t entries eventually, right? And what is each multiplication telling you? What is the first row times the column? It's going to be the projection of the first item onto the given x, right? Then the projection of the second item onto the given x, the projection of the third item onto the given x, and so on. So we basically compute a vector which contains all these projections. What are the projections? Projections tell you how aligned two vectors are, basically. Cool. And then we send this through an argmax, which is going to tell you which is the highest match, basically, or a soft argmax, which is going to give you a probability distribution across these scores, right? Done, finished. That's it, there is not much more. So what are the square brackets? They mean it's optional: you can choose to do the soft argmax to get the probability, or the argmax to get the one-hot, right? And so let me recap things. I had t vectors x_i, right? This implies that I will have t attention vectors a, t sets of mixing coefficients, right? If I have t items in my set, I will eventually end up having t attention vectors. You're following, yes? And so we can put this set of attention vectors together into a box, right? Each attention vector has t elements, because each attention vector scales those t columns, right? And then how many of them do I have? Well, I have t, because I have t x's, right? Cool. So given that I have a set of attention vectors, I will have t hidden representations, which I can also combine into this big block, capital H, right? Cool. So finally, I can simply write that capital H, which is this big matrix of hidden representations, is going to be capital X times capital A. What does capital X capital A mean? It means: X times the first column of A gives the first h; X times the second column of A gives the second h; and so on, right? So if you stack multiple vectors in A — the first vector of A, second, third — you're going to have the corresponding outputs stacked up after you complete the matrix multiplication. What is the size of H? Well, each item in H was a linear combination of the items in my input set, which were of size n, so the height of H must be n, right? And what is the width? Well, we had t items in the set, so we're going to have t hidden representations, one hidden representation per input vector. If you swap the x's, you just swap the h's, right? There is no connection, there is no information in the order, OK? Cool, so far everyone is following; other questions? "Can we say A is like a self-correlation matrix?" Yeah, in this case, yes, of course it is. We are going to be learning now how to make things a little bit more interesting. This was the foundational part: these first two slides are here to make you understand the difference between n, which is the size of the inputs — or the size of any combination of the inputs, because, again, if you take a linear combination of vectors, you get the same size — and t, which is the number of vectors you have, OK?
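To make the n-versus-t bookkeeping concrete, here is a minimal PyTorch sketch of exactly these two slides — not the course notebook, just an illustration with a toy set of t random vectors stacked as the columns of X, showing hard attention (a one-hot a), soft attention (a probability vector a), and the self-attention scores Xᵀx sent through a soft argmax (what PyTorch calls softmax). The sizes n = 4 and t = 5 are arbitrary choices.

```python
import torch

n, t = 4, 5                        # n: size of each vector, t: number of items in the set
X = torch.randn(n, t)              # the set, stacked as columns: X is n × t

# Hard attention: a is one-hot, so h just picks out one column of X.
a_hard = torch.nn.functional.one_hot(torch.tensor(2), num_classes=t).float()
h_hard = X @ a_hard                # size n

# Soft attention: a is non-negative and sums to one, so h is a convex mixture of the columns.
a_soft = torch.softmax(torch.randn(t), dim=0)
h_soft = X @ a_soft                # still size n

# Self-attention: for every item, the scores are its projections onto all items, Xᵀ x,
# sent through the soft argmax; stacking the t attention vectors gives A, and H = X A.
A = torch.softmax(X.T @ X, dim=0)          # t × t, one attention (column) vector per item
H = X @ A                                  # n × t, one hidden representation per item
print(h_soft.shape, A.shape, H.shape)      # torch.Size([4]) torch.Size([5, 5]) torch.Size([4, 5])
```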
So these are the two main things we have here: n and t, right? And then how is h computed? It's the linear combination of the columns, of all the items in the set, and a is given to you by this argmax or soft argmax — the one-hot or the probability. Cool. Now there's the major thing, which is called a key-value store. What is this stuff? It's a paradigm for storing or saving, retrieving or querying, and managing an associative array, dictionary, or hash table. What does this mean? This means that we're going to take these concepts and convert them in terms of neural nets, in order to store, save, retrieve, query, and manage the information within a neural net, right? So we're basically going to create this associative array, this hash table, where we put information and can take information out when we want. So we start with queries, keys, and values. So, back to this matrix multiplication, where finally the matrix is the weight matrix, right? So what is q? Question to the people at home: how am I going to describe q using my jargon? q is — finish my sentence — the rotation of x. Thank you, very good. q stands for query. k, the key, is going to be again a rotation of my input x. v, the value, is going to be again a rotation of the input x. What is x? Do we remember? x is the generic item in the set, right? You pick one, that's your x, lowercase pink bold x. Cool. So since the query, the key, and the value all come from the same x, this is called self-attention, OK? In the next slide we're going to see what a non-self, cross-attention looks like. For now, let's just focus on this self-attention. Everything we're going to see in this slide is exactly what we saw before, but we're going to slightly change the content of that soft argmax we saw before. So we can assume — we have to assume — that query and key have the same dimension; let's call it d prime. Why is that? Because we're going to be checking one query, let's say my question, against all possible keys, OK? Which are the things I can look up in my dictionary, for example, or in my recipe book, OK? The titles, for example. So the query, my question, has to match the size of my keys, because otherwise I cannot compare them, OK? Let's say it this way. On the other side, my v, the value, has a totally arbitrary dimension, OK? So v is going to be the content, let's say the content of a recipe, how to make pizza, OK? The names of the pizzas I want to make, like Margherita, Diavola, or a calzone, or something like that — those are the titles of the recipes, and those are the keys, OK? Those are the same-size keys, the titles in my recipe book. v, instead, is going to be the content of the recipe; it's going to be very long, with images, explanations, and whatever. The query is going to be "how to make pizza Margherita", no? Or "how to make pizza Quattro Stagioni", something like that, OK? So whenever I have a question, how to make something, I'm going to check all the titles in my recipe book — those are the keys — I check which one matches, and then I want to retrieve the value, which is the content of the recipe, OK? How do we do that? Well, we'll see that in a second.
To simplify the notation, so I can remove all these primes and double primes, I'll just say that queries, keys, and values all have the same dimension d — which doesn't have to be true, right? For the value, just imagine it's a possibly very large recipe, whereas the query and the key are simply the question and the title, right? So this is just an assumption to make the notation simpler here. So how many items did we have? We have t items in my set: x1, x2, x3, up to t items. This implies that we will have t queries, t keys, and t values, right? Because each query comes from the rotation of each item in the input set, each key comes from the rotation of each item in the input set, and each value comes from the rotation of each item in the input set, right? So if we have a set of inputs, we're going to have a set of queries, a set of keys, and a set of values, which I can also draw as boxes: a big box for Q, a big box for K, and a big box for V, right? Cool. So, finally, the last question of the day: what is my attention vector? My attention vector is going to be this soft argmax, or argmax, of capital K transpose times q. What does it mean? Given a query q — "how to make pizza boscaiola" — I have to check all the keys, right? So I have my vector here, my query, "how to make pizza Margherita", whatever, and then I check it against pizza Margherita, pizza boscaiola, pizza Quattro Stagioni, pizza quattro formaggi, all the different types of pizza, OK? One query against all possible titles, right? So you're checking one question against all your possibilities. How many possibilities do you have? Remind me: how many rows are there in K transpose? t, right? Very good, because we had t columns, so when we flip it we get t rows. I multiply t rows times one vector, and that's why they have to be the same dimension: if my keys are of dimension d prime, my q also has to be d prime, otherwise I cannot multiply this stuff, right? So I have t of these rows, multiplied by one vector. What is the final size of my attention vector? t — because it's also written on the screen, right? Of course: we're going to have one score per key, given a specific query, OK? I repeat: the a, the attention vector, is going to have t items, one, two, three, up to t, and each item will basically be a score, corresponding to a given query, one score per key in my recipe book, OK? So my hidden representation h is going to be the weighted sum of the values, of the items in the V set, weighted by the coefficients in a. It doesn't matter what size the items in V have; you just need as many items in V as there are elements in a, right? Cool. Remember we have a beta, an inverse temperature: we're going to set it to one over the square root of the dimension d. Why is that? So that the temperature doesn't change as you change the dimension, OK? Think about it: what is the length of a vector in two dimensions with both components equal to one? It's the square root of two. What is the length of a vector in three dimensions with all components set to one? It's the square root of three, right? And so the larger the dimension, the longer the expected length of the vector becomes. So we divide by the square root of d such that things still have roughly the same length regardless of the dimension, OK? A technicality; we don't care too much.
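As a quick numerical check of why β is set to 1/√d (this is not from the course notebook, just a small experiment you can run yourself): the dot product of two random d-dimensional vectors has a typical magnitude that grows like √d, so without the scaling the soft argmax would get sharper simply because the dimension is larger.

```python
import torch

torch.manual_seed(0)
for d in (4, 64, 1024):
    q = torch.randn(10_000, d)            # 10,000 random queries of dimension d
    k = torch.randn(10_000, d)            # 10,000 random keys of dimension d
    scores = (q * k).sum(dim=1)           # the raw dot products k · q
    print(d, scores.std().item(), (scores / d ** 0.5).std().item())
# The raw scores spread out roughly like sqrt(d); the scaled scores stay around 1 for every d.
```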
Now, how many queries, how many questions do I have? Well, we have t items in my set, so you're going to have t questions, right? This implies I will have t attention vectors, which implies I'm going to have a matrix, capital A, which is the combination of these columns, right? The box, which is t by t. Why is it t by t? Well, we said that each attention vector has t components, because you're going to use each component to weigh each item in my value set, right? But then, how many of these vectors do I have? Well, I have one vector of t components per item in my set, right? So you have t components and t items: it's going to be a square matrix, t by t. Cool. So, final equation, we just use all capital letters: capital H is going to be capital V times capital A, which means the columns of this hidden matrix are linear combinations of the columns of the V matrix, and the linear combination coefficients are stored in this A matrix, which comes out of this scoring — or whatever you want to call it — between one query and each key, OK? OK, I hope it's clear. If it's clear, we move on to one last tweak to this presentation. Let me clear up first — actually, let me write this on the top right, just as a reminder: we had the queries, which was a set of q's, a set of questions; the height of each of them is d, and you have t of them, right? So let me clear up the screen, and we're going to talk now about replacing the input — replacing the x — for the keys and replacing the x for the values. What is this? This is called cross-attention. If before I was just thinking about how to make different types of pizza — the plural of pizza is pizze, not pizzas, but anyway — so if before I was thinking about how to make different pizzas in my own mind, so all the information was coming from x, from my own brain, now I'm actually calling mom: ξ, xi, the other Greek letter. "Mom, how do I make this pizza?" And she tells me, right? So my question comes from me, from lowercase x, but I call mom, which is going to be ξ. Mom has all the keys, no? Of course, mom always has all the keys and all the recipes. That's the only difference, right? How many questions do I have? Still t. But how many answers does my mom have? What is the knowledge of a mom? Much larger than mine, right? And so the ξ's go from j equal one to tau, right? And tau, in this example, is much larger than my t — for cooking, for sure. Anyway, given that mom has a set of tau ξ's, it will turn out that mom will have tau keys and tau values, right? And so for these matrices K and V, the height is still going to be d, right? The key has to match my question, so it still has to be d here, but the width is going to be much, much larger, because mom is very knowledgeable. You're following, right? All right, so let's look at the equations, which are just slightly different. We have that the attention vector now has tau components. Why does it have tau components? Simply because capital K now has many, many more rows: it has tau rows, right? And so you end up with tau mixing coefficients for the values that mom has, for the recipes that mom has. So mom has this huge repository of recipes, and each recipe is going to be weighted by a coefficient in this a vector. How many coefficients do you have? One per recipe. How many recipes are there?
Tau, OK? What is the final size of h? d — the one dimension which, again, for simplification, I just set for everything. How many questions do I have? Oh, well, those come from me; I have few questions, no? So I have t questions. And so eventually, in my final A matrix, we have t questions, so t columns, but each column will need to have tau elements, right? Because each of these coefficients will be used to mix those tau recipes my mom has. So, final equation: no big difference. You still have this H, which is the outcome of asking t questions, by mixing these d-dimensional values in my mom's recipe book or whatever. And so eventually you still get d times t, right? t columns of size d, cool. Any question about how to make pizza? Is it clear so far? I hope so. I mean, I really try my best to explain things in a palatable way. OK, you appreciate the pizza example very much; I'm glad. Questions so far, before I move on to some more details? "Where do you get ξ?" I call my mom on the phone, in the analogy. In practice, you have two sets of inputs, OK? x was one set of inputs; ξ is another set of inputs, with a different number of items, right? I have t x's in my set here, which generate the questions later on, and then here I have tau ξ's, which are used for generating the keys and the values. If I do attention within myself, it's called self-attention: I just think in my own brain about what I can do. If I want help and I call home, I do cross-attention, because I'm asking my questions of a different set, my mom's, OK? I hope it's clear. So you have two different sets of inputs, OK? Very good. All right, moving on. "Can I ask more questions?" Yes. So, implementation. In this case, I'm talking about self-attention, so I just use one variable for ease of notation. So here I can think about stacking all my queries, keys, and values into one big vector, such that I can compute this big vector by having this big matrix here, which is going to be the stack of W_q, W_k, W_v, times my x, right? So you have the stack of three matrices here, times my x: you're going to get the stack of the three outputs, right? You can stack things in both directions: you can stack multiple x's this way, and you can stack multiple matrices this way, right? We've seen this already before: when we were talking about recurrent neural nets, we were stacking multiple x's, and there we were actually stacking them horizontally. So we've already seen this kind of stacking operation, right? So we are going to consider now h heads. For example, we can get three heads of dimension d, right? What is this? In this case, I can stack multiple matrices, even more of them, and basically I'm going to be generating multiple queries given the same item in the set. So far we have seen that there was one question per item in the set; in this case, if I have multiple query matrices, I'm going to generate multiple questions, multiple queries, per given item in the set, right? No big deal. So I have multiple questions given an item in the set, I'm going to have multiple keys per item in the set — of course, because I'm going to check all those keys anyway for a given question — and then I have, possibly, multiple values for a given x, right? So that's it, no big deal: you can have multiple questions, keys, and values per given item, right?
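Putting the last few slides together, here is a hedged sketch — not the notebook implementation — of a single head of query–key–value attention, with random matrices standing in for the learned rotations W_q, W_k, W_v; pass the same set twice for self-attention, or a second set ξ for cross-attention. Multi-head attention, discussed next, just runs several copies of this and adds a final projection.

```python
import torch

def qkv_attention(X, Xi, d=16):
    """Single-head attention: queries come from X (n × t),
    keys and values come from Xi (n × tau). Self-attention is Xi = X."""
    n, t = X.shape
    _, tau = Xi.shape
    Wq, Wk, Wv = (torch.randn(d, n) for _ in range(3))    # stand-ins for learned rotations
    Q = Wq @ X                       # d × t    one query per item in my set
    K = Wk @ Xi                      # d × tau  one key per item in the other set
    V = Wv @ Xi                      # d × tau  one value per item in the other set
    A = torch.softmax(K.T @ Q / d ** 0.5, dim=0)          # tau × t, one score per key, per query
    return V @ A                     # d × t    one mixed value per query

X  = torch.randn(8, 5)               # my own set:  t = 5 questions
Xi = torch.randn(8, 12)              # mom's set:   tau = 12 recipes
print(qkv_attention(X, X).shape)     # self-attention:  torch.Size([16, 5])
print(qkv_attention(X, Xi).shape)    # cross-attention: torch.Size([16, 5])
```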
And then we finally use a final rotation matrix: we compress this h times d back down to d dimensions, right? To go back to whatever dimension we wanted. All right, so in the last ten minutes, we're going to look at how this is actually used to do something useful, OK? We're going to be talking about the transformer architecture. So, the transformer architecture. Last time I mentioned it, I said it was an encoder-decoder architecture. It is not: it's an encoder-predictor-decoder architecture, used for neural machine translation. So let's figure out what this stuff is. We already saw in a previous lab how these networks — the model, the architecture, the combination of modules for the latent-variable energy-based model — are combined, right? We have an x, which is our conditional variable, which goes inside a predictor; that's the shaded pink x. We have a shaded blue y, which is my target, that goes inside the energy box. And then on top I have a decoder, which is fed with the latent and with the output of the predictor, right? Which produces this y tilde, which I try to get close to the y, right? So the E in the box is a spring, and it's a spring that tries to pull the y tilde close to y. Cool. So, predictor-decoder: the predictor is fed with an x, right, to come into y space. The predictor is necessary to move from one space to the other. All right, then we also saw, a few lectures ago, the autoencoder, right? The autoencoder doesn't have the conditional variable anymore; it only has the target variable, right? We just learn the structure in these targets. And so the y goes inside an encoder, which gives me a hidden representation — it's hidden because it's inside the model — then I have a decoder, which comes back to the original y space, giving y tilde; tilde means "more or less", circa. And then there's a spring between the y on the bottom and the y tilde on the top, such that those are close together. Finally, we introduce the transformer. So let me clear up the screen; we have basically the same items so far. So, first difference, OK? So far everything is the same, no big deal. First difference: we have this module over here. What is this module over here? This is something that comes from digital signal processing; this is a z to the minus one. Since we use z for the latent variable, I use zeta, the Greek letter — same stuff, doesn't matter. So what does it do? This module is a unit delay. So, what is a discrete-time signal? It's a signal that has a representation at discrete points, at a time index; there's no time in seconds, it's a time index: one, two, three, four, and so on. So what does the zeta to the minus one do? What is this unit delay? OK, no big deal: you have a sequence y with index j, no? And after the module, you're going to have index j minus one. So you delayed the sequence by one step. It's just a delay, a one-unit delay. You could call it delta t, but there is no t, there is no time; there are units, there are indices, OK? That's the difference between discrete-time signals and continuous-time signals. Anyway, unit delay. Afterwards, what do we have? Well, we have an encoder, right? So, similarly to the autoencoder — it's very similar to the autoencoder — but in this case we have this delay module preceding the encoder, OK? That's the only difference. And this is the default when you perform language modelling, right? You want to be able to produce the future given the past.
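A tiny sketch of the ζ⁻¹ unit delay in code (hypothetical token values, and assuming index 0 stands for a start-of-sequence token): the delayed sequence is what the target-side encoder sees, while the spring pulls the predictions ỹ towards the non-delayed y.

```python
import torch

y = torch.tensor([7, 2, 9, 4, 1])        # a toy target sequence y1 … y_tau (made-up token ids)
bos = torch.tensor([0])                   # assume id 0 marks the start of the sequence
y_delayed = torch.cat([bos, y[:-1]])      # tensor([0, 7, 2, 9, 4]) — fed to the target-side encoder
y_target = y                              # tensor([7, 2, 9, 4, 1]) — what y tilde is pulled towards
```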
So you want to delay the input by, for example, one unit here. Then, on the other side, we have our observation. What is our observation? It's whatever we provide to the system during training and evaluation; the y is only provided to you during training. This pink shaded, observed x goes inside an encoder. And so what do we do with this encoded x and this encoded delayed y? We feed them both through a predictor, which is going to give me the hidden representation h used to generate the circa-y, the y tilde, which is the no-longer-delayed version, right? Because there is a spring between the y tilde and the y, right? So there's a spring between those two, and the input to the encoder on the bottom right-hand side here is the delayed version of my input, right? So I provide my predictor with a delayed version of the input. So this looks exactly like a denoising autoencoder where the noise is a delay, a unit-delay module, OK? Otherwise it's the same as a denoising autoencoder. And it's not only a denoising autoencoder, it also has an additional input, right? So it's a conditional, delayed, denoising autoencoder. A delayed denoising autoencoder is also called a language model, right? So this is a conditional — because I have another input x — conditional language model, or conditional delayed denoising autoencoder, right? Cool — oh, not denoising, delaying, OK? Whatever you want to call it. Anyway, let me finish this. This is the transformer architecture, and in the paper, unfortunately, the encoder they do call an encoder, but the overall block here — decoder, predictor, and delayed encoder — they call the decoder. And I'm like, no, it's wrong, OK? So this is a wrongly-named "decoder": if you read the paper, you're going to see that this wrongly-named decoder is actually the collection of an encoder, a predictor, and a decoder. It does not make sense. That's why I wanted to clarify this today, because otherwise I get upset that I cannot understand what's going on. I hope it's clear. "What are X, Y, and Z?" Oh, there's no Z here. X is the source sentence, Y is the target sentence, and Y tilde is the predicted sentence. OK, cool. So, in this case, the transformer encoder is made of several items. The first one is going to be the self-attention, and then we're going to have a convolution with a kernel size of one, OK? Which means it doesn't look at items that are nearby; it just looks at the given item (there's a small sketch of this equivalence just below), and then you have this thing moving along across the signal. Anyway, after the self-attention we're going to have a residual connection and a normalization, and you also add one after this convolutional layer. So let's look inside this block, right? Here we have this add & norm: we have this residual addition, and then we have this layer normalization. So the input is provided to the self-attention and is then summed back to the output of its own self-attention. And similarly, the hidden representation that comes out of this self-attention module gets a one-dimensional convolution, and then we also have this residual addition and normalization. In this case, I'm going to be providing my x, right? So, as you can tell, if it's an x, then this stuff cannot be called an encoder; it's going to be our predictor, right? And on the other side, we're going to get a set of hidden representations, right? How many items do we have at the input? t items: one, two, three, four, until t. Similarly, we're going to have t hidden representations, one, two, three, until t, going from i equal one to t.
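Here is the small sketch promised above: a 1-d convolution with kernel size one has a receptive field of a single item, so it is the same per-item linear map applied at every position — you can verify the equivalence against a plain linear layer sharing the same weights (the sizes here are arbitrary).

```python
import torch
import torch.nn as nn

d, t = 8, 5
x = torch.randn(1, d, t)                           # (batch, channels = d, positions = t)

conv = nn.Conv1d(d, d, kernel_size=1)              # looks at one item at a time
lin = nn.Linear(d, d)
lin.weight.data = conv.weight.data.squeeze(-1)     # reuse exactly the same weights…
lin.bias.data = conv.bias.data                     # …and the same bias

out_conv = conv(x)                                 # (1, d, t)
out_lin = lin(x.transpose(1, 2)).transpose(1, 2)   # apply the linear map per position
print(torch.allclose(out_conv, out_lin, atol=1e-6))  # True: identical transformations
```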
These are the source: the source input, the source sentence, the source hidden representations. Then, when I introduce the wrongly-named decoder, we're going to have the following. Here we're adding the "decoder", in quotation marks. Here we have a cross-attention, which is connected after the self-attention, and which gets the hidden representations of the source sentence, right? Then, still, normalization and residual connections. So, as you can tell now, yeah, this is the wrongly-named decoder, right? We can train this by providing the delayed target sequence: you can see j goes from zero to tau minus one, whereas the output is going to be the hidden representations going from j equal one to tau, right? So it's a shift by one. And these are used later to produce those y tildes, j equal one to tau, right? So, source: i from one to t; target: j from one to tau. Different indices i and j, different lengths t and tau — they don't have to be the same length. How do we generate outputs during inference? Well, during inference, I just produce the first output here, then I feed it back in here, I generate a second output, and so on. So during inference, I'm going to be using an autoregressive technique to generate one item at a time. But the cool part is that during training, I can compute all the outputs at the same time, right? Before, with recurrent neural networks, we had to do back-propagation through time: you input one, two, three, four, five, six, seven samples, and so on, and then you do back-propagation, step by step, through time, right? And this stuff takes forever. It also has the issue of a temporal dependency: you always have to go back and forth through the whole sequence, so it's not really parallelizable, right? Moreover, the hidden representation is updated sequentially, so the hidden representation over here, at the end, was very far from the one at the beginning, and we had problems with long-term dependencies. How did we solve that? We had to introduce gating mechanisms, which selectively remember or forget information, and if it's a very long sequence, maybe you don't even get enough gradient coming back. In this case, instead, everything is processed in parallel: you have the inputs, you compute some transformation of all the inputs, and boom, you get the outputs. Here you have the final targets, you measure the distance, and boom, you perform backprop. There is no temporal information anymore: these items, you can swap them, and nothing changes; each item can look at every other item. That's why it's called permutation equivariant, right? You can change the order of the items, nothing changes. But then, how do we tell the model that this is a sequence and not a set? Well, you have to provide additional information to each input, which basically says: oh, this is the first input, this is the second input, this is the third input, this is the fourth input. So even if you shuffle the items, the network still knows which one is the first item, which is the second, the third, the fourth, the fifth, the sixth, the seventh, and whatever, right? The network doesn't know about the order, because there is no order in the attention model; therefore, you need to provide the order as additional information to the network, OK?
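One simple way to provide that order information — a sketch with a learned embedding table, which may be close to what the course notebook's position embeddings do; sinusoidal encodings are the other common choice, and the sizes below are arbitrary:

```python
import torch
import torch.nn as nn

d, max_len = 8, 50
pos_emb = nn.Embedding(max_len, d)           # one learnable vector per position

x = torch.randn(5, d)                        # 5 input items of dimension d, read as a sequence
positions = torch.arange(5)                  # 0, 1, 2, 3, 4
x_with_order = x + pos_emb(positions)        # shuffling the rows now actually changes the input
```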
What is the correct terminology? As you can tell, we already saw what the encoder is, and this is the encoder, right? The self-attention is the encoder part. What is the left-hand side? It's the predictor, right, which is mixing the hidden representation of the source and the hidden representation of the target — the delayed target, right? What is the last module? Of course, the last module is going to be the decoder. So this is the model of the transformer, which has: an encoder, a delayed encoder, a predictor, and a decoder, right? Cool. This notation is coming from the Z-transform, OK? So this is digital signal processing: you have the Fourier transform to see the frequency domain, and then you have the Z-transform, an expansion in the complex plane, in order to have convergence for functions, transforms of functions, that cannot have a Fourier transform, OK? So this is digital signal processing; I'm an electrical engineer who forgot a lot of things, so I spent a week going back over this stuff — but again, it's just for one symbol convention. Finally, let me exit this, and I'm going to show you the notebook that is used to use this network — and I'm just using the encoder, right? So we go into work, GitHub, PDL; conda activate PDL; jupyter notebook, OK? So this is going to be the transformer notebook. All right, so the main parts are going to be the following. We have a — yeah, I call it soft argmax, that thing people call softmax, because that's the wrong name. So, multi-head attention: how does it work? This multi-head attention is the following. Here I have the number of heads — as I showed you before, we can have multiple queries, multiple questions, per given item in the set — and then I have this d_model, which is just the size d. So here we're going to have these three matrices, the W_q, W_k, and W_v, which are the rotation matrices, mapping my x — the x for the query, the x for the key, and the x for the value — to the d_model, right? The internal dimension d. And then I have my final W_h, which was the one mapping the output back down to whatever dimension we want. So this is going to be the initialization with all the weights. Here we have the scaled dot product, which is computing, first of all, this division coming from beta, which allows us to have things of the same length in whatever dimension, and then here we're going to have the scores, which was the matrix multiplication of my K transpose times the Q, right? Remember that we had the matrix K transpose times the vector q — and we have multiple q's. Then we have this soft argmax, right, to give you the probabilities. And so, finally, you have that the H, which is the set of columns, is the V matrix multiplied by these a vectors, right? So the V matrix times the first a gives you the first h, then V times the second vector in A gives you the second item in H, and so on, right? And so this module here is simply implementing that query, key, and value mechanism we saw before, OK? So how does the forward method work? We have the capital Q, which is going to be the rotation of my x_q. Why do I call these x_q, x_k, x_v? Because before we called this just x, and also those two can be x if we have self-attention; if you're doing cross-attention, those two over here are going to be my ξ's instead, right? When I call my mom on the phone.
Anyway, that's why they have three different names, right? But potentially the two on the right-hand side should be the same; I just put explicit names so that we can have full flexibility. So we have the Q: my Q matrix is going to be the rotation of this x for the query. Then the capital K is going to be the rotation of the ξ — or of myself, in the self-attention case. And then the V is going to be, again, the rotation of those same guys over here. I perform this scaled dot product, which was that one-over-square-root-of-d scaling of this matrix-vector multiplication, right? Checking each key against one query — and then I have multiple queries. And then I do this linear combination of the vectors in V based on the coefficients I have in this A matrix. And again, I have as many items in A as there are elements in my set, t, right? Because I have t questions, I'm going to have t final mixed values, right? Cool. Here I'm grouping heads — doesn't matter — and then I get the final version by changing the dimensionality, given that we had multiple heads, OK? Here are some tests that are just checking that the sizes are correct, and the same here. And so, what are we doing here? We have to create a convolutional net, right? In the transformer, we have seen that there are two items inside: we had the attention module — self or cross or whatever — and then, on top, we had this convolutional item, which applies a transformation to each item. It doesn't have a receptive field — or rather, the receptive field is one — it just checks the given item. And so, simply, we have two linear layers, because, again, it acts on a single item, and then we have a non-linear function in between. So my convolutional layer that operates on each sample is, in this case, just using a linear layer, but again, this linear layer is going to be the same as a convolution with kernel size one, a window of one. And so the x is going to be the output of the first convolution, we send it through the non-linearity, and then we have another rotation, right? So: rotation, squashing, rotation. And so the final encoder layer is going to be this sandwich of these two items, right? We had the multi-head attention and then the convolutional net, and then we had — remember, right? — the residual connection and the layer normalization. So how does the forward look? It's pretty much self-explanatory, right? For the self-attention here, we call the multi-head attention by sending my x three times, right? x for the queries, x for the keys, and x for the values; they are all x, so we have self-attention. Then the output of this self-attention is summed with the input, as we have seen on the slide before, and we send this residual sum through the layer normalization, as you can see. Then there was the second part: we send this output through the convolutional network, and then, again, we sum the input of the convolutional network to its output — this output-one was the output of the first module, so it's the input to the convolutional network, summed to the output of the convolutional network — and then we send this through the layer normalization. And this is how you can write, in five or ten lines, basically, a transformer encoder layer, right? Here we create some embeddings for the positions, but we don't really care for now.
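For reference, a compressed sketch of the encoder layer just described — self-attention, then the kernel-size-one "convolution" (two linear layers), each wrapped in a residual connection and layer normalisation. This is not the course notebook's class; it leans on PyTorch's built-in nn.MultiheadAttention to stay short, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d=64, num_heads=4, d_hidden=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.ffn = nn.Sequential(              # the "convolution" with kernel size one:
            nn.Linear(d, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                       # x: (batch, t, d)
        a, _ = self.attn(x, x, x)               # x three times → self-attention
        x = self.norm1(x + a)                   # add (residual) & norm
        x = self.norm2(x + self.ffn(x))         # per-item feed-forward, add & norm
        return x

h = EncoderLayer()(torch.randn(2, 5, 64))       # two sequences of five items → (2, 5, 64)
print(h.shape)
```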
And then the final encoder — what was the difference here? So that was the encoder layer; the final encoder may have multiple layers, right? So for whatever number of layers you choose, we're going to be appending that many encoder layers, right? And then we also want these embeddings that give us the information about the positions where all these items are, OK? And that's it. Then, here, we train this to do some movie-review classification, OK? So there is a classifier in which we have the encoder and then a linear layer, and we're going to use a final cross-entropy to train this, OK? So we have the optimizer and so on, right? And that's it: we run the forward pass, we compute the loss, which is a cross-entropy, we zero the gradients, we compute the partial derivatives, and we step in the opposite direction of the gradient. And that's pretty much it for today. All right. Whoops, we went a little bit over time. I hope it was clear. If you left because it's definitely late, you have the recording. Let me know if there are any questions. I will update these slides, because I realised the notation was not completely uniform with the previous lessons. Let me know if there are questions, and otherwise I'll see you next week, when we're going to be talking about latent-variable energy-based models for speech recognition, with an invited speaker, OK? It's going to be very fun, very cool, I think. All right, thanks for your attention. Sorry for going over time. I hope it was clear. These transformers and attention models are very popular and very effective, and have been giving us breakthrough results in recent times, right? So pay attention, rewatch the lesson, check the ending if you already left. Thanks for being with me. Have a nice end of the week. Bye-bye.