All right, good morning! It's 9:30 in New York City, live. Welcome to class. I forget which lab this is; I don't count anymore, and it doesn't matter. So, before starting today's lesson: you perhaps already know that I try to teach not only you but everyone, in particular by providing my service for free on Twitter. And I also learn from Twitter, right? It's a mutual learning experience, and I think it's a really powerful tool that people in academia can use to get more information and knowledge out of the Internet. So let me actually start by sharing something I would love to see more often, especially from my students: how to take notes on papers. There are more than a hundred new papers coming out every day, which is ridiculous, so let's see how we can put some order to it, at least for the things that are interesting to us. And to do that, I will share my screen, desktop one. So we go on Twitter first. How do we use Twitter? You go to the search bar on top and you search "from:" my handle plus the query, say "information retention". And you get this post from a few hours ago, where I'm pointing out that yobibyte is making summaries of the papers he's interested in. If you click here, you get redirected to Notion, which I also talked to you about two weeks ago, I think. Here you have tags, for example the possible topics of the paper, when it was created, the authors (which is just a list of names), the link, and some comments. Then the important part is the following, the template, which you can also get if you check the tweet. You should always annotate each paper starting from a default, pre-made template, which helps you in the process. You start with the why, then the what, and the how: very few words describing the content, a tweet-like amount of words, so that you have a condensed version instead of many lines of blah. It's also a way of engraving the information in yourself, because when you type or write things down, they get memorized better. So I would really recommend you do something similar: the why, the what, the how; then the TL;DR, "too long, didn't read", a one-liner describing the content of the paper. Then there is a Python block where the main concept of the paper is summarized in terms of NumPy, PyTorch, TensorFlow, whatever kind of algorithmic implementation. Then there are the main formulas of the paper, the main equations. Finally, you may have some final remarks, a note about anything else that is perhaps important. And then again, at the bottom, you can find the whole list of papers. I have something similar, but with a different set of metadata, which is the following. So if we go back here: quote tweet, listen. I have the following metadata instead. This is too small, maybe? Let's see if I can zoom. Okay, there we go.
So we have the topics; the URL; what I want to do with it; the authors, which are also items; the priority in my to-do list; the code; the completion; the referrer, that is, where I found it; the value, how much this paper is worth to me; the project link, pointing to my own projects if I'm working on something related (I have shared databases); and finally the blog, which is important in order to be able to communicate, okay? So these were the initial remarks about something I really care about: being able to process and retain information. This is an aid for learning, okay? If you are not yet structured in your learning from the Internet, this is going to help you a lot, and not only for papers; I have similar setups for other types of media. We'll talk about that next time. Enough promotion and advertisement.

All right, one more thing. I'm finally going to answer a question from, I can't remember who, asked a few weeks ago: why do we need to go into a high-dimensional space? Last time I showed you that animation that was very smooth; today I'm going to show you this other animation instead. This is a network that has only two neurons per layer: the first hidden layer, the second, the third, and the fourth all have only two neurons. Then I have a two-neuron embedding, which is simply a linear projection without the nonlinearity. So the orange one is just an affine transformation, whereas the green ones are the neurons with the activation function. The pink one is my input layer, the blue one is the soft argmax of the final embedding, and E stands for embedding, that final linear transformation of the neurons. Here you can count one, two, three, four, five, six layers; the embedding has no nonlinearity, so it doesn't really count as a layer. Let me show you how this network works. In this case we try to disentangle that spiral I showed you last time, but we do it only in the plane, so we only go from 2D to 2D to 2D to 2D to 2D to 2D. As you can tell, in order to stretch things out while staying in a 2D space, you would wish you could simply rotate points with a different angular velocity depending on the distance from the center. Instead, what the network learns is the following, which is a really stretchy way of undoing the warp, the entanglement, and it's really brittle. As you can tell, these sharp edges, which come from applying the ReLU several times across the network, leave some of these regions very compressed and others, the other way around, completely stretched. And it was very hard to train. This is an interpolation for each layer. So we have the first linear layer, and this is the ReLU: as you can tell, the negative part gets squashed back. Again an affine transformation and then a squashing function: you can see how quadrants 2, 3, and 4 get cut. Again an affine transformation, and quadrants 2, 3, and 4 get zeroed, so everything that was there gets squashed down onto the axes. Finally, the last affine transformation, then the final squashing; some pieces are left over because I'm using a leaky ReLU. And that's the final transformation.
And this is the embedding layer, where I then split it with the three planes. These are going to be the SVD decompositions. So for the first affine transformation and ReLU we have rotation, reflection, zooming, rotation, reflection (because the determinant is negative), and then the final part. Same here: you have the SVD decomposition, and then a reflection if the determinant of the rotation matrices is negative. Again, you should watch Gilbert Strang's video on the SVD decomposition; there is also 3Blue1Brown on YouTube explaining how eigenvectors and the SVD decomposition work. Please binge-watch these videos, because getting confident, not just familiar, with how these transformations happen, why they happen, and how they are made is a very fundamental piece that is going to help you develop an understanding of what's going on in this more complex succession of affine transformation and nonlinearity: rotation, squashing, rotation, squashing, rotation, squashing.

Okay, so I spent the first 10 minutes finally giving you, I think, the answer to the question of why we go into a high-dimensional space. As you can see, the result of not going into a high-dimensional space in the intermediate layers is a very, very brutal kind of stretching, which was super hard to train. Trust me, I spent quite some time making that animation, not for the sake of the animation itself, but for the actual training. These are toy examples, but you can understand why some things are hard even with toy examples. Whereas the other network, the first animation, the smooth one, was going from 2 to 100 to 2 to 5, right? But those last layers can be ignored, so effectively it was 2, 100, 5; this one is 2, 2, 2, 2, 2, 2 and 3. Question for the people at home: how many folds can you count in that video at the end of the whole transformation? You have to tell me next time. Also, there was another question, I think, from last class. You should have reminded me; I had to check those questions. Someone should write down these questions and figure out what they were. Some TA or grader should take care of these questions I give as home exercises, okay?

Okay, so what do we talk about today? We talk about the new homework. I'm going to give you more information so that you're going to be successful in completing it, and it's going to be about recurrent neural networks. Last week Yann started the lesson by talking about parameter sharing; then we introduced recurrent neural networks for one hour, and then we moved to a second hour on convolutional networks. Both of these architectures share parameters, in different manners. And a recurrent neural network doesn't necessarily operate on a sequence in one dimension: it can also operate on sequences that go in two dimensions. As long as it's a grid of items one after the other, you can apply a recurrent network in this direction and a recurrent network in that direction; if it's volumetric, you can also go in the third direction, and then you process one sample at a time, similarly to how a convolution can operate. But there are some differences. So I'm going to be starting today's lesson, unless there are questions in the chat, with a 15-minute delay.
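As a concrete illustration of that decomposition, here is a minimal sketch in PyTorch: take a 2x2 weight standing in for one of those linear layers, factor it with the SVD, and check that rotation/reflection, zoom, rotation/reflection reproduces the original linear map. The weight here is random, not the trained one from the animation.

```python
import torch

# Decompose a 2x2 linear layer's weight with the SVD, W = U S V^T:
# rotation/reflection, axis-aligned zoom, rotation/reflection.
torch.manual_seed(0)
W = torch.randn(2, 2)                 # stand-in for a trained layer's weight
U, S, Vt = torch.linalg.svd(W)

print("singular values (zoom factors):", S)
print("det U:", torch.det(U).item(), " det V^T:", torch.det(Vt).item())
# A negative determinant means that factor also reflects the plane.

x = torch.randn(2)                    # a point in the input plane
y_direct = W @ x
y_steps = U @ (S * (Vt @ x))          # rotate/reflect, zoom, rotate/reflect
print(torch.allclose(y_direct, y_steps))  # True: same linear map, step by step
```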
Foundations of deep learning, and that's me. Cool. So today we're going to be talking once again about recurrent neural nets and handling sequential data. Again, not quite, or at least not only: you can handle sequential data when the data is one-dimensional, and when I say the data is one-dimensional, I actually mean the domain on which this vector function, this vector field, is defined. If the domain is one-dimensional, it's a sequence; if it's a two-dimensional signal, you have two dimensions; three, et cetera.

So this is a vanilla, classical neural net, not recurrent. You have an input on the left-hand side in pink; then it goes through a rotation and squashing to get the hidden layer. In this case each single circle represents a vector, okay? Before, each circle represented a single neuron, but they can also represent a vector of neurons. So this is the collection of all the inputs: x, the pink thing, is a vector, and you want to think of it as coming out of the screen. So again: input vector x in pink; rotation, squashing, you get the hidden layer; rotation, squashing, you get the final output, which is called y hat, in blue. If you're an electrical engineer, you can think of this as combinational logic: the output depends on the current input, and that's it.

On the bottom, you now have a recurrent neural network. There are two differences here. Instead of just having x, h, and y hat, we have x[t], where the square bracket tells you it's a signal with a discrete index t; then h[t] and y hat[t]. The square brackets, as in signal processing, indicate that it's a discrete signal. And there is a connection there, a loop on the h, which is not as crazy as it may sound, as in "oh, I don't know exactly what the value of h is, how many times do you loop it there?". No: there is a delay module in the loop, and it simply says that the next value of h equals a rotation of the input plus a rotation of the previous value of h. So we have discrete time intervals, indexed by t. Again, if you're an electrical engineer, you can think of this as sequential logic, where the output is not only a function of the input but also a function of the state of the system. So now we introduce the state: the system has a memory. Similarly to electronics, we can also reset the system, for example by setting the internal memory to zero. Cool.

If we use Yann's notation instead (again, those were my neural diagrams; Yann uses a different style of diagram), we have, as you can see in the vanilla network here on the top left, this little projectile, this bullet symbol, which represents a rotation and squashing together. So you have a rotation and squashing there, there, there, and there; for me they were implied in the previous diagram. All right. Finally, I will go rather fast on this next part because it's just motivation, and I think we already spent quite some time on motivation, so I will try to be a little bit more speedy, okay? So, what can we use this recurrent network for? In this case, I will consider applications for one-dimensional signals.
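To make the combinational-versus-sequential distinction concrete, here is a minimal sketch; all the sizes and names are made up. The feedforward step depends only on the current input, while the recurrent step also takes the state coming back through the one-step delay.

```python
import torch

d_x, d_h = 3, 4                      # hypothetical input / hidden sizes
torch.manual_seed(0)
Wx  = torch.randn(d_h, d_x)          # input-to-hidden rotation
Whh = torch.randn(d_h, d_h)          # hidden-to-hidden rotation (the feedback path)
b   = torch.randn(d_h)

def feedforward_step(x):
    # "combinational logic": the output is a function of the current input only
    return torch.tanh(Wx @ x + b)

def recurrent_step(x, h_prev):
    # "sequential logic": the output also depends on the state (the delayed hidden value)
    return torch.tanh(Wx @ x + Whh @ h_prev + b)

h = torch.zeros(d_h)                 # reset the memory, h[0] = 0
for t in range(5):                   # unroll over discrete time steps
    x_t = torch.randn(d_x)
    h = recurrent_step(x_t, h)       # the very same weights are reused at every t
```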
Nevertheless, you can have applications of RNNs where the domain is a two-dimensional, three-dimensional, whatever-dimensional grid. The domain is the grid of locations at which you have a vector value, or a scalar, whatever you want; usually you can think of some number of channels specified at every location of the one-dimensional grid, two-dimensional grid, three-dimensional grid, and so on. Type questions if I speak too quickly, okay? All right.

So, the first case I can think about is going to be vector to sequence. I provide an input, which is my single vector, bold x, and then I get a first output, the blue one. Then I feed this output back in as if it were the input, and I get a second output; then I feed the second output back to the input and get a third output, and so on and so on. So basically, given one vector as input, I get a sequence of vectors: I can convert a vector into a sequence by using a recurrent neural net, okay? Question for the people at home: what could be an application that uses this kind of diagram? What could the one input vector be, with a sequence as output? Guesses? Generating music; and what is the input then? Translation? For translation you actually have a sequence as input, right? You have the source language and the target language.

Okay, so let me show you this one. For example, my x could be an image, or the embedding of an image, the last representation of a convolutional net. And then the recurrent network will be trained to produce a sequence of tokens, a sequence of representations representing words. In this case we provide an image, or the embedding of an image produced, perhaps, by a convolutional net, and then we force the network to output the following. These are the results of a network on the test set, and you get "a person riding a motorcycle on a dirt road". Awesome, right? So the input was a vector, which is this image. Okay, you can think about it as a vector, but we also know that it is a signal, a three-channel signal defined on a 2D grid. This signal has been hierarchically decomposed and reassembled, perhaps to get a vector for classification; but we can go one layer below and take the embedding representation just before the final layer. The convolutional net provides that vector, and we send it to the recurrent network so that it has some context. Given that context, we train the recurrent network to produce the sequence of tokens that we can eventually decode into this text. So we provide the representation of this image, and then we get "a person riding a motorcycle on a dirt road"; in the second case, we get "a group of young people playing a game of frisbee". Fantastic. This is, I think, almost crazy: you provide an image, or the representation of the context, and the network is able to spit out a sequence of characters, or a sequence of tokens. We don't know yet what those tokens are, but okay, a sequence of one-hot encoded vectors representing the words in a dictionary. And in the last case, we get "a herd of elephants walking across a dry grass field".
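A minimal sketch of this vector-to-sequence idea: a toy autoregressive decoder that starts from one context vector, say an image embedding from a convolutional net, and emits a sequence of token scores. All the sizes, names, and the greedy decoding choice are mine, not those of the captioning system on the slide.

```python
import torch
import torch.nn as nn

class Vec2Seq(nn.Module):
    """Toy vector-to-sequence decoder: one context vector in, a token sequence out."""
    def __init__(self, d_ctx=256, d_emb=32, d_hid=64, vocab=100):
        super().__init__()
        self.init_h = nn.Linear(d_ctx, d_hid)   # context vector -> initial hidden state
        self.emb    = nn.Embedding(vocab, d_emb)
        self.cell   = nn.RNNCell(d_emb, d_hid)
        self.out    = nn.Linear(d_hid, vocab)

    def forward(self, ctx, max_len=10, bos=0):
        h = torch.tanh(self.init_h(ctx))                         # (batch, d_hid)
        tok = torch.full((ctx.size(0),), bos, dtype=torch.long)  # start-of-sequence token
        scores = []
        for _ in range(max_len):
            h = self.cell(self.emb(tok), h)    # feed the previous token back in
            s = self.out(h)                    # scores over the vocabulary
            scores.append(s)
            tok = s.argmax(dim=1)              # greedy choice of the next token
        return torch.stack(scores, dim=1)      # (batch, max_len, vocab)

ctx = torch.randn(2, 256)              # e.g. embeddings coming out of a convolutional net
print(Vec2Seq()(ctx).shape)            # torch.Size([2, 10, 100])
```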
It sounds like AI has been solved, like captioning and NLP have been solved. Not quite; let me click through. Sometimes we don't get such accurate predictions. In the first one, we have "two dogs play in the grass": I count three. "Two hockey players are fighting over the puck": seems correct to me, maybe, I have no idea; there is only one dude here. Third one, "a close up of a cat lying on a couch": looks like a bed to me. If we check worse results, we get "a skateboarder does a trick on a ramp": this is a bike. Or "a little girl in a pink hat is blowing bubbles": she has a hat, maybe red and white, I don't know if it's a girl, and there is no bubble there. Or "a red motorcycle parked on the side of the road": it's not red, nor a motorcycle. Or it can brutally fail: "a dog is jumping to catch a frisbee". No, it doesn't; it's not jumping, so it got the action wrong. Maybe the action needs video rather than a single image, but I can still see a dog flying and I would expect it was jumping. Or "a refrigerator filled with lots of food and drinks": not quite. Or "a yellow school bus parked in a parking lot": again, it got something right, not quite everything. So there are some mistakes. Anyway, this was to show you that we can convert one vector into a sequence of tokens, a sequence of representations corresponding to the words you can read here. So this is the first application of a recurrent net: converting a vector into a sequence of vectors. Questions? No? Okay.

So, again, I'm going to be a little bit speedy, because we're going to be covering some more material and otherwise it gets too rushed. Back to this diagram: the second type of application is instead going to be sequence to vector, the other way around. I represent my sequence here: if x[t] is just one element, I use the curly brackets to represent the whole sequence, for t equal 1 to capital T; so this is 1, 2, 3, up to capital T. Thank you, Muhammad is suggesting sentiment analysis. That's perfect: you provide, let's say, an Amazon review, and then the model says how many stars it gave. Or I can show you something a little niche; I just like it because it's niche, but it's not the best example, I think sentiment analysis was a better one. Again, this is sequence to vector: at the beginning we have a zero state, we initialize our internal memory, we provide the input, and then the internal state evolves; there is a trajectory of the internal state that keeps changing as we provide the sequence of inputs, until we get to the final destination. And then that state gets decoded into the final target, the final prediction, the blue one here. Cool.

So, "learning to execute": this was mind-blowing when I saw it, and it still doesn't make sense to me, it's crazy. The authors are Zaremba and Sutskever, two big names these days. This is awesome, I think, I like it. The input is the following sequence of characters, spelling out a small Python program:
j=8584; for x in range(8): j+=920; b=(1500+j), with the parentheses, I don't know why; print((b+7567)). The target is 25011, okay? Although the output here could itself be thought of as a sequence of characters, in this case I'm just supposing it's a scalar. So the network learns how to execute programs without executing them. You provide a sequence of text describing a Python program, and you train a neural network to actually give you the answer of the program. This is just mind-blowing to me.

Question from the chat: "For image captioning, don't we first need to extract features? In that case we have many features that we feed into the RNN." Yes, I was saying before that for image captioning you may want to use a network that you have trained for classification, or in some unsupervised manner that we'll see in a few weeks, and that has already learned how to extract a hierarchical decomposition, a hierarchical summarization, of the features. So yes, whenever you have an input which is a signal over a 2D grid, you're going to use a convolutional net to extract information. Once you have this higher-level representation, this embedding, you can provide that vector to the recurrent network so that it operates on vectors rather than on signals; although the recurrent network does operate on a one-dimensional signal, namely the sequence of items. So you're going to be using multiple architectures stuck together: you build your final model as a construction of multiple LEGO blocks, okay?

The other example here, coming back: that's a Turing-machine-like task with a neural net. There are also neural Turing machines, but that's another article; this one is just training a recurrent network to output an approximate version of the program's result, I think. I'm not entirely sure this one is a Turing machine; someone did train a few Turing machines with a neural net. This is another example, and similarly you have a different final output, okay? Again, this is also mind-blowing to me.

Back to the previous diagram: we covered vector to sequence and sequence to vector, so what's missing? For example, we have sequence to vector to sequence, okay? And this used to be the standard, the state-of-the-art, manner to perform translation. From the chat: "This should be an easier problem compared to NLP, right?" Which one? Write the whole sentence, please, I don't know what you're talking about. Ah, the classical learning-to-execute compared to natural language processing. Okay, I guess the Python one is not ambiguous, right? The Python code has a deterministic output, whereas with natural language it's hard to get a correct answer, because there is no such thing as a correct answer: in languages there are different shades, hues, of possible interpretations of the meaning. So yes, the Python one has a deterministic result, unless you have np.random or whatever.
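Before moving on, a minimal sketch of the sequence-to-vector setup, in the spirit of the sentiment-analysis example above: read a whole token sequence, keep the final hidden state, and decode it into one prediction. The vocabulary size, dimensions, and names are all made up.

```python
import torch
import torch.nn as nn

class Seq2Vec(nn.Module):
    """Toy sequence-to-vector model: read a token sequence, emit one prediction,
    e.g. a star rating for a review."""
    def __init__(self, vocab=1000, d_emb=32, d_hid=64, n_classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.RNN(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_classes)

    def forward(self, x):                                     # x: (batch, T) token indices
        h0 = torch.zeros(1, x.size(0), self.rnn.hidden_size)  # initialise the state to zero
        _, h_T = self.rnn(self.emb(x), h0)                    # h_T: final hidden state
        return self.out(h_T[-1])                              # decode it into class scores

x = torch.randint(0, 1000, (4, 20))     # a batch of 4 sequences, 20 tokens each
print(Seq2Vec()(x).shape)               # torch.Size([4, 5])
```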
All right, moving on, back to sequence to vector to sequence. Here there is an intermediate representation, a vector representation, which has condensed the temporal information. So you have temporal information, and you condense it down into a vector, which I like to call the concept vector: a vector representing the concept of that expanded, temporal version. And then, given that concept vector, boom, it gets expanded back into the temporal domain. So we have this kind of temporal collapse, or temporal compression, as I like to call it. Are we converting time into the frequency domain? No, no; I don't think it's necessary to learn a time-to-frequency transform here. The point is that you can collapse time: the temporal information gets converted into, let's call it, descriptive information, using the same name we used yesterday, and then the descriptive information gets converted back into temporal information. Yesterday we saw that spatial information is converted by a convolutional network into descriptive information, and, if we use something called an autoencoder, we can get that descriptive information back to spatial information. Similarly, here temporal information is collapsed into descriptive information and then expanded back into some other temporal information, okay? Cool.

So, I told you already, this is the classical example: translation, where we used to condense the meaning of the source sentence into this concept vector and then unpack it in a different language. And you may have different decoders for different languages too. Once you train this system to perform translation, you end up with a very interesting algebraic structure in the embedding space. Let me show you what this actually means. Here I show you a few diagrams where you can see the PCA of the hidden representations, and there are clusters: there is a green cluster here, a cyan cluster and a yellow cluster over there. Let me zoom in a little bit. The green one is all the months: all the months are near the same location, which means they are semantically replaceable. If you swap one month for another in a sentence, it may change the meaning a little, but it's not going to be grammatically incorrect. "It snows in December." "It snows in August": maybe, if you're on the other side of the globe, it's even semantically correct; but grammatically it's correct regardless. So their embeddings, their encodings, are really similar, because they are swappable. Similarly, in this other case, we have all the temporal expressions: "one to three months", "two days before", "nearly two months", "over the last two decades", blah blah blah. All of these can be swapped in and out without breaking the language; they will change the meaning, but nevertheless their representations are very close to each other, okay? So this is super interesting. Finally, as I was mentioning before, there is some algebraic structure that this space learns: for example, the distance between king and queen is the same as the distance from man to woman. So if you take king, subtract man, and add woman...
...king minus man plus woman gets you down to this point here, okay? I think it's super interesting how you end up with a mathematical, geometric representation of words: there is a semantic space in which you can now perform operations. Or the fact that the vector walk minus walking, this arrow, is the same as swam minus swimming: the displacement connecting the present continuous to the other form is the same for the second verb, even though that one is an irregular verb. And similarly, the vectors connecting a country to its capital are all the same: if you do Spain minus Madrid, Italy minus Rome, Germany minus Berlin, all of these vectors have a very, very high degree of alignment.

"What are these axes here? What does each axis represent?" In this case the axes are just a pictorial representation; we are talking about the embedding, the hidden representation. If I go back to this diagram, each point in that picture is this vector over here, the last hidden layer. You provide, bam, bam, bam, the three inputs, and you get this hidden layer over here. This hidden layer lives in some space of whatever dimension, let's say 256, and in that 256-dimensional space you have this algebraic structure. The picture was just a pictorial rendering of, say, the first three components of these vectors; it was not exact, it was just for intuition. The other diagram, the one I showed you here, that one is a PCA: those are the two principal components of the embedding layer, the hidden representation. Someone in the chat says "these are three dimension-reduced outputs": they are not outputs, those are the hidden representations, and they weren't even dimensionality-reduced; that one was, I think, just a pictorial representation. But yeah.

Finally, we are almost on time for this part. The last configuration is this one, which was mentioned in the chat: you provide an input sequence and you get an output sequence. This is going to be in the homework, okay? What's an example here? If you're as old as I am, you might remember T9 on the Nokia phones: as you type, the phone tries to predict what you're saying. We have something similar right now too, I think: there is a predictive algorithm trying to tell you what you were actually typing, so that you have less to type. But I guess no one types anything anymore; there is swipe, you know? So why would you actually type things now? We have touch screens on mobile phones, but once upon a time we actually had keys, so it was a pain to type everything down, because you had to press multiple times: to type C you had to press the 2 key, the ABC one, three times, and to type S you had to press the 7 key one, two, three, four times, no? I don't know if you remember this stuff. Yeah, that's crazy, I think. And something I can show you here is this editor which has been trained... Ah, yes, you're right, you didn't have to look at the keyboard: back then we could actually type messages without watching, using the tactile feedback. Good point.
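A quick aside to make that embedding arithmetic concrete: a toy sketch with made-up random vectors. Real embeddings would come from a trained model, where the nearest neighbour of king minus man plus woman does tend to be queen.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the embedding arithmetic; these vectors are random placeholders,
# not trained embeddings, so the result is only about the mechanics, not the semantics.
torch.manual_seed(0)
words = ["king", "queen", "man", "woman", "walk", "walking"]
E = {w: torch.randn(256) for w in words}          # hypothetical 256-d embeddings

query = E["king"] - E["man"] + E["woman"]         # "king - man + woman"

# Nearest neighbour by cosine similarity, excluding the query words themselves.
sims = {w: F.cosine_similarity(query, v, dim=0).item()
        for w, v in E.items() if w not in ("king", "man", "woman")}
print(max(sims, key=sims.get))   # with trained embeddings this tends to be "queen"
```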
All right, all right. So let's say we train the system on science-fiction books. If you do that, you're going to get something like the following. You type "the rings of Saturn glittered while" and it continues with something like "the two men looked at each other; they were enemies, but the server robots weren't"... what? I don't know. So if you're not a very proficient writer, or maybe you don't have many ideas about how to write a new novel, just train your recurrent neural network on the latest novels, and you're going to get an assistant providing you with tips for your new novel. Okay, jokes aside, you can find more about this one on GitHub. Something else which is so, so interesting, although it's not a recurrent network, still text generation, is the following. I'll just leave it here for a second so that you can screenshot it, or you can watch the recording later. The content here is ridiculous, because it reads as if it were written by a human. It's not, but it's crazy.

All right, moving on; we might even end on time today. How do we train this recurrent neural network? Is this recursion, the fact that the current value depends on the previous value, anything crazy? No, everything is just very plain backpropagation. But then how do you handle the fact that there are time dependencies? Well, there are no strange time dependencies here; there are discrete time intervals, right? So we simply unroll the network over time, and we get a network that extends along one more dimension. We expand the network, but we have to keep in mind that we are applying parameter sharing. What do I mean? I'm going to tell you right now.

So, this is the classical network; those are vectors: the input, the hidden, and the output on top. And these are the usual equations: the hidden layer is the squashing of an affine transformation of x, h = f(W_h x + b_h), and y hat is the squashing of an affine transformation of the hidden layer, y hat = g(W_y h + b_y). We know this stuff. Now enter the recurrent neural network. What are the differences? A very tiny difference, I think. These are the equations; they may look a little scary, but I'm going to go through them right now. We have that h, the hidden layer, with temporal index t (these are indices now, not variables), is the squashing of the same W_h as before, applied to a new vector, which is not just the x vector but x[t] concatenated with the previous value of the hidden layer, h[t-1], plus the translation, the bias, the offset, whatever you want to call it: h[t] = f(W_h [x[t]; h[t-1]] + b_h). Moreover, I have to start somewhere: we said we count from one, so I have 1, 2, 3, 4, and whenever I start with h[1] I have to feed something in for h[0], so I define h[0] to be zero. I just reset it to zero. Why zero? Because it basically kills everything in the multiplication. This notation is here for two reasons.
The first reason is just to make it look like the previous notation: we still have W_h times a vector. The second reason is that if you keep it as one vector, you get the result in one go with a single multiplication. But you can also see it unpacked into multiple multiplications: W_h is the horizontal, side-by-side concatenation of W_hx and W_hh, and the vector is x[t] stacked on top of h[t-1]. A matrix times a stacked vector is the same as the sum of the two matrix-vector products, so this is the same as writing W_hx x[t] + W_hh h[t-1] + b_h. Finally, the output is simply the squashing of an affine transformation of the hidden state at time t: y hat[t] = g(W_y h[t] + b_y). So no big deal: the last equation is the same, the first equation is almost the same; the only differences are that there are discrete indices t, and that the input is no longer just the input but the input together with the state of the system, where the state of the system is the hidden representation from the previous time step.

"Could you explain the initialization of h[0]?" I just have to start somewhere, so I have to do something with the memory. In your computer, if you write a program in C, you cannot read a memory location before you have written anything to it; well, you can, but you'll get garbage out of it. If you want to accumulate something, say a counter, you write i = 0 and then i += 1: you start from zero and then you accumulate. If you start summing into i and i was never initialized, you have no idea what you're going to get, right? So I define the value to be all zeros: I just zero out my memory. At the beginning I have no better knowledge of what to put in the memory, so I erase it and then start putting things inside. That's why we usually initialize variables to zero; there is no deeper reason to initialize to zero. It's like zero padding; zero padding is maybe okay because we subtract the mean in the representation, but okay. So, no answer, maybe a confusing answer.

So how do we train this stuff? We simply unroll it in time: I have a replica at time t-1, a replica at time t, a replica at time t+1. And then how do we train? You have a loss function here, fed by this output; this other output goes into another loss function over here; and this one into a loss function over there. Or maybe one loss function that has all three components together. Then you get a gradient that goes down in this direction, and this gradient also goes down in this direction, and this one goes against all the arrows: whenever you do backpropagation, you go in the direction opposite to every arrow you have here, and you keep going. So the gradients go down here and down here, and then finally the gradient goes down here, okay?
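A minimal numerical check of that equivalence: the stacked form W_h [x[t]; h[t-1]] against the split form W_hx x[t] + W_hh h[t-1]. The sizes are arbitrary.

```python
import torch

d_x, d_h = 3, 4
torch.manual_seed(0)
Whx = torch.randn(d_h, d_x)              # acts on the input x[t]
Whh = torch.randn(d_h, d_h)              # acts on the previous hidden state h[t-1]
b_h = torch.randn(d_h)
Wh  = torch.cat([Whx, Whh], dim=1)       # side-by-side concatenation: W_h = [W_hx | W_hh]

x_t, h_prev = torch.randn(d_x), torch.zeros(d_h)   # h[0] = 0

# Stacked form: one matrix times the concatenated vector [x[t]; h[t-1]]
h_stacked = torch.tanh(Wh @ torch.cat([x_t, h_prev]) + b_h)
# Split form: W_hx x[t] + W_hh h[t-1]
h_split = torch.tanh(Whx @ x_t + Whh @ h_prev + b_h)

print(torch.allclose(h_stacked, h_split))   # True: the two forms are identical
```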
Possibly you'll also get a gradient coming down from the future, and possibly a gradient going to the past. This is backpropagation through time, which was a nightmare when you had to code it by hand; we don't ask you to code this by hand, because with PyTorch you simply get backpropagation. What does backpropagation do in PyTorch? Can you remind me? This is the important part. Why is it important here? What did we say backpropagation does? Are you with me? "It computes the gradients." Okay, someone is actually correct right now: Muhammad is correct, it accumulates the gradients. Why do you accumulate the gradients? Because once you go down here, you've computed the gradient of the loss with respect to this parameter; but when you then go down there, you still want to add on top of it. If you simply overwrote the gradient, you'd be killing what was computed previously. Instead, by accumulating, you keep refining this value, because all these W's here (maybe I should change color), all these values, are the same parameter, shared by all these replicas over time, which are the same network. So whenever you compute the gradients, not only along this path but also down this other direction, the gradients get accumulated here and here. So, how many contributions does a given parameter get? If I look at this item over here, how many updates will its gradient receive in this diagram? Can you count? Type it in the chat. No, it's not three; it's more than three. Why is it more than three? Five, for sure; actually, three from here, two from this one, plus one: six. So there are six contributions accumulated into the gradient of this parameter, okay? This is backpropagation through time: an accumulation of multiple partial derivatives, due to the fact that it's a weight-sharing network, okay?

Okay, moving forward, a training example. Let's say I have this sequence of symbols; I just use letters so that it's a bit easier to refer to each of them. And I'm going to batch this: I split the sequence into chunks, A through F as the first chunk down here, then G through L, M through R, and S through X, and I stack these chunks so that I have different columns representing my data. My batch size here is four, so I split my entire sequence into four chunks and pack them together. So, in this case, how do I train, say, a language model, which tries to tell me in advance what I'm going to say given what I just said? If my input batch is ABC, GHI, MNO, STU, so that the BPTT period, the unroll length, is three, then I force my network to predict the next symbol: if I provide ABC, I want the network to say BCD. When I provide A, the network should tell me B; when I provide B, I want the network to give me C; when I provide C, I want the network to give me D. And the same for the rest of the batch: I provide G, the network is forced to give me H; I provide H, the network is forced to give me I; I provide I, the network is forced to give me J.
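A minimal sketch of that batching scheme, just to make the slicing concrete. The toy alphabet sequence and the variable names are mine, not the notebook's.

```python
# Split a toy sequence into batch columns and walk through it bptt symbols at a time.
seq = "ABCDEFGHIJKLMNOPQRSTUVWX"      # the toy sequence of symbols
batch_size, bptt = 4, 3

# batch_size chunks, stacked as columns:
#   A G M S
#   B H N T
#   C I O U ...
chunk = len(seq) // batch_size                                      # 6 symbols per chunk
cols = [seq[i * chunk:(i + 1) * chunk] for i in range(batch_size)]  # ABCDEF, GHIJKL, ...

# The target is the input shifted by one symbol (the last window comes out shorter).
for t in range(0, chunk - 1, bptt):
    inputs  = [c[t:t + bptt]         for c in cols]   # e.g. ['ABC', 'GHI', 'MNO', 'STU']
    targets = [c[t + 1:t + bptt + 1] for c in cols]   # e.g. ['BCD', 'HIJ', 'NOP', 'TUV']
    print(inputs, "->", targets)
```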
So how do we train this network? I have my input here; I feed it to the recurrent network and I get my first output. Then I provide the second input, together with the hidden representation from the previous time step, and I get the second output. Then the third input, given the representation of the previous state, and I get the final output there. Now, what is this? Why do I put a slashed arrow here? Anyone? I cut the state there, right? I had to stop somewhere: I took my three items and I don't go on ad infinitum; I just have these little chunks. And I also have a slash on the left-hand side. What does that slash mean? First of all, h[0] is initialized to zero; but it also means there is no gradient coming from the past and no gradient going to the future. Because if this kept going, the computational graph would get longer and longer and longer, and you would run out of memory. So you have to chop your sequence, and then you learn on chunks, on parts of the sequence: you go forward through one chunk, then you run backpropagation through that chunk; you take the next chunk, forward and backward; next chunk, forward and backward. You cannot do the forward pass over the whole sequence and then the backward pass, because there is no more memory in your computer. I wish you could; you cannot.

All right, vanishing and exploding gradients. You will be asked about this in the homework; here I'm just giving you the intuition. In a recurrent network, your first input basically gets lost through the network: when you run backpropagation, the gradient doesn't really make it back to the early inputs. Whereas if you use something called a gated neural network, which is a similar drawing where instead we have these symbols that look like mouths: if the mouth is open, the signal goes forward. So mouth open, the value moves over here; mouth open, the value moves forward; mouth open, it moves forward, and so on. Mouth open here, and the signal goes up to the output. Then again mouth open, mouth open, mouth closed, so nothing goes up; then mouth open, it goes up; and then it stops here, so there is no more propagation of the state, which is basically a reset, right? And here you can tell there are four recurrent neural networks. There is a first recurrent net for this value over here, which is a function of the input but also of its previous value. Then the second recurrent net is the one that governs this mouth, the input gate. Then there is this mouth here, governed by the recurrent network for the forget gate, or the remember gate, as I will call it instead. And then there is an open-or-closed mouth here, so there is a fourth recurrent network governing when to open or close the output gate. So there is an input gate, a remember gate, and an output gate; I call them mouths, but they are gates. The mouth can be open or closed, the gate can be open or closed; since it's drawn as a circle, to me it looks like a mouth. Anyway, there are three gating recurrent networks plus the fourth one, the main one.
And one implementation of this is called the long short-term memory, the LSTM, which was made by this gentleman over here, Jürgen Schmidhuber, and his student. So here I'm showing you these equations, the ones I showed you before for the recurrent network, drawn with this kind of neural diagram; with Yann we now use the other diagrams, the ones with the projectiles, but this is what I used to draw, a neural diagram representing those equations. On the other side, I'm going to show you the equations for the LSTM. Don't scream, don't get too crazy; although no one is screaming, so maybe you're fine, or maybe you're not listening. I'm joking. Okay, there you go. These equations are not too crazy, in the sense that they are simply what this circuit, this neural circuit, represents. What do they do? I'll explain this one, then we see the notebook, then we say goodbye. The first one is the input gate. I told you there are three mouths: there is the input mouth, or input gate; then we have the "don't forget" gate, the remember gate; it's usually called the forget gate, but that name is wrong, because it works with the opposite logic, so I will call it the remember gate; and then we have the output gate, which determines whether something goes out or not. The particular thing about these gates is that their nonlinearity is a sigmoid, which goes from zero to one, so you can use it as a multiplier: multiply by zero and you kill the signal, multiply by one and the signal goes forward. Whereas the main neural net, the green one, this item in the center here, has a hyperbolic tangent, for example.

All right, moving on: how does this LSTM work? The output can be turned on or off. As I told you, if the sigmoid is, say, a discrete one or zero: if my internal final representation is purple and the sigmoid is zero, zero comes out of the multiplication; if my sigmoid is one, green, the multiplication sends the internal representation forward. That's quite straightforward, I think. Then, on the other side, we control the memory. How do we reset the memory? You have an internal candidate, this purple one, and the previous cell, this blue-green one. If you put a zero on the input gate, the input gets multiplied by zero and nothing goes forward; similarly, if you put a zero on the memory gate, the memory contributes a zero; summing two zeros, you get a zero in the internal memory. So this is how we reset, okay? Similarly, if you want to keep the memory, you don't take any input, so you zero the input gate, and you simply put a one on the memory gate so that everything keeps flowing and you keep the same blue content. Finally, we can write something new by sending a one through the input gate, so that the one gets multiplied by the purple and the purple gets through, while you don't want to keep or use anything from the memory, so you put a zero on the memory gate and kill whatever was there; and then, finally, you end up writing purple into your cell. And that was the lesson.
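A minimal sketch of one LSTM step following the standard gate equations just described: sigmoids for the input, remember ("forget"), and output gates, a tanh for the main candidate, and a cell memory updated by the gates. The weight layout and names here are generic, not necessarily the exact notation on the slide.

```python
import torch

d_x, d_h = 3, 4
torch.manual_seed(0)
# One weight matrix and bias per gate, each acting on the concatenation [x[t]; h[t-1]]
W = {g: torch.randn(d_h, d_x + d_h) for g in ("i", "f", "o", "g")}
b = {g: torch.zeros(d_h) for g in ("i", "f", "o", "g")}

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([x_t, h_prev])
    i = torch.sigmoid(W["i"] @ z + b["i"])   # input gate: let the new content in?
    f = torch.sigmoid(W["f"] @ z + b["f"])   # "forget" (remember) gate: keep the old memory?
    o = torch.sigmoid(W["o"] @ z + b["o"])   # output gate: expose the state or not?
    g = torch.tanh(W["g"] @ z + b["g"])      # candidate content (the main network)
    c = f * c_prev + i * g                   # update the cell memory
    h = o * torch.tanh(c)                    # gated output, the next hidden state
    return h, c

h = c = torch.zeros(d_h)                     # reset the memory
for t in range(5):
    h, c = lstm_step(torch.randn(d_x), h, c)
```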
Let me show you where the notebooks are; I don't want to keep the annotations. So, in the terminal we go cd work/github/pDL, then conda activate pDL, and then jupyter notebook. These things turn on, I move the notebook to the right-hand side, and then we go to the sequence classification notebook, number eight. And the bar is in my face again; go away, bar. Okay.

Okay, so what do we try to do here? I can't see... there we go. Here we have four types of sequences, okay? These sequences have a length between 100 and 110 symbols. I'm going to execute everything so that I can talk while this stuff is training. Again, you should be spending a couple of hours per notebook once we cover it in class, okay? Here I just show you the parts that actually run, and you have to spend time and get familiar with all the code yourself, because that takes time, I know. So, a sequence has between 100 and 110 symbols, and at two different locations, t1 and t2, where t1 can be from 10 to 20 and t2 from 50 to 60, you get two markers, okay? Each marker can be X or Y; oh well, I'm flipped, I'm mirrored. So you can get X,X; X,Y; Y,X; or Y,Y. Given these four possible combinations, you get four different classes: Q, R, S, and U, okay? So we're going to do sequence classification based on these two markers, where the first marker happens between positions 10 and 20, and the other one between 50 and 60, okay? And in between there are distractors, and those distractors are a, b, c, and d, four more symbols, mixed randomly throughout the sequence. Our sequence starts with B for "beginning", and this is the sign language for B, and then you have the E here for the end, with the sign language for E. So you go from B, beginning, to E, ending; you have a, b, c, d mixed randomly inside the sequence, and then you have these two markers. You want to be able to classify the sequence as Q, R, S, or U based on which markers it has, okay? I hope it's clear. So, how many symbols do we have in total? Who counted? Beginning and end, that's two; a, b, c, d are four more, so six; plus X and Y, eight, right? So there are eight symbols in total, and when we encode these things we have one-hot vectors of size eight. And here you see all zeros at the end: we had to pad, because we train with batches and some of the sequences are shorter; again, the sequences have different lengths. Okay. So, an example from the data set: B for beginning, then b, a distractor, then capital X, then c, then capital X; so X,X, which is going to be the first class, Q; then more distractors, and E for end. So this was a Q sequence, and we train the network to predict Q. The classes are one-hot encoded, as you can see over here: Q is the first one, so you have one, zero, zero, zero. And we train with the classical cross-entropy; it's a classification task over a sequence.
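A toy generator matching that description; this is not the notebook's actual code, and in particular the mapping of the marker pairs X,Y / Y,X / Y,Y onto the classes R, S, U is my guess (only X,X giving Q is stated above).

```python
import random

# Class label determined by the ordered pair of markers.
CLASSES = {("X", "X"): "Q", ("X", "Y"): "R", ("Y", "X"): "S", ("Y", "Y"): "U"}

def make_example():
    length = random.randint(100, 110)
    # B at the beginning, E at the end, distractors a-d in between.
    seq = ["B"] + [random.choice("abcd") for _ in range(length)] + ["E"]
    t1, t2 = random.randint(10, 20), random.randint(50, 60)
    m1, m2 = random.choice("XY"), random.choice("XY")
    seq[t1], seq[t2] = m1, m2            # drop the two markers among the distractors
    return "".join(seq), CLASSES[(m1, m2)]

seq, label = make_example()
print(seq[:30], "...", label)            # e.g. 'BdacbcabcaXbd...'  Q
```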
So, in this case, I have a simple recurrent neural network model, which uses torch.nn.RNN, and then I simply have my forward, which sends my input x through my recurrent neural net. In the other case, I have a simple LSTM model, which uses torch.nn.LSTM, and similarly the forward passes the input through the LSTM. Then we have the training with the five steps. Remember: we have the forward pass; the loss computation, number two; number three, zero_grad; number four, the backward pass, which computes, accumulates, all those partial derivatives; and then finally, number five, the step of the optimizer. The testing loop is the same, basically, without the backpropagation. Then we put it all together, and we have train and test functions that just run through these things, right? So I'm going to show you here, first, the easy setup and how they train: I train the RNN here, and then later I train the LSTM. Let me unzoom a little so you can see the whole line. In this case, the loss goes down and, epoch by epoch, the accuracy goes up; maybe we just trained for nine or ten epochs, which is not enough to get it all the way up there. If I train the LSTM for ten epochs instead, you can see that we reach 100%, okay? But perhaps we should just have trained the recurrent net longer, and indeed, if you train the RNN longer, it does get up there; it just takes a bit more time. But then what you want to try is to change this setting from easy to hard and see how they compare. With the LSTM and 100 epochs on the hard setting, you get to 100% within about 20 epochs, whereas in the first, easy case we were already at 100% after 10 epochs; so it trains faster there. But remember, in this LSTM you have four networks, one inside the other.

All right, model evaluation. Here, and then I'll let you go, we can see how the sequences get classified, and I can also visualize a few of the hidden units of these models. In this case, you can see how, whenever the X marker is encountered, one specific unit of the LSTM changes from a negative value to a positive value, and here we visualize that change. That was it, okay? I was perhaps running a little bit today, and I am over time again. I hope it was somewhat clear. This is going to be exercised again in the homework we're going to be sending to you. If there are more questions, I will take them on Campuswire. Thank you for being with us, and I'll see you next time, okay? All right. I hope you enjoyed the class; let me know if it was too fast, okay? All right. Bye. Bye-bye.
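For reference, a minimal sketch of the five-step training loop described above; the model, its sizes, and all variable names here are mine, not the notebook's.

```python
import torch
import torch.nn as nn

# 1 forward, 2 loss, 3 zero_grad, 4 backward, 5 optimizer step.
class SeqClassifier(nn.Module):
    def __init__(self, n_symbols=8, d_hid=16, n_classes=4):
        super().__init__()
        self.rnn = nn.LSTM(n_symbols, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_classes)

    def forward(self, x):                      # x: (batch, T, n_symbols), one-hot symbols
        _, (h_T, _) = self.rnn(x)              # keep the final hidden state
        return self.out(h_T[-1])               # decode it into class scores

model = SeqClassifier()
optimiser = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 110, 8)                    # fake batch standing in for one-hot sequences
y = torch.randint(0, 4, (32,))                 # fake class labels (Q, R, S, U)

scores = model(x)                              # 1. forward pass
loss = criterion(scores, y)                    # 2. compute the loss
optimiser.zero_grad()                          # 3. clear the accumulated gradients
loss.backward()                                # 4. backprop (accumulate partial derivatives)
optimiser.step()                               # 5. update the parameters
```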