This one says something like "a face... a bubble face"? I don't know. And "a red motorcycle parked on the side of the road" — well, that one is pink, but maybe it has that tone. And sometimes you get catastrophic failures: "a dog is jumping to catch a frisbee" — well, nope. "A refrigerator filled with lots of food" — you know how American refrigerators are, you open them and it's very messy, and this one is indeed very messy. Then "a yellow school bus parked in a parking lot." Here you can see some bias: in the US school buses are usually yellow, so given that there is a vehicle and it is yellow, the network says "school bus." It is still learning how to speak; maybe it doesn't really have a clue what a school bus is. So some things it gets very well, and some things not so well.

Again, we have a conversion from pixel intensities — values from 0 to 255, times 3 because you have RGB, times the grid size of the image — into a sequence of English words, and the words actually make sense, although sometimes they don't describe the scene correctly. And that's the whole point: this network learned a language model for free. It learned how to speak just by trying to reproduce the captions for those images, which from my perspective is absolutely exciting.

Will it say "grapes sitting in a field" or just "grapes"? It depends on what it is trained on: the grammatical structure of the training captions is preserved when the network learns to describe images. If you look at the Microsoft COCO dataset you can check some examples, see the form the captions take, and the network will learn to write that way — just as, if I teach you to speak in a specific way, you learn to speak that way from your parent. It depends on whatever you provide as training samples. So that is the first application of these RNNs. Cool, right?

Let's see some more examples and get back to this diagram (wow, so dark). The second one is the inverted case, the opposite: sequence to vector. We start with a sequence of symbols and output one vector. Can anyone think of an application? For the stock market you would probably still want multiple outputs — a sequence in, and a sequence of future predictions out — so that would be sequence to sequence. Think instead of Amazon or IMDb, where you read the reviews of a product or a movie and you would like to know whether a review is positive or negative. Here you can train a neural network to tell you exactly that: a network that understands English well enough to figure out whether a description is positive or negative. Another application is a neural network that learns how to execute a program. Let's see how it works.
You have an input, for example: I start with j = 8584, then for x in range(8): j += 920, then b = 1500 + j, and then I print(b + 7567). The target, which I get by actually running this with a proper interpreter, is 25011. I train my network on the whole sequence of characters and symbols — "j", "=", "8", "5", "8", "4", newline, "f", "o", "r", space, and so on — and the output has to be "2", "5", "0", "1", "1", full stop. So in this case the network learns how to execute a program. It's crazy: you don't even need an interpreter anymore, you just have a neural network figuring out the correct output of a program. You could even feed it some mathematics, some LaTeX, and it would tell you the result of the computation.

Here is another example: the input sets c as a function of i, then prints inside some kind of conditional statement — so it even has to learn conditional relationships — and the target is this value here. And this stuff actually works; it's really impressive. We have neural networks that are doing computation, and even when they don't get the answer exactly right, they still land in the right ballpark. It's like physicists: you have a good intuition about the order of magnitude of an operation — you may not know the exact number, but you know roughly what it is. When this network doesn't give you the correct answer, it still gives you a rough estimate of whatever you are trying to compute.

[Question: could the output be just a boolean saying whether or not the program halts?] Something related is what friends of mine are doing at Microsoft: they have a recurrent neural network running over code, trying to predict, for example, how to complete the next call. As for looking at a program and saying "yes, it will halt" or "no, it won't": formally that is impossible, but maybe you can try it empirically and see if it works — perhaps on the abstract syntax tree, which might be easier than raw text. Even so, it remains a formally impossible task for an algorithm; it would be interesting if an RNN got it right, say, 90% of the time. So I'm interested — I look forward to seeing your results if you try it.

So that was the second application; we have two more — are you excited? It's so cool. So far we have seen vector to sequence and sequence to vector. What is left?
Well, what your colleague was suggesting before: sequence to vector to sequence. In this case we get some input, condense it into an internal representation, and then — bam — expand it. So there is an encoder, where a sequence is condensed down to a concept vector, and then from the concept vector an expansion back into the sequence domain. Sequence to vector, compressing the knowledge of that sequence, and then bang, out to another sequence.

This is great because, for example, you can go from English to Chinese: you learn how to summarise English sentences into a single concept. Then you remove the Chinese part, plug in Italian, and the model starts speaking Italian without ever having been paired with English before. Given that you have learned the English–Chinese pair, you can switch the output side to Italian or switch the input side to Chinese and do all the combinations — this stuff is really versatile, you can swap and change things and everything just works. That was sequence, to vector (the condensed representation), back to sequence.

Here are some examples: a dimensionality reduction of the condensed version, the concept vector, after collapsing a sequence into a vector. You can see a yellow cluster, a blue cluster and a green cluster; let's zoom in. The green cluster contains all the names of the months. Nobody told the network that months are interchangeable, but it figured out that whenever the sentence is written in a certain way, it doesn't matter which month you put there. So the representations of the different months all end up close together: if you swap them they don't change the structure of the sentence much — they change the meaning, but they are interchangeable in the sense of still giving a proper sentence. The other cluster contains the time-related phrases: "one to three months", "two days before", "for nearly two months", "over the last two decades", "of two groups" — the number two keeps recurring. Everything expressed as some lapse of time collapses into this same region. And the nice part is that you can also take differences: the difference between this point and that one is about the same as the difference between these other two, so the metric in the latent space corresponds, somehow, to some metric in the original input space. Really great.
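As a rough illustration of this sequence → concept-vector → sequence idea, here is a minimal PyTorch sketch; the module names, the choice of GRUs, and the teacher-forcing decoding are my own assumptions for illustration, not the model from the lecture.

```python
import torch
from torch import nn

class Seq2Seq(nn.Module):
    """Encoder-decoder sketch: sequence -> concept vector -> sequence."""
    def __init__(self, src_size, hidden_size, tgt_size):
        super().__init__()
        self.encoder = nn.GRU(src_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(tgt_size, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, tgt_size)

    def forward(self, src, tgt_in):
        # Compress the whole source sequence into one "concept" vector h.
        _, h = self.encoder(src)          # h: (1, batch, hidden_size)
        # Expand the concept vector back into a sequence, feeding the
        # (shifted) target as input at each step (teacher forcing).
        out, _ = self.decoder(tgt_in, h)  # (batch, tgt_len, hidden_size)
        return self.readout(out)          # (batch, tgt_len, tgt_size)

# Swapping in a different decoder (say, Italian instead of Chinese) while
# reusing the same encoder is what the "plug in Italian" trick amounts to.
```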
Sorry — and then there is the last application, the synchronised sequence-to-sequence one: the input starts coming in and the network starts outputting right away. For example, you watch a movie but you are deaf, you cannot hear, so a neural network listens to the audio and starts spitting out English, and you read the transcription. Another example is what I used to call T9: if you had a Nokia twenty years ago, you had auto-correction. Well, you have similar things now — they work a bit differently — but you start inputting your text and the phone just tries to mess with you: you write something and it writes "duck". I didn't want to write "duck", I was saying something else, but again, that is the safer word to suggest. So as you input a sequence of symbols, it immediately starts outputting some other sequence: its best guess about what you are trying to write.

It can do even more fun things. Say you are writing a book but you are lazy, you have no imagination — let's have a recurrent neural network suggest the plot. Here I write "the rings of Saturn glitter while the two men look at each other" — so exciting — "they were enemies, but…" It has been trained on science fiction, so that's what it likes. This was all just to give you some motivation for why you should be very interested in playing with these models: they are absolutely powerful, you can do so many things. You can even read how people write by hand, and have a network write in a calligraphic way: I type some text, I don't like how impersonal it looks, I press convert, and it writes it out beautifully by hand, with every letter different — it never writes the same letter twice in the same way. Really good stuff.

All right, so let's see RNN training. How do we train this thing? With something called back-propagation through time, BPTT for short. Let's start again with the vanilla network, the non-mutilated perceptron we have seen multiple times so far. We have the input, the hidden representation, ŷ (my prediction), the weight matrices and the nonlinearities. Say the input x belongs to ℝ^q, the hidden representation h to ℝ^n, and the output y to ℝ^s. The hidden representation is simply an affine transformation of the input followed by a nonlinearity, and the same again for the output — the same thing repeated. You can think of it as my function f, my neural network, going from the input to an internal representation to the output.

So what is different now? Now we have h[t] — note the square brackets, which denote sequences. Here I have my input x[t], a sequence of x's; a sequence of internal representations; and a sequence of outputs — or maybe just the last output, dropping the earlier ones. And the change is this: instead of h being an affine transformation of the input alone, now h[t] is an affine transformation of the concatenation of the input and the previous state. I take my internal state, I take my input, I stack one on top of the other, and I do exactly the same thing as before — I just reuse the previous state inside. You might say: oh, how do you start?
Well, I set my initial state to zero, perhaps. My matrix W_h is simply the concatenation of two matrices: the one that maps the input to the internal representation, and the one that maps the internal representation to its own next representation. Finally, my prediction is the nonlinearity of an affine transformation of the hidden state. Nothing new: the equations are basically the same, but instead of just the input x I now have the concatenation [x[t]; h[t−1]], and instead of just one W_h I have the concatenation of two blocks — the first block of columns multiplies the input part, the second block multiplies the previous-state part. Concatenating like this is the same as summing the two products: one matrix times the input plus the other matrix times the previous state. Do you understand what we are doing here? (I cannot see you, you're in the dark… you're making noise, all right.) So this is my equation; the one little change is the concatenation.

How do we train this? There is no magic in this relationship — it is not something happening instantaneously; it simply says that the current state depends on the state at the previous time. So if we start somewhere, we can unroll the network multiple times. My block here receives the current input x[t] and the previous state h[t−1] coming in from one side, and produces an output through an affine transformation. The only thing is that, besides the input arriving at this time, there is some input coming from the past, and I produce an output that I hand over to the next iteration, the next time step in the future. Again: input, something coming from the past, my forward pass, and so on. At the end I try to enforce that the final output matches my target — my final sequence, or final representation, or whatever we are trying to do. So there is no big mystery in how to train this recurrent neural network. Although the diagram seems to show a connection from the output back to the input, there is no such instantaneous connection: it is a temporal connection, a delay module, so the connection comes from the previous time step. It's not a big deal. Good so far?

Let's see an example of how to train this — say we would like to model a language. You feed in some C code and you'd like the network to learn how to program in C, or you feed in some English and you'd like the network to figure out the structure of the language. You have a sequence, and you would like to learn more about that sequence. If I say "today was a rainy…" — what is the end of my sentence? "Day", perhaps, or "morning"; you can already figure out the possible completions. So what we are going to do is simply train a neural network to predict the next symbol in a sequence, and this task forces learning onto the network: to do it, the network has to actually understand the structure of this kind of sequence.
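Before moving on to batching, here is the recurrence above written out as code — a minimal one-step sketch assuming tanh as the state nonlinearity and leaving the output nonlinearity out; the names and shapes follow the q, n, s notation used a moment ago.

```python
import torch

def rnn_step(x_t, h_prev, W_h, b_h, W_y, b_y):
    """One step of the vanilla recurrence: h[t] = f(W_h [x[t]; h[t-1]] + b_h).

    x_t:    (q,)       current input
    h_prev: (n,)       previous hidden state (zeros at t = 0)
    W_h:    (n, q + n) concatenation of the input-to-hidden and hidden-to-hidden matrices
    W_y:    (s, n)     read-out matrix
    """
    z = torch.cat([x_t, h_prev])        # stack the input on top of the previous state
    h_t = torch.tanh(W_h @ z + b_h)     # new hidden state
    y_t = W_y @ h_t + b_y               # prediction (apply g, e.g. softmax, as needed)
    return h_t, y_t
```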
So we have a sequence where each character may represent a symbol — they may be letters in a word, words in a sentence, values in a stock market; you name it. The first thing we have to do is batchify it. Why? Because when we work with batches we can accelerate processing: instead of doing one operation at a time we perform multiple operations at a time, and as you know from earlier courses, vectorised operations run faster. So I take my sequence a, b, c, d, e, f, … and I cut it: the first part here, the second part there, and so on. In this case I will be operating on a, g, m, s together, trying to predict b, h, n, t: given a I predict b, given g I predict h, given m I predict n, given s I predict t. That is batchification; you can find the full example at the link at the bottom.

[Question: don't you lose some sequence information this way?] Yes — when you break the sequence into batches, the connection between batches is broken; in the example, s and the character that follows it are consecutive in the original sequence, but that link is cut. Still, say this sequence is 1,000 characters long and I divide it into four sections: each is 250 characters long. With 10,000 characters you would have 2,500-long pieces. So yes, you lose some information, but it is negligible.

The batch size is the number of columns you stack: you decide on it, and stack that many columns. The first chunk is then my x[1:T] — one, two, three time steps in this case — my temporal chunk, where T, the BPTT period, is the temporal chunk length; the other dimension is the batch size, how many computations I do at the same time. Below the input x I have the output y: given (a, g, m, s) I try to predict (b, h, n, t); given (b, h, n, t) I try to predict (c, i, o, u); and given (c, i, o, u) I try to predict (d, j, p, v), all together. So within a chunk I can capture relationships that are up to three steps apart.
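Here is a small sketch of this batchification and chunking, assuming plain Python strings; the helper names and the 24-letter example are mine, but the first chunk reproduces the (a g m s) → (b h n t) example above.

```python
def batchify(seq, batch_size):
    """Cut one long sequence into `batch_size` parallel segments (the columns)."""
    seg_len = len(seq) // batch_size          # e.g. 24 symbols / 4 columns -> 6
    seq = seq[:seg_len * batch_size]          # drop the ragged tail
    return [seq[b * seg_len:(b + 1) * seg_len] for b in range(batch_size)]

def bptt_chunks(columns, T):
    """Yield (input, target) chunks of temporal length T; the target is the input shifted by one."""
    seg_len = len(columns[0])
    for start in range(0, seg_len - T, T):
        x = [col[start:start + T] for col in columns]
        y = [col[start + 1:start + T + 1] for col in columns]
        yield x, y

cols = batchify("abcdefghijklmnopqrstuvwx", 4)   # ['abcdef', 'ghijkl', 'mnopqr', 'stuvwx']
x, y = next(bptt_chunks(cols, 3))
# x = ['abc', 'ghi', 'mno', 'stu']  -> time-major rows (a g m s), (b h n t), (c i o u)
# y = ['bcd', 'hij', 'nop', 'tuv']  -> each column shifted by one symbol
```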
All right, so let's see how it works. I take my first element, (a, g, m, s), feed it into my neural net, and say: network, your output has to match this target here. It tries to do something. Then I take the second input; but first I take the hidden representation produced by the first step and feed those two things into the second replica of the same network — it is the same network, just one time step later — and there I force the output towards the next target. Then again I have the third input of this temporal chunk: I take the representation of whatever state my system was in, feed it into the neural network together with the new input, and enforce the output to match the last target. By performing this kind of training, the network keeps refining its ability to predict the next symbol, and in this way it learns a representation of the language.

Note that the final h[T] here has a bar over it. When I train — when I put a loss on these outputs — a signal comes back down, my gradient, and the gradient always flows against the direction of the arrows: one arrow of gradient goes down here, one follows this direction, one goes down there, then a vertical one, and so on. Now, you might think: why not just feed in the whole sequence? Because you would get a memory overflow — every step you add keeps growing your computational graph, and after a while you run out of memory. So at the boundary I say: stop the gradient here, I don't want to track what happens next; just store the value and detach it — I don't want any gradient coming from the future. In the same way, at the start I initialise the state, which may simply be zero because we just started, and I don't send any gradient into the past either. We have to chop the data into chunks, otherwise this thing becomes very large. And this is how you train a recurrent neural network for language modelling: you just learn how to express a sequence this way. Good so far? More or less? Nod your heads — very good.
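A minimal sketch of this "carry the state forward, cut the gradient at the chunk boundary" idea, assuming a PyTorch-style model that returns (prediction, hidden state); the names are illustrative, not the notebook's code.

```python
import torch

def train_truncated_bptt(model, chunks, criterion, optimizer):
    """Truncated BPTT: the hidden state is carried across chunks, the gradient is not."""
    h = None                          # initial state; assume the model treats None as zeros
    for x, y in chunks:               # each chunk spans T time steps
        optimizer.zero_grad()
        y_hat, h = model(x, h)        # forward pass through one temporal chunk
        loss = criterion(y_hat, y)
        loss.backward()               # gradient flows back at most T steps
        optimizer.step()
        h = h.detach()                # keep the value, drop the graph: no gradient to the past
```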
So, for the last part before the exercises — then we'll have fun with the notebook, you'll get hands-on experience, and you can maybe recycle the code for your own applications later — we are going to look at one more architecture of recurrent neural network, one that tries to solve a specific problem. We are going to talk about long short-term memory, a particular way of implementing a gated recurrent neural network. What am I talking about? Sometimes you would like to capture something that happens between this symbol here and that symbol way over there, with very many symbols in between. There is a real possibility that the network notes "okay, that was a meaningful symbol", then sees so many other things that by the time it reaches the z it has forgotten about the a: "did I see an a? I don't remember, I saw so many other things." The internal state always depends on the previous input, so it is very likely that after some length of sequence you start losing memory of what happened early on. Do you remember exactly the first slide I showed you on the first day? It was so pretty, right? But you start forgetting things — like the things you learned as an undergraduate. (Actually, that is the point: I'm trying to build some long-term memory in your mind. It's pedagogical.)

So the point is that if you use this very simple version of the recurrent network, you lose track of long-term dependencies in your data. Someone had to come up with another way of memorising things, and in this specific case we are going to have a switch that says "just remember that, and keep it aside." We are going to introduce switches, or gates, and these gates let me store information for long-term retrieval. You put something in a little pocket and say "I store this for later use" — like copying formulas on your hand before an exam so you don't forget them. (Not that you should cheat.)

All right: long short-term memory, LSTM. Let's see how it works. For the vanilla, plain recurrent neural network we had: the hidden representation is an affine transformation of the concatenation of the input and the previous hidden representation, and the output is the same thing we have seen all along — an affine transformation of the hidden representation followed by a nonlinearity. A nice drawing of those equations is the following: the input, concatenated with the previous state, goes into a tanh — a hyperbolic tangent — as the nonlinearity, and that gives h[t], my current state; then h[t] is fed through another nonlinearity, with the affine transformations sitting on the arrows, and that gives my output. No news here; the only difference from an ordinary neural net is the concatenation. Is the diagram clear? Good, because now I'm going to show you something a bit scarier. Don't be afraid, don't scream, don't run away — stick with me and I'll explain it. Don't even try to read the equations, it's unnecessary; I put them there for reference, and to scare you a little.

So, this diagram. It is a bit complicated, but it can be explained with colours, because I like colours. The balls represent nonlinearities, the arrows represent affine transformations, the big dot is element-wise multiplication, and there is a summation over here. The input is, again, the concatenation of my input and the hidden state. On the output side I have my current state h, and I also have c, a memory cell: the previous c comes in on one side and the new c goes out on the other. So there is a memory cell that is separate from my internal state — an additional memory I am going to be using. So far it is very similar to the previous diagram: similar balls, but with different nonlinearities.
The hyperbolic tangents go from −1 to +1; the sigmoids go from 0 to 1. The sigmoids are my switches: 0 means the switch is off, 1 means it is on. The tanh units go from −1 to +1 to represent the full range. Is it clear what those symbols in the diagram represent?

[Could you repeat that?] Of course. The balls represent nonlinearities — "th" stands for hyperbolic tangent; I just couldn't fit the whole word inside the ball — and the lines represent affine transformations. The same holds in the bigger diagram: balls and lines have the same meaning, and in addition we have element-wise multiplication, the big dot, and summations — two more operators. The other difference is that there are sigmoids here, which go from 0 to 1, whereas the hyperbolic tangent goes from −1 to +1; the sigmoid is basically a rescaled hyperbolic tangent, no other differences. Same kind of input: the concatenation of the input and the previous state. The output is my h, and there is an additional wire associated with my internal memory, which is where I store my information — like a secret box where I keep my nice things. There is a very scary sequence of equations here which I don't care whether you read; I'm just showing you some additional blocks: nonlinearities, affine transformations, multiplications, additions. Good so far, everyone?

[So the sigmoid is like a switch that can be 0 or 1?] It is the classical logistic sigmoid — like the tanh, but with the −1 end moved up to 0, passing through 0.5 at zero. Think of it in signal-processing terms, as in an electrical diagram: it goes from 0 to +1; below about −5 it is essentially 0, above about +5 it is essentially 1, roughly linear in between, and 0.5 at the centre. It is a nonlinear function, basically a hyperbolic tangent whose −1 part has been mapped to 0.

[Does the dimensionality of c have to match h, or could it be anything?] c and h have the same size, because h is just the element-wise multiplication of the output gate with a function of c. But these are details — not important right now.

So, the first block is the input block of the system: it deals with "am I going to put new information into my memory, or not?" Then there is a "don't forget" block. If you read any article, blog post or talk, you will hear about the "forget" module — but really it is the non-forget module; they got the Boolean logic backwards.
Upside down, anyhow: the non-forget block. This is the block that decides which information you do not want to forget, and everyone calls it the forget gate — that's why it is written f, for forget — but really there should be a bar on top: not-forget. This annoys me (this field got a few other things backwards too). So the non-forget block is associated with the information you want to keep, what you want to take note of. And the last part is the output block, which decides what information you would like to take out of your memory and provide to the output.

To summarise: the yellow block, the input block, decides what you want to store from the input sequence into your internal memory; the purple — pink, whatever — block decides whether you keep in memory, or forget, what you put there before; and the green/blue one decides what you provide to the output. Finally, there is some similarity to the original diagram: we create a candidate, a prototype for our memory, and there is a final part for the output which corresponds to the output part of the vanilla network — I show the two sets of equations only to point out the similarity. Again, don't pay attention to the equations; just keep the diagram in mind: input, don't-forget, output.

[Shouldn't there be a y output? I don't see it here.] My output is h in this case: h is the current output in this notation, so there is no separate block. If you want a final output, you can add an additional linear block at the end, like the one we had before. Usually, though, you stack several of these blocks on top of each other, and only at the very end do you add the final output conversion. All right — any questions, not about the equations, but about the three main blocks? No questions? All right, so let's see how they work. Are you excited? Oh, come on — seriously, that was your excitement? All right.
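For reference only — this is the standard textbook formulation of what the coloured diagram encodes, with f being the gate the lecture calls "don't forget" (f = 1 keeps the memory, f = 0 erases it):

```latex
\begin{aligned}
i[t] &= \sigma\big(W_i\,[x[t];\,h[t-1]] + b_i\big) &&\text{input gate}\\
f[t] &= \sigma\big(W_f\,[x[t];\,h[t-1]] + b_f\big) &&\text{``don't forget'' gate}\\
o[t] &= \sigma\big(W_o\,[x[t];\,h[t-1]] + b_o\big) &&\text{output gate}\\
\tilde c[t] &= \tanh\big(W_c\,[x[t];\,h[t-1]] + b_c\big) &&\text{candidate memory}\\
c[t] &= f[t] \odot c[t-1] + i[t] \odot \tilde c[t] &&\text{memory cell update}\\
h[t] &= o[t] \odot \tanh\big(c[t]\big) &&\text{hidden state / output}
\end{aligned}
```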
Controlling the output. Let's say I would like to turn off my current output: whatever happens at the input, whatever is in my memory, I give you nothing — just zero, no information. How do we turn off the output? Suppose, for the sake of explanation, that my sigmoids are saturated: we are working beyond the ±5 range, so above +5 the sigmoid is +1 and below −5 it is 0. (The network learns by itself how to drive its values into those ranges.) In the colour code, green means the sigmoid gives 1 and red means it gives 0.

Say I have an internal representation — the purple colour, my internal information — and I don't want to give it to you. I'm greedy: all for me, my treasure, my precious. So I put a red flag on that output sigmoid, and since I perform an element-wise multiplication, I'm multiplying by zeros and I give you nothing. If I turn off my output sigmoid, I don't let any information through. On the contrary, if I'm feeling generous and would like to give you something, I set that sigmoid to green, and the multiplication of my internal representation by one simply gives you my output. Note that I am talking about a single unit here: these are vectors, not scalars, so there are multiple values and this is done independently for each of them — we are just considering one specific element. So again: if my output sigmoid gives a zero, nothing comes to you, because of the element-wise multiplication in this equation; if it gives a one, you get the hyperbolic tangent of that specific element of my memory cell. Because it is element-wise, I can choose which elements I give you. OK so far?

So that was the output gate. Ready for the next one? Ta-da — so many colours. This is controlling the memory. We can perform three different operations on the memory block, and for these the input block and the don't-forget block work in tandem.

Let's start with reset: I would like to reset my memory. I learned so much in those economics classes and I don't care about it anymore, so I'd like to erase everything I learned — it's unnecessary. How do we erase memory? Say I have some idea of what I was doing — the purple colour — as my current input: someone is talking to me, telling me many things, and I think "oh no, so many things to put into my head, I won't remember them; I have to remove the things I stored previously." The blue is the previous content of my memory, coming from the last state. So first, I put a zero on the input gate: I don't listen, I can't listen, I'm overwhelmed, and nothing comes through. (Please don't stop listening yourselves, though, or you won't follow what I'm saying.) On the other side, I had something in my mind and I need to clear it, so I put red on my don't-forget gate — meaning I actually do want to forget — and that multiplication writes a zero. I sum zeros, I get zero, and I have erased my memory; the output of the cell is zero too. So with a red sigmoid on the input gate and a red sigmoid on the not-forget gate, I basically erase the memory: both of them have to be zero. That is how you erase memory. All right, so let's do keep.
I have some input here — I'm listening to something — but I don't want to let new information in, so I write a zero on the input gate, and that multiplication stays at zero. On the other side, my don't-forget gate has a green light, which means I would like to keep what I was thinking about; the multiplication of my memory by one keeps it blue, and the sum of zero and blue is blue, so the blue is preserved for the next step. In this way I can preserve my previous memory for the next iterations: I heard something before, I want to keep it in mind, and this is how I keep it in mind.

Finally, you are actually interested in what I'm telling you today, so you think "let's memorise this." How do we memorise new information? We put a green light on the input sigmoid, so the purple candidate gets through — purple times one is purple. On the other side you have something in your mind — again those economics lectures, microeconomics, I don't even remember what it was, I don't want to remember — so I put a zero on that sigmoid; the multiplication erases the content of my memory, the summation lets the new input through, and finally I have the new information in my cell.

So now you can see the mechanism, which I keep repeating over and over: I can arbitrarily decide whether to store new information, to reset, to keep my information, and whether or not to output some information I stored before. With this gating mechanism deciding which of those actions to perform, you can remember things for as long as you like: as long as you don't reset or overwrite, the information is preserved in your memory and you don't have to worry about forgetting it — it sits in your working memory. The classic recurrent neural network may forget things, because it always sees new stuff, new stuff, new stuff — oh no, too much, I'm overwhelmed — whereas here we have boxes where we put information whenever we care about it, and we can empty those boxes whenever we stop caring. And that is basically it for the LSTM: long short-term memory allows you to store information for later retrieval.
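To tie the coloured walkthrough together, here is a tiny sketch (my own illustrative code, not the lecture's) of the memory update with the gates forced to the saturated 0/1 values used above, so that reset, keep, and write appear as special cases:

```python
import torch

def lstm_memory_update(c_prev, candidate, i_gate, f_gate, o_gate):
    """LSTM cell update with externally supplied gate values (here hard 0s and 1s)."""
    c = f_gate * c_prev + i_gate * candidate   # "don't forget" * old memory + input gate * new content
    h = o_gate * torch.tanh(c)                 # output gate decides what leaves the cell
    return c, h

c_prev    = torch.tensor([0.7])    # blue: what was already in memory
candidate = torch.tensor([-0.4])   # purple: proposed new content
zero, one = torch.tensor([0.]), torch.tensor([1.])

print(lstm_memory_update(c_prev, candidate, zero, zero, one))  # reset: cell erased
print(lstm_memory_update(c_prev, candidate, zero, one,  one))  # keep:  old memory preserved
print(lstm_memory_update(c_prev, candidate, one,  zero, one))  # write: new content stored
```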
If there are no other questions — nobody is raising a hand — would you like to start doing some practice, play with these recurrent neural nets, see some exercises, get some meat on the grill (sorry, vegetarians)? As we say in Italy, would you like to get your hands dirty? Are you interested, are you excited? Yes? All right, thank you for listening to this first part; let's get started with the notebooks. (My computer is dead. Hello? … Okay.) Now, the repository: write this command and you will fetch the latest notebooks from GitHub — a git fetch followed by git reset --hard origin/master. The repository gets updated every day, so you will need to run this to get the latest; we didn't want to put everything up at once.

Just a quick recap: when we talked about convolutional neural networks, we showed they are the best type of network for image-processing tasks, because they naturally take advantage of the locality of the data. In principle you could also apply a one-dimensional convolutional net to sequential data, but you wouldn't be able to memorise long-term dependencies. Say you are doing time-series analysis and the look-back — the number of steps into the past you care about — is rather long: a convolutional net can only "see" within the size of its kernel. And the simple recurrent net has trouble learning long sequences because of backprop through time: if the unrolling length is long enough, the gradient vanishes or explodes, which makes learning more difficult. Here we will look at some of these challenges of training on sequence data.

So, the notebook — it is actually not numbered 01 in your copy. Let's start in the folder raw: after you run the git reset command you will see a list of notebook files; go to raw and start with the Keras sequences notebook. First we will understand the problem as implemented in Keras, and then try to port it — fully converting it to PyTorch will partly be a take-home assignment. Open notebook 01, and we will also look a little at the sequence-tasks Python file, which provides the classes used to generate the sequences for this assignment.

We are going to reproduce the original experiment from Hochreiter and Schmidhuber, the authors of the LSTM paper, who generated sequences of variable length (is that readable? good, not too dark). We will switch between different regimes, generating sequences of length 7 to 9, or 10 to 11, and maybe longer. These sequences have a fixed starting and ending character — trigger characters — and in between they consist of a predefined set of characters sampled at random, lowercase a, b, c, d, with some characters substituted by special characters, in our case X and Y. With those we define rules for sequence classification: if the two special characters are X and X we classify the sequence as Q; X and Y gives R; Y and X gives S; Y and Y gives U. Let me show you — I'll quickly change the kernel — and we start with the easy regime. If you come back to the notebook after the school: here we generate the data, and these lines set the difficulty level, which is essentially the length of the sequences. And here is how our data looks.
As I said, we have character sequences of length 7 to 9, with the fixed starting and ending trigger characters, and in between lowercase a, b, c, d sampled at random with the two special characters inserted. Of course, if our goal were just to classify those sequences as containing X X, X Y, Y X or Y Y, we could do it by hand with no learning; but here we want to demonstrate that an LSTM — an RNN — is actually capable of learning rules that were defined by hand.

Since we are doing sequence classification — a many-to-one task, sequence to vector — we want to encode every character, and also encode the labels. Each character is encoded in a nine-dimensional space, and each label is a four-dimensional vector, because there are only four possible outcomes. Once that is done, this is how our data looks in vectorised — essentially one-hot encoded — format.

The last step before training is batching. For an LSTM or RNN the tensors we prepare typically have three dimensions: the first is the batch size; the second is the temporal dimension, the sequence length; and the last is the feature dimension. If you analyse a scalar time series the feature dimension is one; if it is a vector time series — at every time step you have a two-dimensional vector — the feature dimension is two. The number of time steps does not have to be the full length of your series; it is the sub-sample of the whole sequence that you feed at once, in one iteration, to the RNN. So do spend some time getting a feeling for how the data looks — it is very important to know the dimensions of your data.

Next, since we don't have much time I'll go quickly: we skip the following part (it is something we will use for the next exercise) and proceed straight to notebook 1-1. As I said, in this first approximation we use Keras and a simple RNN. We generate some data — sequences produced according to the rules introduced in the previous notebook — and, since based on yesterday's poll at least half of you are familiar with Keras, here we have a good old sequential Keras model consisting of a SimpleRNN followed by a Dense layer. We use the RNN to encode our sequences, which are then classified by the Dense layer with a softmax activation: the softmax outputs a probability distribution, we take the argmax, and that decides which type of sequence it is. An important point to emphasise here is the number of hidden units — probably the most important parameter to fix when you set up your SimpleRNN or LSTM. The final number can be decided empirically, by hyperparameter optimisation, but typically it will be smaller than the input sequence length: our sequence lengths are 7 to 9, and the number of units we start with is four.
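This is roughly what the notebook's model looks like as code; the layer stack and the sizes (sequences padded to length 9, nine one-hot character classes, four sequence classes, four hidden units, batch size 32, 30 epochs) come from the lecture, while the optimizer and loss below are my assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

seq_len, n_features, n_hidden, n_classes = 9, 9, 4, 4   # easy regime from the lecture

model = Sequential([
    SimpleRNN(n_hidden, input_shape=(seq_len, n_features)),  # sequence -> vector (last hidden state)
    Dense(n_classes, activation='softmax'),                  # vector -> class probabilities
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=32, epochs=30, validation_data=(x_test, y_test))
# Swapping SimpleRNN for LSTM is the one-line change discussed below.
```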
It will be typically less So another thing is that as you've seen our sequences are actually not fixed lengths For deep learning frameworks, which use a static graph like tensorflow and keras It's a bit of a problem. So we have to do something and that's something is padded. We pad to the max length Uh, typically, uh, we padded zero padded to the max lengths And that's uh, what's if you look if you want to go into details, I'm kind of going fast in the interest of time There will be a place Where those sequences are padded Uh, after the padding, they will all have the fixed, uh, lens which is equal to the max lens In pytorch, however, who uses dynamic graph, we actually don't have to do the padding and you will see it later Which makes it quite attractive for sequence learning tasks And the number, uh, the number of units hidden units in the dense layer would be equal to number of classes for the Number of possible sequence types All right And let's train it Here the batch size we use is 32 number of hidden units is four and maximum number of epochs is 30 Let's evaluate all of them And we're good to go. So let's see how it Goes as you can see the validation accuracy. That's our training accuracy improves Uh, and it quickly reaches, uh, 100 percent um, and uh validation or the out of sample accuracy is also 100 percent which shows that the generalization is good Uh, we can uh, I'm assuming you will play around more with the notebooks, uh, of uh offline but, uh basically The first place to go here is to change the difficulty And try to generate, uh, hard, uh data In this case the sequences would be longer And this is the regime where we would expect the simple RNM to have trouble Because for short sequences like 7, 9, uh, DPT, the unrolling net 7, 9, DPT would have no problem We would, the gradients wouldn't vanish and we would learn, we would learn, uh, everything that we have to learn But, uh, with heart you can see that, uh, RNM is actually, uh, simple RNM is actually in trouble It cannot learn, uh, it only, uh, stops at 25 percent accuracy and, uh Please, uh, try, uh, running with LSTM So basically, uh, you would, uh, after the class you would After the school you would, uh, change the simple RNM to LSTM and see how the result changes We can do that here, right? 
It's one line — it's one line, although we are kind of out of time. Okay, let's just try it; since there is the frenzy, let's just change it. The training will be a bit slower. There are still only four units, which may be too few… okay, yes — so the simple change to LSTM will not help here, although it should in principle. The problem is that the hard task we are setting is very, very hard, so even the dimension of the LSTM we are using right now is too tiny to capture those very long-term dependencies; we would have to increase the size of the internal representation, and then it takes forever to train — that's why you usually need GPUs. With the medium difficulty we may be able to see the difference between the simple recurrent network and the LSTM.

One other thing: for sequences as long as the ones in the hard setting, we wouldn't want to feed each one in all at once. Right now we take the sequences, pad them to the maximum length, and feed them in whole — not the ideal scenario for learning long sequences. We want to feed them in sub-sequences; this is something we will do for the echo exercise: we define a truncated length and feed smaller sub-sequences, maybe of length 20. The key thing to keep in mind is that we do not reset the states in between. As Alfredo was saying, in the LSTM we have the input gate, the forget — or rather non-forget — gate, and the output gate; you can think of them as using sigmoids to decide which inputs to add, which memories to flush, and which components to output. We won't reset the states of the cell between those sub-sequences; we only reset at the end of the whole sequence. (Let's try it with medium real quick and see if the simple RNN does better… yes, I think for this case, even with medium, we would have to feed sub-sequences as well.)

All right, in the interest of time let's proceed to the next exercise: the sequence echo. We will have a sequence consisting of zeros and ones, of total length 20,000, and our task is to output that sequence delayed by a certain time interval. With a sequence length of 20,000 — and actually even with far shorter sequences — simple RNNs, and even LSTMs, would have a problem; that is why we define the truncated length. This is what is called stateful training in Keras (and it is actually what was used in the previous exercise). It means that, given a few long sequences, the first thing we do is pad them to the maximum length; we do not feed a sequence in all at once; and we batch different sequences together. It is important not to put different subsets of a single sequence into the same batch, because they carry temporal information that needs to flow:
one sequence should stay as one, and we split it into sub-sequences. It is better if the sub-sequence length divides the total length exactly — if the total length is a multiple of it: say the total length is 100 time steps, we would choose a sub-sequence length of 25 and get four sub-sequences. The important point is this: what we feed to the neural network is this tensor, from here to here, and on the next step we feed the next tensor, but we do not reset the states in between — not here, and not here either. Only when we feed the next batch, a fresh set of sequences, do we reset the states.

This is a different task — a many-to-many task. Before we start, let's look at the data. You see a sequence of zeros and ones, with a total length of 20,000, and the output sequence is the same as the input: you can see 0, 1, 1, 1, 0, 1 starting at some step, and then 0, 1, 1, 1, 0, 1 again — the same sequence, shifted. Next we batch it: the raw input shape is 5 by 20,000, and the batched input shape is, as before, (batch size, sequence length — here the truncated sequence length — feature dimension). Since this is a one-dimensional scalar series, the feature dimension is one, and the number of time steps is the truncated length — the 10 right here — which is how much we feed to the RNN at once.

Once we have familiarised ourselves with the data, let's proceed to notebook 1-2. Here we generate the data and first define a model: a sequential model consisting of a SimpleRNN followed by a time-distributed dense layer. TimeDistributed is a layer wrapper defined in Keras; at the end of the day it applies a dense transformation — with a sigmoid activation and one hidden unit — to every temporal slice. We also return sequences: as Alfredo was explaining, the LSTM cell or the simple RNN outputs a sequence, and for classification tasks we typically take only the very last state; for sequence-to-sequence learning we want to retain the whole sequence, and since we cannot apply a fully connected layer to the whole sequence at once, we apply it to every temporal slice. The stateful flag here means exactly what we just discussed, and statefulness is actually the key to fixing the first exercise for longer sequences: feed sub-sequences of the total sequence without resetting states in between. Finally, the echo step is by how much we delay the output sequence: as you remember from the data, the output is identically the same sequence of zeros and ones as the input, just delayed by a certain number of steps, and the echo step defines that delay.
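A sketch of the stateful, many-to-many model just described, using the sizes mentioned above (5 parallel sequences, truncated length 10, scalar features, four hidden units); the optimizer and loss are again my own assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, TimeDistributed

batch_size, truncated_len, n_features = 5, 10, 1   # echo task: 5 binary sequences, 10 steps per chunk

model = Sequential([
    SimpleRNN(4, return_sequences=True, stateful=True,
              batch_input_shape=(batch_size, truncated_len, n_features)),
    TimeDistributed(Dense(1, activation='sigmoid')),   # one prediction per time step
])
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Feed the consecutive sub-sequences of each long sequence in order, WITHOUT resetting
# states in between; call model.reset_states() only when a fresh set of sequences begins.
```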
This means the network has to actually learn a way of representing, and keeping in memory, whatever it has seen over the previous — in this case three — steps, so that it is able to reproduce the sequence after a delay. If you lengthen the echo step to 10, the recurrent network has to figure out a compact representation of 10 symbols in order to start spitting out that same sequence 10 intervals later. So again we have compression of a sequence into an internal representation, which keeps being updated over time, and then re-expansion of that condensed representation into a sequence. As you can see, with this setup — which is an easy setup — even the simple recurrent net is able to learn the representation: we get 92-something percent on the training set and 100 percent on the test set.

There are a few ways to make it harder. The first is to increase the truncated length — strictly speaking you don't have to, but just to demonstrate: the truncated length controls the size of the sub-sequences fed at once to the RNN or LSTM, and 50 might not be enough… actually 50 is good enough here. Let's instead increase the echo step. If the echo step is larger than the truncated length, the network's best guess is just to pick a zero or a one at random: as we have seen, these sequences are made of binary digits, so if you don't know, you guess, and with a 50 percent chance of a zero or a one the overall accuracy is just 50 percent — output all ones and you get half of the digits correct. So instead of an echo step of three, let's go to six. Before, we were able to memorise those sequences correctly — hold on, with four hidden units, how many steps can you memorise? What is two to the power of four? What is four to the power of two? Is it the same? Sixteen. (Okay, it's a joke, never mind.) So four units should be able to memorise up to 16 elements, and for anything below 16 I would expect an accuracy of around 100 percent. But as we have just seen, going from three to six echo steps this network does not manage to reach 100 percent. If we switch to an LSTM instead, we will see that it handles long-term dependencies easily, even really long ones — up to 20, say, with a truncated length of 20. So now we switch to the LSTM: the accuracy was 86 percent before; let's see what we get now — we are at 57 while it trains. With three echo steps it was 100, then we switched to six
Now we switch to the LSTM with an echo step of six, and it should perform better than the simple RNN; let's see. To be fair, the LSTM is using four times more parameters than the RNN. I was going to say that: the LSTM has a lot more parameters than the simple RNN, so it is not quite a fair comparison to train both for the same number of epochs. So we run it for 10 epochs.

But the main point here is this: we don't feed the whole sequence at once, we split it into subsequences of the truncated length and feed them sequentially, without resetting the states of the LSTM in between. In fact, for the simple RNN there is no inner cell, so there is no carried cell memory; that approach wouldn't work as successfully there, although reducing the truncated length would give it an advantage.

The last thing, while it's running, and we will go over it a little quicker than we really should, is how to re-implement all of this in PyTorch; part of that will be given as a take-home exercise for you to play with. One good thing about PyTorch is that it has a dynamic graph, which means we won't have to do padding of sequences and we won't have to use a time-distributed layer. (The Keras run is improving, but very slowly. One other thing we could try is changing the learning rate, but for now let's move on to PyTorch.)

All right, so the PyTorch exercise is in the main section here. We are not going to touch 08-1 and 08-2, because those just describe the data: in one case it is the manually generated sequences with a certain rule encoded, like XXR, XYQ, and so on. We will jump straight to the temporal sequence here.

So we start by importing the data; the data generator is not changed here, and the difficulty is the same. The neural network is again defined by a class derived from nn.Module, with a constructor. Within the constructor we define the logic of our network, which consists of the RNN followed by a linear layer. The RNN takes three main parameters here: the input size, which in this case is the length of the sequence; the RNN hidden size, which we fix manually to four in the beginning; and the number of layers, which will be one in this case. We can also specify the non-linearity used in the RNN. If you want consistency of the dimensions with Keras, you use batch_first=True, which puts the batch dimension first, so you will have batch size, sequence length, and feature dimension; otherwise, by default, the batch size will be the second dimension. The linear part is just a linear layer from the hidden size to the output size, and we use a softmax output activation. Potentially you can also use the nn.Sequential constructor: the same thing we have seen before with Keras can also be done with PyTorch whenever you have a simple sequence of blocks, right?
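As a rough picture of that class-based definition, here is a hedged sketch. The class name and layer sizes are illustrative, and leaving the softmax to the loss function (nn.CrossEntropyLoss applies it internally) is my own simplification rather than the exact notebook code.

```python
# Hedged sketch of an nn.Module with an RNN followed by a linear layer.
import torch
import torch.nn as nn

class SimpleRNNClassifier(nn.Module):
    def __init__(self, input_size, hidden_size=4, output_size=4, num_layers=1):
        super().__init__()
        # batch_first=True -> tensors are (batch, seq_len, features), as in Keras.
        self.rnn = nn.RNN(input_size, hidden_size,
                          num_layers=num_layers,
                          batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, hidden = self.rnn(x)        # out: (batch, seq_len, hidden_size)
        last = out[:, -1, :]             # keep only the last time step for classification
        # Raw class scores; nn.CrossEntropyLoss applies the softmax internally,
        # so the explicit softmax mentioned in the lecture is folded into the loss here.
        return self.linear(last)
```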
So that is the more versatile way of creating a model, but otherwise you can simply use the pre-made nn.Sequential.

Now, a quick look at the training loop. There is no need to define a training loop in Keras, because we use its estimator interface: all we do is pass the parameters and call .fit. Here you see the usual structure of hand-written training loops in PyTorch (a minimal sketch of this loop appears at the end of this passage). model.train() simply sets the parameters of layers that behave differently between the training and test phases, like dropout and batch normalization. Then we have the regular batch loop over the training set; if your data comes from a generator, you can use enumerate and take the data and target directly in the for loop, but here I am doing it with a range. Since our data initially comes as a set of NumPy arrays, we have to convert to Torch tensors using the from_numpy function and then specify the precision of the tensor. .to(device) will place it on the CUDA/GPU device if you have one. Then optimizer.zero_grad() at every step, because gradients are accumulated over each mini-batch iteration. Next comes the forward pass, which returns us a prediction (in the case of an RNN it can also return a hidden state if we ask, as you will see in the next example), then the loss calculation, the backward call which does the backprop, and finally the weight update with the optimizer step.

Yeah, we will publish it. What we can do, because I want to make sure that we get to the next piece, which includes the survey and the reimbursement information: anybody who doesn't need to leave immediately, we can switch back to this afterwards. I need about 20 more minutes. I'm kind of hungry, I haven't had enough food so far, but there are cookies out there. So what do you want to do, break for cookies for 10 minutes, or continue here for another 10 to 15 minutes? Let's continue then.

All right, awesome. One more thing here: these parts will look almost exactly the same for all the models. There will be some differences, for instance for the RNN when you return the hidden state, which I will show you in the next example.
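Here is a minimal sketch of that raw PyTorch training loop. The model, optimizer, criterion, and the NumPy arrays x_np and y_np are assumed to exist already, so this shows the generic pattern rather than the notebook's exact code.

```python
# Generic PyTorch training loop following the steps described above.
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train_one_epoch(model, optimizer, criterion, x_np, y_np, batch_size=32):
    model.train()                                        # training-mode dropout / batch norm
    for i in range(0, len(x_np), batch_size):            # plain range-based batch loop
        data   = torch.from_numpy(x_np[i:i + batch_size]).float().to(device)
        target = torch.from_numpy(y_np[i:i + batch_size]).long().to(device)

        optimizer.zero_grad()                            # gradients accumulate otherwise
        output = model(data)                             # forward pass
        loss = criterion(output, target)                 # loss calculation
        loss.backward()                                  # backprop
        optimizer.step()                                 # weight update
```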
Here is the accuracy calculation (a short sketch of it follows at the end of this passage). In Keras you don't actually have to specify how to calculate the accuracy: you just say metrics equals accuracy, and it will decide, based on the output activation, how to compute it for you. PyTorch in this sense is a lot lower level, because you actually have to think about what you are doing. In this case the output activation is a softmax, meaning it outputs a floating-point vector of probabilities, and since we are doing sequence classification here, to decide which class the sequence belongs to we have to identify the index corresponding to the maximum probability. That is exactly what we do here, except that torch.max returns both a value and an index, and we can ask it to keep the dimension with keepdim=True. Then we check where the target, the true value, matches the predicted value with pred.eq(). The dimension of the target will be the batch size by the feature dimension, so in this case 32 by 4: the batch size is 32 and 4 is the number of possible output classes. The prediction has a different shape, so we have to reshape one to match the other before comparing; after that we sum over the batch and call .item() to get a scalar out of the torch tensor.

There is nothing special in the test loop; the only thing to remember is to call model.eval() at the beginning of it, which disables layers like dropout and batch normalization, since those are not used in the test phase. Finally, we use torch.no_grad() (available in torch versions 0.4 and later), which basically causes the graph not to be created for the inference, or prediction, stage, which saves time; in earlier versions we would use volatile variables and other tricks.

Okay, so here I am constructing the model and defining the loss, which is the cross-entropy loss because we have a multi-class classification here, and RMSprop as the optimizer; you can use anything else, there is Adam for example. Everything else looks the same. Here is our epochs loop, where we call the train method every epoch; we actually do the testing only once at the end of the maximum number of epochs, but typically the best approach would be to run a validation step after each epoch, which we don't do in this case. Finally, the accuracy is just the ratio of the total number of correct predictions to the training set size, and we want it to be a float, so I cast it.

All right, here we also got almost 96 percent, and that took almost 200; remember that for the easy setting in Keras we got 100 percent very easily. I am going to leave it to you as an exercise to find what is actually not allowing it to reach 100 percent accuracy, and if you find it, please let us know and Alfredo will tweet about the winner. Send a pull request, okay? And of course send a pull request.
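Before moving on to the last notebook, here is a short sketch of that argmax-based accuracy computation. The shapes follow the numbers in the example above (batch of 32, four classes), and the tensors here are stand-ins rather than notebook variables.

```python
# Sketch of the accuracy computation: argmax over class scores,
# reshape the target to match, compare, then sum over the batch.
import torch

output = torch.randn(32, 4)             # stand-in for the model's class scores
target = torch.randint(0, 4, (32,))     # stand-in for the true class indices

# torch.max over dim=1 returns (values, indices); keepdim=True keeps the extra dim.
_, pred = output.max(dim=1, keepdim=True)               # pred: shape (32, 1)

correct = pred.eq(target.view_as(pred)).sum().item()    # scalar count of matches
accuracy = correct / target.size(0)
```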
Finally, the last notebook is 08-4, the echo experiments. Here we follow basically all the same steps. We get the data, which is defined in sequence_data.py, and then we define a model; the minor difference is that we are still going to start with the simple RNN, and in the course of the exercise you can play around with an LSTM or maybe a GRU as well. It consists of only one RNN layer followed by a linear layer with a sigmoid output activation: in this case it is a binary classification task, so we use a sigmoid output and compare whether the output value is closer to one or closer to zero.

The only distinctive thing here is that we are doing stateful training. Remember, we said we want stateful training, meaning we don't want to feed the whole sequence all at once, but we want to split it into subsequences and feed them sequentially without resetting the state every time, corresponding to that picture. To do that, we need to return the hidden state: we return both the output and the hidden state of the RNN, and we feed the hidden state back recursively on the next step. That is what happens here. Also, sometimes we initialize the hidden state ourselves, although we could use the default initialization; strictly speaking that is not required, but if you want to train your neural network faster, it is important to pay attention to the initialization. In Keras those things would typically be done for free, and when I say for free, I mean there are meaningful defaults.

So the only difference in the training loop compared to the previous case is in this line: when we do the forward pass we also return the hidden state, and we feed it in on the next mini-batch iteration as an input to the model; for the first iteration it uses the randomly initialized values. We also calculate the accuracy differently. This is what I mentioned: when we use Keras, we just say metrics equals accuracy and it decides. Here, since in the previous case we used a softmax, we had to take the argmax and then match the predicted category; here we have a sigmoid, so we have to look at whether the output is above 0.5 or below, then compare with the target and sum over the entire batch. And again, in the test we use the torch.no_grad() context manager so that the graph is not created in the inference stage, which saves a lot of time.

As you will see, the default implementation here will be a lot slower, and that is one more task, maybe for someone, to figure out what actually makes it slower in this case; to figure that out, we would have to look at the Keras source code and see what it is using to improve performance.

Yeah, that's all I have. The only thing left is the first exercise: when we do the sequence classification, to convert the training from feeding the entire padded sequence all at once to the stateful training, in the Keras version we would have to say stateful=True when we construct the model, and we would have to change the batch generator to feed the data in subsequences. Similarly, in the Torch version, we would return the hidden state here and pass it in recursively, as in the sketch below. Yeah, okay. Well, that's all for me. Thanks.
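For reference, here is a hedged sketch of that stateful pattern: the RNN returns its hidden state from the forward pass, and the training loop feeds it back in on the next chunk instead of resetting it. The class name, layer sizes, and the x_chunks / y_chunks variables are illustrative assumptions, not the notebook's actual code.

```python
# Stateful-style training in PyTorch: carry the hidden state across sub-sequences.
import torch
import torch.nn as nn

class StatefulEchoRNN(nn.Module):
    def __init__(self, input_size=1, hidden_size=4):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(x, hidden)             # hidden=None -> default initialization
        return torch.sigmoid(self.linear(out)), hidden

model = StatefulEchoRNN()
criterion = nn.BCELoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

hidden = None
# x_chunks / y_chunks: assumed lists of float tensors shaped
# (batch, truncated_len, 1), i.e. sub-sequences of the long echo sequence.
for chunk, chunk_target in zip(x_chunks, y_chunks):
    optimizer.zero_grad()
    pred, hidden = model(chunk, hidden)               # carry the state over, do not reset
    hidden = hidden.detach()                          # truncate backprop at the chunk boundary
    loss = criterion(pred, chunk_target)
    loss.backward()
    optimizer.step()
    # Sigmoid accuracy: threshold at 0.5, then compare with the 0/1 targets.
    correct = (pred > 0.5).float().eq(chunk_target).sum().item()
```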