Hi. Just to clarify, I'm not an expert on PyTorch; PyTorch is about one year old, and I've been using it for somewhere between three and seven months, depending on how you count vacation days. Today we're going to go through a sequence-to-sequence model, which is fun. Let's see what we're doing. So, who saw this problem in the post from Facebook? And who read the papers that I posted? Very few, so let's try to get through as much as possible.

There's this problem where you go to a coffee shop in Singapore, you're not Singaporean, and you try to order coffee. You try to order kopi, imitating the Singaporeans, and what you get is not what's in this picture. Kopi is actually coffee with milk; it is not black coffee. If you want black coffee, you need kopi-O, but that still comes with sugar. If you want black coffee without anything, you have to say kopi-O kosong. And the list goes on and on; this is just six out of a hundred-odd forms.

For those who have not read the papers, here's a crash course; it should take around five minutes. So, who knows what this is? A perceptron; who said that? Is that an Autobot or a Decepticon? Sorry? No, it's not a Transformer, although the Transformer is also the name of another architecture. All a perceptron does is take the inputs x, summarize them with some function, most likely a summation because it's simple, multiply them by the weights it carries, and then pass the result through some transformation so that it fits into this curve. The curve is a sigmoid, which just squashes the output to between 0 and 1, or minus 1 and 1.

And who knows this one? This is called an Elman net. In short, this kind of architecture, or rather this class of architectures, is called a recurrent neural net. First you take x, which is your input, and it goes into some hidden state; that hidden state gets passed to the next step, which takes the next x and passes things along again, and you try to predict y as it goes. There's no fixed input-output split here: what one step produces becomes part of what the next step consumes as you train. In principle it could go on forever, but because of the GPU there will be a cut-off point, and then this is how you predict. So this is one way to structure a recurrent neural net, a very, very old one called the Elman net, but it's essentially what we'll build today.

Now, does anyone understand why it says previous hidden? That's very important. What's the difference between this and the previous slide? No? Okay, let's go back. Previously there was nothing, right? You only had one state: everything is an x, a feature; it goes in. But now you're saying this x is related to that x, and you're trying to learn the weights between here and here before you get to y. That's why it's recurrent, in a sense: it just keeps passing things on. What's really important is that this hidden state is what it's trying to learn, and it passes it on to the next step. But you'll notice that somewhere far back here, it's very hard to see what's there; that's the weakness of recurrent neural nets. Something very helpful for seeing this comes from a well-known blog post. Previously we saw this; but when the input goes to the next step and picks up the previous state, this is how it actually looks.
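To make that recurrence concrete, here's a minimal sketch of the loop in PyTorch; the sizes and names are illustrative, not from the talk's notebook, and it uses current PyTorch conventions rather than the old Variable style:

```python
import torch
import torch.nn as nn

# Elman-style recurrence: each step mixes the current input with the previous
# hidden state, and the hidden state is what gets passed along.
rnn = nn.RNN(input_size=4, hidden_size=8)  # toy sizes

xs = torch.randn(5, 1, 4)      # a sequence of 5 inputs, batch of 1
hidden = torch.zeros(1, 1, 8)  # the initial hidden state

for t in range(xs.size(0)):
    # the output at step t depends on x_t AND on everything before it,
    # carried through `hidden`
    output, hidden = rnn(xs[t:t+1], hidden)
```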
As an input, you take the previous hidden state, which carries some learned hidden information, shown in blue. You pass the blue in together with the red input, and out comes mixed hidden information, blue and red. It goes on, mixing in more, until you get the final output, and that is what you're learning: you want to know how the different inputs relate so that you can predict the output later. In short, that's a recurrent neural net, and you're free to go. Oh no, that joke didn't work; it worked in class, where the students really did just leave.

Okay. Sequence to sequence is basically a recurrent neural net in a different form, but it's built to take a sequence as input and produce a sequence as output. Previously our recurrent neural nets ran along like this, with input and output mixed together. Then around 2014 people came along and said: why not feed the whole thing through as input, generate some sort of vector in between, and generate the output from that? It works well for anything that is a sequence: machine translation, text generation, even interpreting Python code, which is actually a very nice paper if anyone has read it. So that's what it does: you have the inputs as before, they go through an encoder, and the encoder compresses them into a single vector at the end here. After it sees this end symbol, everything has become one vector. And that vector becomes the start state of a decoder that tries to predict the output. Makes sense? Anyone lost at this point? No? Maybe this is too simple.

But before anything: data munging, because with no data you get nothing. We can't really put a value on data; I think it's priceless. Data scientists spend around 80% of their time just cleaning and managing data, and that's what I do too. The recipe is simple: first you take data from somewhere, then you throw it into pandas, SFrame, Dask, whatever, get the data into some shape, and repeat until it's done. Nice recipe. So I've uploaded a very small dataset that I compiled from a few websites, as you can see. We have the local terms, like the ones here, with translations like black coffee with extra condensed milk. The left column is our input and the right is our output, and that's what we're going to model. Cool. Am I going too fast? Am I going too slow?

I'm going to cheat because I have the full notebook here, just in case, and I'll try to talk as I go. You don't have the notebook to follow along, but I've left a lot of question marks in it that I'll try to fill in, and most likely you can help, because everyone here knows Python, right? Who doesn't know Python? You should apply to this program again after you learn Python. I'll hide this setup stuff. What we're using is the torch library; you can pip install it. Go to pytorch.org and the instructions will be staring you in the face.

Now, this line. Do you all see the code clearly? Is this better? So, this line is very, very irritating, because thanks to it you have to go through every variable and call .cuda() on it, which is tedious. Is there a way to not do this? Yes, there is: based on this use_cuda flag, you can set a variable to a type, either the CUDA float type or the plain float type, and then use that type everywhere.
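A minimal sketch of that trick; the names use_cuda, FloatTensor and LongTensor are just the common idiom, not necessarily what's in the notebook:

```python
import torch

# Decide once whether we are on GPU, then alias the tensor types,
# instead of sprinkling .cuda() over every variable.
use_cuda = torch.cuda.is_available()
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor

x = FloatTensor([1.0, 2.0, 3.0])  # lands on the GPU when one is available
```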
But if I want to change it to a tensor, do I have to create multiple type aliases? Yes; every tensor type you use gets its own alias. That's where TensorFlow is easier: you just set a flag and everything runs. Anyway, remember this, because it will pop up everywhere when you debug. Very irritating. These are just some conventions that most torch users follow; torch.nn.functional gets imported as F, and I don't know why they picked F, presumably just short for functional. That's what we'll use.

So that was the introduction, and these are the two tables we'll try to read. This laptop has no GPU. Yes, but it still runs, because of the number of rows in the data; yes, that matters. I cut off the last rows, but it's small enough to fit, and the vocabulary is small. So this is the task: we have an input, we want to map it to an output, and hopefully the model manages it. Spoiler alert: it cannot do the last one.

Let's go down. Oh, it's actually here. Is it okay if I sit down and type? We're using Gensim, which is a very nice library: you don't need to manage your own dictionaries, and you don't need to hand-roll bag-of-words; we'll see that later. The pandas data frames look like this; fairly self-explanatory. And here's the plan. Remember, in the recurrent neural net there's always a start and an end. From what we saw, we need a start symbol, and the encoder's input must end before the decoder begins, so we assign this symbol as the start and this symbol as the end, with index 0 and index 1.

The first thing is to massage the pandas data frames into what we want. So let's do something like this; please correct me if I'm wrong, because I'm typing from memory. The food column has capitals, and the simplest normalization is just to lowercase everything; there's a lot more you can do, and you can refer to my earlier meetup talk. So first we lowercase, and then word_tokenize, which is a function from NLTK; you could use any tokenizer, and in this case splitting on spaces would also work. The idiom is simple: you have a data frame and a column, and because str.lower is a function, you can apply it directly, like this. And then we apply the tokenizer, like this.

Now, there's a lot of code floating around that keeps going like this: apply word_tokenize, then, when you want to remove stop words against a stop word list, apply a lambda over the tokens again. Please don't do this. Apply is a very, very expensive operation, so try not to chain it again and again. The right way is to tokenize and remove the stop words together, in the same apply. Sorry, that didn't come out so well just now, but yes: tokenize and drop stop words in a single pass so you only pay for apply once.

And that's what we get: a list of tokens per row, the words we want, with the end symbol included. The problem is that while this is very nice, and you could show it to a researcher and they'd accept it and say it's very nice, torch will go kaboom, because torch likes numbers and doesn't like strings.
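A sketch of that one-apply idiom, assuming the column is called 'english' and NLTK's data (punkt, stopwords) is already downloaded; the names are mine:

```python
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords

stoplist = set(stopwords.words('english'))

df = pd.DataFrame({'english': ['Black Coffee with extra condensed milk']})

# One cheap vectorized lowercase, then ONE apply that tokenizes and filters
# stop words together, instead of chaining .apply() several times.
df['english'] = df['english'].str.lower()
df['tokens'] = df['english'].apply(
    lambda s: [w for w in word_tokenize(s) if w not in stoplist]
)
```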
So the first thing is to convert the tokens into numbers, and that's where the Gensim Dictionary comes in. This is a hack, actually. Look at this: if I just build a Dictionary from the English sentences, does anybody know how to print an entry? With the dictionary you can fetch words by index, so let's fetch index zero, and then index one. You see something is wrong. Anybody spot it? Yes: what I wanted was the start symbol at index 0 and the end symbol at index 1, but if you let the Dictionary decide, the order comes out however the hashing lands. One way to force it is to say: I have a document consisting of just the start symbol, a document of just the end symbol, and one with the unknown-word symbol. Unknown words don't occur in this exercise at all, but I'm putting it in for fun. That's how you hack it so the indices come out right, and afterwards you add in all the documents you actually want, which are the English sentences. The seeded indices still hold, and voila, we get the order we wanted. So that's one way to hack it.

Is this too boring? I think it's actually quite important. So we have English sentences like this, and now the fun part: we've just created the vocabulary, so we can feed in any English sentence and call doc2idx to get the IDs from the dictionary. So take this English sentence and do english_vocab.doc2idx... oh, who knows what's wrong here? Yes, I commented out the wrong line, sorry. But that's what you get. One note: doc2idx is a very new feature, so if you can't call it and you get an AttributeError, just do pip install -U gensim and you'll have it. We requested this feature; it's very helpful.

So now you can vectorize; every sentence becomes a vector. But this vector is a list of plain Python integers, which is not very helpful, because PyTorch wants everything to be a PyTorch data structure. So you need to wrap it into this Variable(question mark). Who knows how to fill this in? I actually don't, because I'm pretty sure mine is wrong; there's some LongTensor involved, because it wants a tensor, and then the last bit. Let me cheat. Anyone see what's wrong? This is a place where PyTorch is unintuitive: the tensor types live where you don't expect, and if you import all the types you'll just mess up your namespace. So remember this one by heart, or cheat like me.

Never mind. Is this still very boring, or interesting? Just checking. What we really want, remember, is pairs of sentences as inputs; that's what feeds the model. And if I index into the sentence pairs here, I get exactly that: an input and an output, each always starting with zero and always ending with one, because those are our start and end symbols, and in between it's some coffee thing; I can't read the indices, but I know what it is, it's always coffee.
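Here's that whole hack in one sketch: the seeded Dictionary, doc2idx, and the tensor wrapping; names are illustrative, and Variable is the old pre-0.4 PyTorch style the talk uses:

```python
import torch
from torch.autograd import Variable  # old-style PyTorch, as in the talk
from gensim.corpora import Dictionary

# Seed the Dictionary so <s> -> 0, </s> -> 1, UNK -> 2,
# then add the real tokenized sentences.
english_vocab = Dictionary([['<s>'], ['</s>'], ['UNK']])
english_vocab.add_documents(df['tokens'])

def vectorize(tokens, vocab):
    # doc2idx needs a recent gensim: pip install -U gensim
    ids = vocab.doc2idx(tokens, unknown_word_index=2)
    return Variable(torch.LongTensor(ids))
```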
And yes, the sequence-to-sequence model. The general idea is to take two neural nets and transform one into the other, and that's where the encoder and the decoder come in. A very simple encoder you can code, not the simplest, but a very simple one, is to take the input, put it through an embedding layer so it becomes 'embedded', and feed the output of the embedding into a GRU, which is an RNN. Just imagine the GRU as the RNN you saw, with some superpowers that help it remember. But notice this previous hidden: this is what makes an RNN an RNN, so it's very important. And what the RNN always produces is this output and this hidden.

So, question: where does the output go? Where does the output of the encoder go? Partly correct: for this case it actually goes nowhere, because this encoder is very simple. The output will be the encoding, but it does nothing inside the encoder yet; it just gets passed along. Then where does the hidden go? I sense an answer coming; yes, it's the hidden that really moves. Look at the slides: remember, it's the blue thing that moves, and blue is the hidden for now. So where does the hidden go? Anyone from the back? Nothing? We know this is the previous hidden, so this hidden loops back here into the next step, and that's why it's recurrent: it's looping all the time. It's loopy. Just remember this graph.

So we have the input. This is how you create a class for an architecture, or a network, in PyTorch: you always inherit from nn.Module, because networks and encoders are all modules. There are ways to do this differently, but remember: always use modules. And you always call super().__init__(), because there are other things that get initialized; I'll skip the details. What's really important is the input size and the hidden size: you need to decide how many nodes go here, and the 20 doesn't matter for now because we're only building the architecture. Moving on, we take the hidden size, record that this network has that hidden size, and then we have to create an embedding. Anyone know how to do this? It's straightforward: nn.Embedding takes two things, the input size and the hidden size. Then the GRU... I think I'm wrong here. Yes, that's where I was wrong: it has to take the hidden size, because the input size is the number of words in your vocabulary, and that only feeds the embedding; the hidden size is the number of nodes in the embedding, so that's what the GRU sees. In this case I'm making the embedding size and the GRU hidden size the same, but you could separate them out, hidden size 1 and hidden size 2. I'm being lazy and saying all embeddings and hidden layers share one size for now. Does that make sense? And an embedding always has a size; it cannot be size 0, so it must be at least 1. Similarly, remember that when we move forward, the output and the hidden just move through the GRU again; so I have self.gru, which is here, and it takes the hidden size for both. No questions here?
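Putting those pieces together, a sketch of the encoder; it follows the shape of the well-known PyTorch seq2seq tutorial that this notebook is based on, so treat it as one plausible version rather than the exact code:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        # input_size = vocabulary size; hidden_size = embedding width,
        # lazily reused as the GRU width too
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, word_input, prev_hidden):
        # view(1, 1, -1) reshapes to (seq_len=1, batch=1, hidden_size)
        embedded = self.embedding(word_input).view(1, 1, -1)
        output, hidden = self.gru(embedded, prev_hidden)
        return output, hidden  # hidden loops back into the next step

    def init_hidden(self):
        return Variable(torch.zeros(1, 1, self.hidden_size))
```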
Now, typically for NLP pipelines you take the input and map it through some pre-trained embedding, like word2vec or fastText. So what's the difference between pre-trained embeddings like those and an embedding layer like this? Yes: word2vec doesn't give you a layer at all, it gives you fixed vectors. There are a few ways to handle embeddings. One is to not train them at all and keep them fixed. But what I'm doing here is using the embedding layer as part of training, to learn a fresh set of embeddings for the inputs I have. If you wanted to use fastText or word2vec instead, you would remove this, declare all those weights fixed so they don't train, and load the pre-trained vectors in; I believe there's a pre-trained loading helper on the embedding module for that, and there's a sketch of it below. That's the place where you could swap this out for something else.

And then remember that we always start with some sort of hidden state; for now let's just create a one-by-one-by-something tensor of zeros. Who knows what the size is? It shouldn't be taking the embedding... yes, sorry, you're right, it's the hidden size.

Anyone know why? Let's go back again. It's very simple once you have this graph drawn: you ask a professor to draw the graph for you, and then, as an engineer, what you do is turn the graph into code. So that was initialization; when we move forward, we need the output of the embedding, which is here, and the previous hidden, which is here; sorry, I mixed that up with the initialization. So when you run the embedding, it goes through the embedding layer and then gets reshaped into a single step. This took me ten minutes and killed my brain: does anybody know what the minus one is? It's a flexible size; you let PyTorch infer that dimension, so you're not constrained by the embedding size and it will never blow up, it's just a reshape. Why the interface demands this, never mind; it's just strange.

The actual code is a little more complicated than the default tutorial, but I'm going to skip that and just point out the nonlinear unit we see in this slide, this F here. It's just a transformation; you can put in any transformation you like. You could even alternate transformations every other loop if you wanted, but most people keep it fixed.

Similarly, the decoder is the same kind of thing; remember, the decoder is just some sort of extension of the encoder, except in this case we have to predict an output. Just now the output was pretty much useless; here, the input goes into the embedding, through the transformation, into the GRU, which takes the previous hidden, which comes from here, and produces the output. But those are just numbers, right? What you need is a softmax: the softmax chooses the best fit. In machine translation you have an input vocabulary and an output vocabulary, and the output vocabulary is what you want to produce. The softmax takes the numbers coming out of the network and picks the best fit in the output vocabulary. Does that make sense? Most likely it won't fully make sense until you see the numbers, unless you've done this before; you'll get it later. And that's the really cool thing about PyTorch: you can just inject a print statement halfway through and look. You can't do that in TensorFlow unless you use eager mode, and whether that gets deployed everywhere at some point, I don't know.
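On the pre-trained embedding question from a moment ago: a minimal sketch of keeping embeddings fixed; from_pretrained arrived in later PyTorch versions than the talk uses, so this is one way to do it, not what the notebook does:

```python
import torch
import torch.nn as nn

pretrained = torch.randn(100, 10)  # stand-in for real word2vec/fastText vectors

# Option 1: newer helper (PyTorch >= 0.4); freeze=True keeps weights untrained
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Option 2: copy the weights in and switch off the gradient yourself
embedding = nn.Embedding(100, 10)
embedding.weight.data.copy_(pretrained)
embedding.weight.requires_grad = False
```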
So, same thing, right? It's very simple: you get this graph from a professor and you just code it. But this part is kind of strange: for the encoder, the input dimension was the input size, our vocabulary, alongside the hidden size; for the decoder, the input is actually the hidden size, because it comes from the encoder, and the output is going to be your output size. So you kind of swap it around: the hidden size here is your input, and then comes your output size. The GRU does the same thing as before; oh, and there's a ReLU later. These are all just initializations. So you have the ReLU and the GRU. At any point, if you find me typing the code too boring, shout out and I'll just copy and paste.

And this is the part I always forget. The softmax gives you scores, and the linear layer produces the output you want, but I forget what its input is; most likely it involves the size of the output. These are all initializations, so it's all sizes; nothing to do with the movement of actual inputs and outputs yet. It will blow up later if it's wrong, and then I'll copy and paste.

So, the graph again: we have the embedding going into a ReLU and then into the GRU. And here's the thing: we have embeddings with this strange 1, 1, minus 1 view again. Who knows why this happens? Actually, there's no good reason you should have to do this; it's just PyTorch's interface, and you can look at the discussion at the link here. Just remember it. This could have been a single dimension, since 1 times 1 times minus 1 is just minus 1, whatever size you need. There is a reason, something about sequence length and batching; I think the interface could be a lot simpler, but just remember it.

The ReLU is very simple; it's just an F. Every time you use a nonlinear function, call it from F. We could change this ReLU to a sigmoid, but let's stick with ReLU. After the ReLU, it goes into the GRU, so just type self.gru, taking the ReLU output and the previous hidden, and then it goes through the softmax. Now this part: who knows how to type this? You're initializing the hidden state, the hidden state at the start. How do you type it? Anyone? It's a one-line answer: don't type it, copy and paste, because it's exactly the same as the encoder's. Encoder and decoder are just an imaginary boundary that you draw; to the RNN, to the network itself, it doesn't really matter. You could initialize them differently for the encoder and the decoder, but the simplest is to initialize them the same way so it's easier to debug.

So, here's the cool part: when you train a model, you don't need to wait until the end to print something. Let me go through this. Remember, you have to decide the hidden size for every kind of layer; for now I'm fixing all the hidden sizes to the same value. There's a learning rate for the optimizers; anyone who doesn't know what an optimizer and a learning rate are, just double-check later. Good. Then, sorry, the epochs are how many times you want to see the data again and again, and the batch size is how many data points you put through at a time. The criterion is what you're optimizing on; for now we're going with the negative log likelihood loss. And the maximum length is a hack, actually: when you decode, you could keep going forever, right? Look at the slides again; I keep going back to these slides because they're good. It goes dot dot dot dot. How do you know when to stop? Anyone? Yes: the /s symbol. That's very important. The start symbol actually has no meaning and you can drop it; just don't tell your boss, because your boss will ask why you don't understand your own input and output, even though it works.
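Since a lot just went by, here's the decoder from earlier consolidated into one sketch; again it mirrors the classic tutorial rather than the exact notebook:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        # dims swap relative to the encoder: hidden_size in, output vocab out
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, word_input, prev_hidden):
        embedded = self.embedding(word_input).view(1, 1, -1)
        embedded = F.relu(embedded)                       # the nonlinear F
        output, hidden = self.gru(embedded, prev_hidden)
        # softmax over the output vocabulary chooses the best-fitting word
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden

    def init_hidden(self):
        # copy-pasted from the encoder; the boundary is imaginary
        return Variable(torch.zeros(1, 1, self.hidden_size))
```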
Anyway, back to the symbols: just put the start symbol there. The /s end symbol itself is more important than the start, and that's also why you set the max length: so you keep your sanity, and you can watch for the /s. After some experiments you realize the max length can actually come down further, because at some point the decoder just keeps producing 'this is the end, this is the end' and so on and so forth.

So what we have is the same thing, an input vocabulary and an output vocabulary; I'm naming them so it's easier to see. Remember, the encoder takes the length of the input vocabulary, which is the input size, plus the hidden size. The decoder takes the reverse: the hidden size as its input, which is really the size coming out of the encoder, and then the length of the output vocabulary as its output. And that's where this part comes in, where you need to call .cuda() all over again, although it isn't necessary if you set up the types properly earlier.

Optimizers: you can change this. Optimizers always come from the optim module; there are several, and SGD is the most common. What you do is take the network: every network has parameters. But we never declared any parameters, so where did they come from? Answers? The parameters actually come from this line, super().__init__(). It creates many, many things you don't see, and it naturally registers the parameters, because that's what we're training. Remember, the parameters are the colorful weights we saw just now, in the hidden states. And then the learning rate goes to SGD.

Then we create sample data; in this case I'm just randomly picking batches of sentences from the bag of sentence pairs I have, one batch per step. And... I have a TypeError. Yes, it's wrong, I knew it: it's the linear layer. Okay, learning experience: every time you see this .mm error, most likely a linear layer got the wrong sizes. So go back to the decoder: this size is not right. What should it actually be? Yeah, here: the linear layer takes the hidden size as well as the output size. I wasn't sure why it takes the hidden size; let me just leave it like this first. It shouldn't blow up this time. It works.

Let's pause the line here and, for sanity, understand the training data. The first index is a batch; this is the first batch, and then the second sentence, or the first sentence. Each element is a pair of sentences, which is our something-something kopi. And this is strange, there's a 25 here; never mind, I don't know why it's there.

So when we do the training, we're just extending what we have here. First we set the hyperparameters. Then we start training and iterate through the batches. For every batch, we reset the optimizers, because the gradients must go back to zero, and the log loss starts at zero too, because that's what we're accumulating; we add to it later. Then we go through each item in the batch, which is a start sentence and an end sentence; they're wrapped in Variables, because that's what PyTorch calls them. The encoder input is the input variable. Now, what should we do at the start of the encoder? Questions? Partly correct: remember this line here, the one that initializes the hidden states, the one we keep forgetting.
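Pulling the setup from this section into one sketch; the vocabulary objects and hyperparameter values are assumptions on my part:

```python
import torch.nn as nn
import torch.optim as optim

hidden_size = 10
learning_rate = 0.01

# Encoder: input-vocab length in. Decoder: the reverse, output-vocab length out.
encoder = EncoderRNN(len(english_vocab.token2id), hidden_size)
decoder = DecoderRNN(hidden_size, len(kopi_vocab.token2id))

# SGD is the most common choice; .parameters() was registered by nn.Module
encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()  # the negative log likelihood criterion
```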
So, back to the hidden state: you could actually initialize it randomly, but for now we're initializing it to zero. So we call the encoder's init_hidden, and it doesn't blow up. Then we iterate through each state. At this point it's just initialization, nothing is going on; we're setting up all the input states and output states. So here's a question: why did I already initialize in the class, and now I have to initialize again? Actually, in the class I initialized nothing: no real numbers went in. This is the real part, after the class, where you initialize and put the numbers in. And this is the really cool part about PyTorch: I figured, can I just print it and see? We loop through, and we see that this is the last variable, the last variable with its size, and the size is four. Then we put the variables in, and now it's really initialized. So the encoder outputs, and everything else in the class itself, held empty values; it's only here, when you start training and plugging in variables, that they fill up.

At the end we have this again. Remember, we go through each state: we start with zero, and after the first state we have the first word itself, and we put it into the encoder. How do we type this line? I'll just type it. Remember our graph: the input is the input variable, which is just the first word that we read from the data, and the encoder hidden is the one we initialized here. It runs, and we iterate through. I'll share the full notebook anyway.

So the cool part is here. As we move along the states, we don't really know what's inside this encoder output and this encoder hidden; what are they, really? We can just look. Once we've reached the last sentence, we can inspect everything. Printing the encoder shows what's inside: the encoder is an embedding of the input size by the hidden size of 10, and the GRU just has the hidden and the previous hidden, right? If I keep printing, the data, which is here, the final sentence, the first word, the 'kopi' part, we see this. And moving on, here is what we're really interested in: we don't actually know what the encoder output is, right? From the graph, we only see that the encoder feeds its output into the hidden and keeps looping by itself. But look: this sentence is just five words, and if we look at the encoder outputs, there are exactly five filled rows, right? See that? Each row represents a state as we move along, and all those zeros below are padding. The size is 20 by 10, because we set a maximum length of 20, and 10 is the size of the embedding and of the hidden state we chose. So that's the nice part of PyTorch: as you go, you can debug by looking.

You can also check what's in the hidden state, which is very interesting. The hidden state itself is this; let's go back here, 0.6154 and so on. Remember, the hidden state gets passed to the next state, which means this hidden state is actually the previous state of the step after it, right? It's kind of mind-bending, but you can check that the hidden state here is exactly the state that existed one step before. It's exactly the same.
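The stepping loop, sketched with the tutorial's variable names (input_variable, MAX_LENGTH and friends are assumptions):

```python
import torch
from torch.autograd import Variable

MAX_LENGTH = 20
encoder_hidden = encoder.init_hidden()
# one row per input word, up to MAX_LENGTH; unfilled rows stay zero
encoder_outputs = Variable(torch.zeros(MAX_LENGTH, encoder.hidden_size))

for ei in range(input_variable.size(0)):
    # feed one word plus the previous hidden; keep the output row for later
    encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
    encoder_outputs[ei] = encoder_output[0][0]
```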
So RNNs, we've settled them, but that was just the encoder. If we move on to the decoder part, it's the same thing from the other side. After we've printed things out, after we've encoded, we move on to the decoder itself. Same thing here: you need to initialize a Variable, with that same strange wrapping, and initialize it to the start symbol; so it must be the start index, not the encoder output. Actually, a confession: this part of the code comes from the PyTorch tutorial. You initialize it with the start state because you always created a start symbol, but strictly this isn't needed; you could just as well initialize from the hidden state output here. Let's just follow along and play with it, though, because remember, since we added the start symbol to every sentence, we should start with the start.

Let me just go through this; it'll be easier. The start of the decoder's hidden state is actually the end of the encoder's hidden state. Then, as we decode, we get input after input, and you see all this strange syntax: why are we doing this with the data and then picking the top k? The top k picks the best candidate, from the softmax at the end. If we finish the loop and look at the output, we see that the last word maps to 106 columns, and 106 is the vocabulary size of the output. So it's mapping the decoder output onto the words themselves, and what you see here is, for each index, a score for how much the output belongs to it. This topk(1) picks the best, and that's why you see the decoder's best word, which is actually wrong. Of course it is: I didn't train anything. Remember how many batches we trained on: a batch size of 2 over 30 iterations, so about 60 examples. Still, being able to poke at this is the coolest part of PyTorch.

Anyway, let's finish this. After you pick the best output, you update the loss, and when you see the end symbol, you break; you could let it keep producing, but just break. Then you propagate the loss backwards. This loss here is the loss within the batch, and before the batch your loss was zero. You backpropagate, and then the encoder optimizer and the decoder optimizer each step to the next step. And that's all: that's sequence to sequence.

So let's put everything together in a function, with some nice helpers; basically I'm printing how many seconds and minutes things take, and showing a very nice graph at the end. This is exactly the same thing you've just seen, but handier, printing out the loss as it goes. We set the hyperparameters again, we reset, and we start training. Remember, the input size is the vocabulary length of the input, and the output is the vocabulary length of the output; exactly what we saw just now, wrapped in a function so you can see it clearly. And off it goes. Yes, this is the most satisfying part of your job if you do AI, because this is coffee. But let me explain what it's going through.
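And the decode-and-update half of that loop, sketched in the same style; SOS_index and EOS_index are my names for the 0 and 1 symbols:

```python
import torch
from torch.autograd import Variable

SOS_index, EOS_index = 0, 1

decoder_input = Variable(torch.LongTensor([[SOS_index]]))  # the start symbol
decoder_hidden = encoder_hidden  # decoder starts where the encoder ended

loss = 0
for di in range(target_variable.size(0)):
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
    loss += criterion(decoder_output, target_variable[di])
    topv, topi = decoder_output.data.topk(1)  # pick the best-scoring word
    decoder_input = Variable(torch.LongTensor([[topi[0][0]]]))
    if topi[0][0] == EOS_index:               # stop at the end symbol
        break

loss.backward()           # propagate the batch loss backwards
encoder_optimizer.step()  # and both optimizers step
decoder_optimizer.step()
```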
This number here is the criterion we're training on, which is the log loss; I don't know why it shows up as positive, but anyway, this is the loss, and the lower it goes, the better the system is doing. This percentage is only possible because I fixed the number of iterations up front, so many epochs and so many data points per batch, so progress is easy to calculate. Normally when you train, you want to train until your GPU breaks; it goes on and on, you get some coffee, and it finishes eventually. At some point you see the loss getting lower, which is good. Wait, the graph is missing; there should be a graph somewhere. Never mind, I'm not going to retrain.

The next thing you really need to do after training is put in code to save the model. This is actually very cool: these are called f-strings in Python; you can put the variable right inside the curly brackets, and then you just save, and it saves.

Now, getting the model to translate. This is the part where, if it doesn't work, you'd say I'm some, I don't know, crook telling you something that doesn't work; it just shows it's not working. So let's go through this, because it's the last step, the decoding part; I should call it translate. It takes in the model that you trained and the input that you want to translate, and the next line is just for sanity again. I'm going to use the same pattern: build the input variable and get its size, the same as before; initialize the state to zero for each sentence, like we did previously, and move on; the variable itself is built exactly how we initialized it for our encoder. Then here's the part where you loop through each word in the input variable. In Python this is bad style, we shouldn't loop by index like this, but yes, it's just looping through each word. As usual, our encoder takes two inputs, the input variable and the hidden. Then, look again: we store the encoder output, and what is this doing, encoder_outputs[ei]? It overwrites, yeah, same as before: take the encoder output and put it back, row by row, into the encoder outputs. Actually this whole thing is not used at all, which was my point from the graph earlier: this simple encoder's output has no use, so if I delete this, it should not affect anything.

Then the decoder input, same thing: we start with the start index, put it in, and go to the decoder; remember, the first hidden state of the decoder is the end state of the encoder. And we want to keep a list of the output words. Then, the same as when we stepped through training, the decoder takes its input, we call topk, and that's where you get the numbers out. But here's the special part: for the next input, we don't want to pass along the fuzzy hidden output from just now; you force the input of the next word to be the choice that comes out of the softmax. This gives you a clearer path: every hidden state is kind of fuzzy, you don't know quite where it is, but you are sure about the input of the next state, because it's what you are confirming as the chosen output word. In a sense this is like teacher forcing, but not really, because nothing is teaching: we're just forcing the decoder to commit to an output as fast as possible, because in this case we know that the beam gets smaller and smaller as we go on, as everything in the encoder output becomes