The third basic architecture for input-output is my favorite, sequence-to-sequence. We'll use these a lot next week. The idea in sequence-to-sequence is that you take in a whole sequence of words of arbitrary length: the quick brown fox jumped over the, etc. This goes into an encoder network, a network that looks very much like the ones we've seen before. At the end of this network there is, as usual, an output hidden state, h of t. We take that hidden state and input it to a second, entirely separate network, a decoder, and the decoder, starting from that input, produces a whole sequence of words: for example, a translation, "le renard brun rapide saute par-dessus le…". So: input sequence, output sequence.

Redrawing this in the style of the pictures we've been looking at: we have a standard language model that, given each word, predicts the next word, and at the end of it there is some hidden state, h of t. Now we have an entirely separate network, B, the decoder. The decoder takes in that hidden state and first produces one word, like "ich". It then takes that word, together with the hidden state, into a new copy of the network and produces the next word, "ging". It takes that and predicts the next word, and so on. At each point the network is being trained to predict one more word, and it does that until it hits an end-of-sentence token: a special token saying stop, we're done.

This can now be trained end to end, which is to say that given a set of pairs of English sentences and their German translations, we do standard gradient descent simultaneously on the parameters of the B model, the decoder, and the parameters of the A model, the encoder, two entirely separate networks, so that together they minimize the softmax loss across the output words. This is super cool, because we can use it for lots of cases where we have paired sequences: a question and an answer, a sentence and its translation.

Now, the training is a little tricky, so people usually use a variety of improvements on this, most of which we'll cover next week. But one simple one that I like is called teacher forcing. Suppose we're encoding the sentence "they are watching . <end-of-sentence>". We take the hidden state output at the end and feed it into a separate decoder (on the slide, the encoder A is in blue and the decoder B in white), which takes in a beginning-of-sentence token together with that hidden state and produces its first output. The input at the next step could be either the word that was just generated, whatever it was, or it could be the correct word: not what was predicted, but the actual word from the correct translation. A copy of the same network then takes the hidden state plus that word and learns to predict the next one, "regardent" (in spoken French the -ent ending is silent). If that prediction is correct, great. If it's not, we can still use the teacher to say: this is the word it should have been; put that in and predict the period, and then the end-of-sentence token, because after the period it may be end-of-sentence, or it may be something else.
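To make the moving parts concrete, here is a minimal PyTorch sketch, my own illustration rather than code from the lecture: a GRU encoder, a decoder that emits one word per step, and a training step that uses teacher forcing, feeding in the correct previous target word and summing the softmax (cross-entropy) loss across positions. The class names, dimensions, and the choice of GRU cells are all assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the whole source sentence and returns its final hidden state."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len) of word ids
        _, h = self.rnn(self.embed(src))
        return h                               # h: (1, batch, hid_dim), the "h of t"

class Decoder(nn.Module):
    """Emits one word at a time, conditioned on the previous word and state."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, word, h):                # word: (batch, 1), one step at a time
        out, h = self.rnn(self.embed(word), h)
        return self.out(out), h                # logits over the vocabulary, new state

def train_step(encoder, decoder, optimizer, loss_fn, src, tgt):
    """One end-to-end gradient step with teacher forcing.

    tgt is assumed to be framed as [BOS, w1, w2, ..., ".", EOS].
    """
    optimizer.zero_grad()
    h = encoder(src)                           # encode the whole source sentence
    loss = 0.0
    for t in range(tgt.size(1) - 1):
        inp = tgt[:, t:t + 1]                  # teacher forcing: the CORRECT previous word
        logits, h = decoder(inp, h)
        loss = loss + loss_fn(logits.squeeze(1), tgt[:, t + 1])
    loss.backward()                            # gradients flow into decoder AND encoder
    optimizer.step()
    return loss.item()
```

Training end to end just means one optimizer covers both networks at once, for example torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters())) with loss_fn = nn.CrossEntropyLoss().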
So this is the exact same architecture as before, encoder and decoder, but one that at training time, rather than feeding in the hidden state plus whatever its last best guess was, is given the correct word that should have been there. At run time, when it's actually making a prediction, it of course doesn't have the true words "ils regardent"; it just feeds in whatever it predicted before as its input. And this vastly simplifies the gradient descent, the search to find the right weights. So, sequence to sequence: give it a try.
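To contrast with the teacher-forced training step above, here is the matching run-time loop, continuing the same hypothetical sketch: no correct words are available, so each step feeds the model's own previous prediction back in, stopping at the end-of-sentence token.

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, src, bos_id, eos_id, max_len=50):
    """Run-time decoding: feed back the model's own best guess at each step."""
    h = encoder(src)                           # uses the Encoder/Decoder sketched earlier
    word = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
    out_words = []
    for _ in range(max_len):
        logits, h = decoder(word, h)
        word = logits.argmax(dim=-1)           # previous prediction becomes next input
        out_words.append(word)
        if (word == eos_id).all():             # the special "stop, we're done" token
            break
    return torch.cat(out_words, dim=1)         # (batch, generated_len) of word ids
```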