So far, all of the recurrent neural nets and LSTMs that we have looked at go left to right. They operate over time, predicting the future given the past. This makes sense because that's how time works; we typically want to predict the future given the past. If we have a sequence of measurements from our network, we want to know what the utilization will be tomorrow based on today, yesterday, and the day before, or the next millisecond based on the last millisecond and the one before it, whatever the time frame. So most recurrent neural nets are used to predict the future based on the past.

However, there are a number of tasks that do not have that property. Take translation: you have a full wiki page in English and you wish to translate it into Chinese. Rather than going one word at a time, translating the words as you go along, it's easier to translate accurately if you can look ahead, if you can see the words that come after the word you're translating as well as the words that come before it. This is particularly useful because different languages often have different word orders. Some are subject-verb-object, some are verb-subject-object, so they may order things differently, and it's tricky to translate simultaneously. In something like German, the verb might be dumped way at the end of the sentence, and if you're going from German to English you don't know what verb to put in the English until you've actually seen the verb at the end of the German sentence.

Bi-LSTMs are a technique used to handle this. It's a simple hack, and it's rather quite beautiful. Here's the idea. We have the sentence that we want to translate, "I went to ... yesterday," and so forth, and we run a standard neural net, say an LSTM, over it, predicting the first hidden state, the second, the third, until we get to the end and we have h_t. We also take a second LSTM and run it over the exact same sequence in reverse, so the first word for the second neural net is the last word. Starting with the end-of-sentence marker, it predicts the words sequentially, like a language model, until it eventually sees the first word and then the beginning-of-sentence marker and is done. So we have now built two sets of hidden states: a forward LSTM, a language model like we've seen before, and a network that is identical in structure but has totally different, uncoupled parameters and predicts the words in exact reverse order.

Now, when we want to predict a label for the outputs, what do we do? We just take each pair of hidden states and concatenate them. We've doubled the size of the hidden state, and to predict a label, say whether a word is a noun or a verb or a pronoun, we use as input the concatenation of the two hidden states at that position. This allows us to see both the past and the future. If I want to predict a label on x_3, I know the hidden state h_3 from the past, and I know the hidden state h_3 estimated from the future by the entirely separate network. That's it. That's a bi-LSTM. The key thing to ask is: are you only using the past? Often that is all you have. If you're doing live translation of speech to text, you don't know what sounds someone will make in the future; at most you might pause for a few milliseconds to hear what's coming.
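To make the mechanics concrete, here is a rough sketch of that idea, assuming PyTorch. The class name, the sizes, and the three-way label set are made up for illustration; note that PyTorch's built-in nn.LSTM(bidirectional=True) packages the same trick in a single module.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal bi-LSTM tagger sketch: two separate LSTMs with uncoupled
    parameters, one left-to-right and one right-to-left; their per-position
    hidden states are concatenated and fed to a label classifier
    (e.g. noun / verb / pronoun)."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Two independent LSTMs; they share nothing but the input embeddings.
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Concatenation doubles the hidden size seen by the classifier.
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids)
        h_fwd, _ = self.fwd_lstm(x)                  # states built from the past
        # Reverse in time, run the second LSTM, then reverse back so the
        # backward state at position t lines up with the forward state at t.
        h_bwd, _ = self.bwd_lstm(torch.flip(x, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])          # states built from the future
        h = torch.cat([h_fwd, h_bwd], dim=-1)        # (batch, seq_len, 2*hidden)
        return self.classifier(h)                    # one label score per token

# Tiny usage example with made-up sizes.
tagger = BiLSTMTagger(vocab_size=1000, emb_dim=32, hidden_dim=64, num_labels=3)
tokens = torch.randint(0, 1000, (2, 7))              # 2 sentences, 7 tokens each
print(tagger(tokens).shape)                          # torch.Size([2, 7, 3])
```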
But unlike live speech, in written sentences you often have the future already, say from your web pages. One example of where this works and has been used is the task of saying how similar two phrases or two sentences are. Do these two sentences mean the same thing? Are they different ways of saying the same thing? "I want to eat dinner now" and "I'm hungry and I'd like dinner" are roughly a paraphrase, sort of the same thing. Or, in a rather different kind of case, one could take an image, embed it, and try to match it to a caption, or take a caption and match it to an image. If you have a caption, you can run over it forward and backward, take the concatenation, and see which image best fits that caption. And somewhat surprisingly, something that uses a bi-LSTM is often competitive with fancy models like BERT, which we will see next week. Part of the reason I'm covering these is that we will build up to BERT from sequence-to-sequence models and the things we're doing this week, and partly because each year the winning algorithm seems to change: transformers and BERT may be winning this year, but next year it may be back to something that looks more like a CNN or an LSTM.

So what does the architecture look like for InferSent? We take in a sentence, "Emily buys a billion cookies," and embed the words. They use GloVe, basically another flavor of word embedding like word2vec, a 300-dimensional embedding for each word. Then they run a forward LSTM over the sequence of embeddings; each hidden state here has dimension 2048. They also run a backward LSTM, same architecture, but fed "cookies" first, then "billion" given "cookies" in the state, and so forth, and they get another sequence of hidden states of exactly the same size. They then concatenate each of these pairs of hidden states, boom, boom, boom, so now we have something of size 4096, a hidden state for each of the tokens along here.

Then they need to combine all of these, and there's a different number of them depending on the sentence. You could, if you wanted, take them and average them to get an overall embedding of the sentence. But in fact what they do is max pooling, the same idea that we've seen in convolutional neural nets: they take the maximum of each feature across all the positions. So they get a total embedding of size 4096, the element-wise maximum over the hidden states, and they've now taken any sentence and embedded it.

They then take this embedding of the sentence and feed it into a simple model that predicts, well, whatever we want. We want, for example, to say how similar this sentence is to some other sentence. And what we want is to embed this sentence and the other sentence in such a way that for sentences labeled as similar, the embeddings are as close as possible, and for sentences labeled as different, the embeddings are farther apart. So you can construct a loss function: similar sentences should have similar embeddings, different sentences should have different embeddings, and you optimize that loss function by, you've got it, gradient descent. Then any new sentence can be passed through this pre-trained encoder: find the local hidden states, concatenate, extract one embedding for the sentence, and use that for any task that you want. And it works remarkably well.

So, summary: bi-LSTMs are often better and train faster than simple LSTMs. Sometimes they train more slowly, who knows.
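Here is a rough sketch of that InferSent-style recipe, again assuming PyTorch. The 300-dimensional embeddings, the 2048-unit directions, and the 4096-dimensional max-pooled sentence vector follow the numbers above; the random embedding table stands in for the GloVe lookup, and the contrastive loss is just one simple way of making similar sentences close and different sentences far, not necessarily the exact objective used in the original work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    """InferSent-style sentence encoder sketch: word embeddings (GloVe in the
    lecture, a random table here as a stand-in), a bi-LSTM, and max pooling
    over time to get one fixed-size vector per sentence."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)                     # (batch, seq_len, 2*hidden_dim)
        # Max pooling: take the max of each feature across all positions,
        # same idea as in convolutional nets; 4096-d with the lecture's sizes.
        sent_emb, _ = h.max(dim=1)                # (batch, 2*hidden_dim)
        return sent_emb

def similarity_loss(emb_a, emb_b, label, margin=0.5):
    """Toy contrastive loss: label 1 means 'similar' (pull the embeddings
    together), label 0 means 'different' (push them apart past a margin)."""
    cos = F.cosine_similarity(emb_a, emb_b)
    return (label * (1 - cos) + (1 - label) * torch.clamp(cos - margin, min=0)).mean()

# Demo with small sizes so it runs quickly; the sizes above are 300 / 2048.
encoder = SentenceEncoder(vocab_size=1000, emb_dim=32, hidden_dim=64)
emb_a = encoder(torch.randint(0, 1000, (4, 9)))   # batch of 4 sentences, 9 tokens
emb_b = encoder(torch.randint(0, 1000, (4, 9)))
loss = similarity_loss(emb_a, emb_b, torch.tensor([1.0, 0.0, 1.0, 0.0]))
loss.backward()                                   # then take a gradient descent step
```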
It certainly is the case that you can train better when the relevant information is closer, and note that the information could be in future words or in past words. That's the magic of bi-LSTMs. Obviously, this is good for something like translation or paraphrase detection, judging how similar two things are when you know the future. Obviously, it doesn't work for time series where you want to predict the future and you haven't seen the future yet. But when you do know the future already, say when translating a web page, these work really nicely.