Before we cover recurrent neural nets, which we will do in full detail, I need to take an important and slightly lengthy digression on embeddings, because many of the recurrent neural network models, and basically all of the natural language processing models, use some sort of embedding: a way of mapping words or other objects to vectors such that similar words are close. Now, there are lots of objects we might want to map to a vector so we can use it as an input to a standard neural net or to a recurrent neural net. And the idea of an embedding is that objects that are similar should be close in the vector space. They should have a small distance between them, for some notion of similar. That will allow things to generalize.
Now, lots of things can be embedded. You might think that images don't need to be embedded because images are already vectors, but that's wrong. Two images may be quite different in pixel space, their original input to a neural net. But if you take the output of the penultimate layer, the second-to-last layer, the last set of activations of the neurons before they go to the softmax, then what you could find is that those activations are in fact more similar for two images that are both elephants or both plants than the raw pixels are. So often it's useful to embed even images, which are already vectors.
But certainly if you have words or sentences or documents, sets of words, really everybody now embeds those discrete objects into vector spaces such that two words that are similar (we'll define this more precisely later) are close in the vector space. You can also embed products, iPhones, Androids, people. Any discrete object can be embedded in a vector space such that objects that are similar are close, and this provides a really good way to do inputs to a neural net, as opposed to actually trying to capture the object itself.
It also means you can often take objects you've never seen before, find their embedding, and use those. We'll see examples of that with words.
Cool. So this week we're going to focus on the simplest of word embeddings, word2vec, which is very similar to another method called GloVe. These methods map each word to a vector. Typically we'll use, say, a million different words. Does English have a million words? It does, actually, if you include numbers and names and places and all these things. We map each word to a 300-dimensional vector, an embedding, such that words that tend to show up in the same contexts, which are words that we'll define as being distributionally similar, will be close to each other in the embedding space.
Now you might ask, is there enough room to take a million words and put them into 300 dimensions? And the answer is, yeah, 300 dimensions is really big. How big is a 300-dimensional vector? Well, if you only put ones and minus ones in it, or ones and zeros, you could already represent two to the 300th possible combinations, way bigger than a million. So there's plenty of room to put all the words you could ever see inside a 300-dimensional space. We will cover a bunch of different, more sophisticated word embeddings next week, but I want now to lead up to word2vec, which is what we'll use in the exercises in this week's notebook.
Now the key concept behind embeddings is distributional similarity. And the idea is that words that occur in similar contexts are similar. So if you look at the sentences "he ate the sandwich" and "he ate the shridlu", and I'm assuming you haven't seen a shridlu recently, what do we think a shridlu is? Is it a person? Is it a book? Is it a computer? Probably a food, right? Something you can eat. How did you know that? Because of the context. Similarly, if I had "shridla ate the sandwich", what is shridla? Probably an animate being. Maybe a person, close to "he".
Maybe a shridla is a small animal like a dog that eats whatever it can get. But it's certainly something like that. Without understanding the word "ate" or "sandwich", we can just take statistics on what contexts shridlu shows up in. After "ate the", it's something that's edible. It will look similar to other things like meatloaf or noodles or dosas that show up after "ate the". So note that words which are distributionally similar tend to be similar not just in terms of part of speech, like being a noun, but also in terms of being something that can be eaten or something that can eat. And we will use this over and over this week and next week.
Now, back in the bad old days before deep learning, people treated each word as a unique symbol. And if one were to do that for a neural net, a word like hotel is going to be a million-dimensional vector. I didn't show a million dimensions. Each dimension asks: is it the word "a"? Is it the word "apple"? Is it the word "aardvark"? And so forth. And eventually it's the word "hotel" and it's not the other 999,999 words. It's a single one and a bunch of zeros. Similarly, another word like motel is a different one and an equally large number of zeros. And if I ask how similar hotel and motel are, well, all pairs of words in a one-hot encoding are equally similar or dissimilar. But that means there's no generalization. It means there's no way to easily learn from them.
So we want, first of all, something that's lower dimensional, not a million dimensional, but say 300 dimensional. Or next week we'll do things that are 768 dimensional, but still low dimensional compared to the number of words. And we want something such that we can actually compute similarity, so that if we learn to recognize hotel, we can learn that motel is similar, if it is.
Now the classic method for doing vector embeddings, back from my childhood, is latent semantic analysis, LSA. And this takes each word and looks at the documents it shows up in. So we take a whole bunch of documents.
They could be web pages or text messages. And we're going to build a big word-by-document matrix: for every word and every document, is the word in it or not? We will run singular value decomposition on this or, if you're more familiar with PCA, take this matrix times its transpose and run PCA. And we will then get a reduced-dimension representation, from the principal components or the singular value decomposition, that maps each of my million words to, say, a 300-dimensional vector, a k-dimensional vector, such that words that tend to show up in the same documents will be close in that embedding. This will make all related words close. Words like doctor, hospital, nurse, cancer will tend to show up in the same documents. They are related, they show up together, and so they will have similar embeddings.
So let's look at that. Here's a trivial LSA example with four documents. The first document is "I ate ham and cheese". The second document is "you ate cheese and crackers", and so forth. You see a matrix here. Each row is a word, one for every single word in my vocabulary. Each column is a document, one for every document in my collection, my corpus, to use the fancy term. You can see the word "ate" shows up in documents number one and two, but not in documents three and four. If I now take this matrix and do singular value decomposition, what I will find is that "ate" and "cheese" will tend to load on the same principal components, so they will be close to each other, with similar reduced-dimension vectors, and "hospital" will be close to "doctor" and "nurse" and other words like "operation" that show up with them. And we'll see that related words have similar embeddings.
Now most people don't use LSA anymore. More modern is to use something like word2vec or GloVe, which, instead of using the documents a word appears in, takes the context of a word to be the words right before it, to the left, or right after it, to the right. And the early versions used, say, five words before the token and five words after.
Modern ones use hundreds of words on both sides. Large contexts, but that's next week. So what we're going to do now is the same idea as LSA. We will take a word-by-context matrix, which I'll show you on the next slide, run singular value decomposition, and we will then map each word to a low-dimensional, say 300-dimensional, embedding such that similar words are close. With LSA it was related words: doctor, nurse, hospital, cancer. With word2vec, it will be similar words: doctor, nurse.
So what do I mean by similar? If you look at "I ate ham" and "you ate cheese", ham and cheese are distributionally similar. They show up in the same immediate context. "I" and "you" are very distributionally similar. They show up in the same sorts of contexts, before the same sorts of words. And here is what eigenwords, which is the way I'm doing it formally here, or word2vec, which we'll cover in detail next week, do more generally. It takes every word, like "ate", and asks: which words showed up before it? "I" showed up before it once. "You" showed up before it twice. Which words showed up after it? "Cheese" showed up once after it. "Ham" showed up after it once. These other words didn't. It does this for each of these words, and each word is now a big vector of the contexts it's been seen in.
And now if we do a singular value decomposition, then words like cheese and ham that look similar in this big space will be embedded to similar vectors, and similarly words like "I" and "you" that look distributionally similar in this big space will be embedded to similar vectors. We can then take all million words in our English vocabulary, take billions of words of English text, Wikipedia, or the web if you're Google. And now we can just take, for every word, all the tokens it's been seen around, all the contexts it's in, and run a big SVD, which can be run very quickly. And we can then get a vector that represents every word.
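As a rough illustration of the word-by-context idea just described, here is a minimal NumPy sketch. The three-sentence corpus, the function names (`context_matrix`, `embed`, `cosine`), and the choice to look only one word to the left and one word to the right are all assumptions made for this example. It is a plain truncated SVD of raw counts, not the actual eigenwords or word2vec procedure.

```python
import numpy as np

# Hypothetical toy corpus, loosely modeled on the "I ate ham" example.
corpus = ["i ate ham", "you ate cheese", "you ate crackers"]


def context_matrix(sentences):
    """Count, for each word, the word immediately before and immediately after it."""
    tokens = [s.split() for s in sentences]
    vocab = sorted({w for sent in tokens for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    # Columns 0..v-1 count the left neighbor; columns v..2v-1 count the right neighbor.
    counts = np.zeros((v, 2 * v))
    for sent in tokens:
        for pos, word in enumerate(sent):
            if pos > 0:
                counts[index[word], index[sent[pos - 1]]] += 1
            if pos < len(sent) - 1:
                counts[index[word], v + index[sent[pos + 1]]] += 1
    return vocab, counts


def embed(counts, k):
    """Map each row of the count matrix to k dimensions with a truncated SVD."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vocab, counts = context_matrix(corpus)
# k=3 because this tiny matrix has only three non-zero singular values;
# on real text k would be something like 300.
vectors = embed(counts, k=3)
word_vec = dict(zip(vocab, vectors))

print("ham vs cheese:", cosine(word_vec["ham"], word_vec["cheese"]))
print("i   vs you   :", cosine(word_vec["i"], word_vec["you"]))
print("ham vs i     :", cosine(word_vec["ham"], word_vec["i"]))
```

In this toy setting ham and cheese only ever appear after "ate", and "I" and "you" only ever appear before it, so each pair ends up pointing in the same direction while ham and "I" do not. A real corpus would give much noisier counts and a much larger matrix.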
If we then take any small set of words, and you'll do this in the notebook in just a second, say here's a set of words that I picked, you can take these words and plot them using their first two principal components, in the sort of plot that keeps close words close. Remember, each of these words, like river, is a 300-dimensional vector. But what do you find? You'll find that words like boat and car and truck are close in this 300-dimensional space. Words like dog and cat, which you can't see underneath the truck there, are close in this space. Home and house are not too far away. Nouns are close to each other. Things like listen and carry and eat, the verbs, are in a different region, because verbs are distributionally different from nouns.
And words like agree and disagree are incredibly close in the embedding space. Why? Because they mean the same thing? No. Because they're distributionally similar. I agree with him. I disagree with him. I agree with that. I disagree with that. Agree and disagree are almost identical distributionally. They have very similar embeddings. Subtly different, because one agrees or disagrees with slightly different things, but not very different.
So what have we learned here? We've learned a method, and you will now do this in the notebook, that says: take every word and, based on its context, embed it. And now we have a way to measure, for any pair of words, how similar they are. Then, instead of giving words as input to recurrent neural nets, we will use word embeddings as inputs. It works much better to use a 300-dimensional embedding than a 1-million-dimensional one-hot representation of words.
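To make the one-hot versus embedding contrast concrete, here is a small sketch in NumPy. The five-word vocabulary and the 3-dimensional dense vectors are made up by hand purely for illustration (real embeddings would be learned from text and would have around 300 dimensions), and the helper names `one_hot` and `cosine` are hypothetical.

```python
import numpy as np


def one_hot(index, vocab_size):
    """Return a vector with a single 1 at `index` and zeros elsewhere."""
    vec = np.zeros(vocab_size)
    vec[index] = 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Tiny stand-in vocabulary; a real one would have about a million entries.
vocab = ["a", "apple", "aardvark", "hotel", "motel"]
idx = {w: i for i, w in enumerate(vocab)}
onehot = {w: one_hot(idx[w], len(vocab)) for w in vocab}

# Hand-picked dense vectors (an assumption for illustration, not learned values).
dense = {
    "hotel": np.array([0.9, 0.8, 0.1]),
    "motel": np.array([0.8, 0.9, 0.2]),
    "apple": np.array([0.1, 0.0, 0.9]),
}

# One-hot: every pair of distinct words is equally dissimilar.
print("one-hot hotel/motel:", cosine(onehot["hotel"], onehot["motel"]))
print("one-hot hotel/apple:", cosine(onehot["hotel"], onehot["apple"]))

# Dense: similarity can differ from pair to pair.
print("dense   hotel/motel:", cosine(dense["hotel"], dense["motel"]))
print("dense   hotel/apple:", cosine(dense["hotel"], dense["apple"]))
```

With one-hot vectors, anything a model learns about hotel tells it nothing about motel. With dense vectors, nearby words share most of their input, which is what lets the model generalize from one to the other.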