One of the key tricks to getting natural language processing to work well is to use good vector embeddings. So I want to go into a little more detail on distributional similarity and vector embeddings than we did last week. What did we say last time? Distributional similarity is the idea that words that occur in similar contexts are, by definition, similar: similar in terms of their embeddings by construction, and similar, we hope, in terms of their meaning.

Now note that this works sometimes and doesn't work other times. Think, for example, of "He drank the port." What would the embedding, the context-sensitive embedding of "port", tell us? Well, it's something you can drink, maybe a kind of wine. "He visited the port." That's different. With enough different contexts, we could see that it's the port where boats come in. And there are lots of different meanings that any word in English, or your favorite language, Chinese, might have: "He's on the port, not the starboard, side of the boat." "He needs to port the code to PyTorch," a verb usage. So if we do a context-oblivious embedding, we will just average over all of these different meanings of "port", putting more weight on the more frequent ones. We will move, of course, to context-sensitive embeddings that will then capture these many different meanings.

When we look at distributional similarity, remember always that words that show up in the same contexts are mapped to be close to each other. But that doesn't necessarily mean they have similar meanings or styles or emotions. Often it does; often it works great. And remember later that we will take these initial embeddings as a starting point and then fine-tune them, adjust them for the task we want to do, so they actually capture the meanings that we care about.

Okay, I've said this several times, but it's worth repeating. There are context-oblivious or context-free embeddings, like latent semantic analysis, word2vec, GloVe, or fastText. Each of these maps every word in the vocabulary (I typically used 40,000 words at Penn; when I was at Google, I used a million unique words) to, typically, a 300-dimensional vector. Increasingly, people use context-sensitive embeddings. These embed each token, each byte pair: they map that token and all the surrounding tokens in a big area, dozens or hundreds of words on either side, to a, well, typically 768-dimensional vector. Either of these can be given as inputs to our deep learning models.

Now, there are two basic context-free embedding styles that people use. One is a skip-gram-based embedding; we'll see the other one in a second. The idea of a skip-gram is that we'd like to build a model that maximizes the likelihood of the set of words around a given word of interest. If we're embedding the word "loves", we'd like to be able to predict the probability of seeing words like "the" or "man" or "his" or "son": "The man loves his son." "The", "man", "his", and "son" are context words, and "loves" is our target word of interest. And we'd like to build a model, an embedding, in a second, that maximizes the probability of this. By the way, the "gram" here is as in n-gram: "the man" is a 2-gram, "the man loves" a 3-gram. So "gram" here just means word or token.

Okay, so how do we model these probabilities, the probability of "the" given "loves"? How likely is this word to show up in the context of my target word "loves"?
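To make the setup concrete, here is a minimal sketch of how (target, context) training pairs are extracted from a sentence. This is my own illustration, not part of the lecture: the function name skipgram_pairs and the window size of two words on either side are assumptions for the example.

```python
# Sketch: extracting (target, context) skip-gram pairs from a
# tokenized sentence with a symmetric context window.

def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token position."""
    for t, target in enumerate(tokens):
        # Look at up to `window` tokens on either side of the target.
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the target itself and out-of-range positions
            yield target, tokens[t + j]

sentence = "the man loves his son".split()
for target, context in skipgram_pairs(sentence):
    print(target, "->", context)
# With "loves" as the target, this yields: loves -> the,
# loves -> man, loves -> his, loves -> son
```

Each pair becomes one training example: the model should assign high probability to the observed context word given the target word.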
Well, what we're going to do is compute the probability over all capital-T tokens in my sentence, my string, my document. For each position, we take the probability of each neighboring word, w sub t plus j, which could be to the left or to the right, given the target word w sub t. That's "loves", and these are the words "the", "man", "his", and "son". And the model we can use is really simple. We'll say the probability of each context word, given our target word, is just the softmax of the similarity, the dot product between the embedding of our target word and the embedding of the context word. Now, of course, we need to learn, and that's the whole point, the embedding of my target word "loves" and the embeddings of the context words. So we will use gradient descent to maximize the likelihood, the log likelihood: we pick the embeddings of each of the words such that we maximize this likelihood.

Okay, and there's only one little trick. To estimate the probability of something given something else, we need to know how often "loves" shows up next to "man", and how often "loves" doesn't show up next to "man". I've just shown you it showing up; we also need to count how many times it doesn't show up. Well, it doesn't show up a lot. So, in fact, people do some efficient negative sampling to find occurrences of when it doesn't show up. The details are not so much our problem, because we will use pre-trained models based on these skip-gram embeddings. So the important point is: given the contexts of a word, we can estimate vector embeddings that do a good job of capturing the distributional similarity of words.

Now, there is also a second style of context-free embedding, the continuous bag of words, or CBOW, embedding, which is also often used. Word2vec offers both skip-gram and CBOW; fastText does too. And CBOW, continuous bag of words, looks almost exactly the same, but instead of modeling the probability of each context word given the target word, we now compute the probability of the target word given all of the context words. So it just flips the direction of the probability around. Again, it embeds each of these words, and again it uses maximum likelihood estimation to find embeddings of each of the words that represent these probabilities as accurately as possible. Either skip-gram or continuous bag of words gives very similar context-free embeddings. They're very efficient to compute, one can train them on large data sets, and you can download really nice ones, which you'll get to do.
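As a rough illustration (again, mine, not from the lecture), the gensim library exposes both training styles through a single flag. This is a sketch assuming gensim is installed; the toy corpus and parameter values are made up for the example.

```python
# Sketch: training skip-gram and CBOW word2vec embeddings with gensim.
from gensim.models import Word2Vec

# A toy corpus; in practice you would train on millions of sentences.
sentences = [
    "the man loves his son".split(),
    "the woman loves her daughter".split(),
]

# sg=1 selects skip-gram: predict each context word from the target word.
# negative=5 draws five negative samples per observed (target, context) pair.
skipgram = Word2Vec(sentences, vector_size=300, window=2,
                    min_count=1, sg=1, negative=5)

# sg=0 selects CBOW: predict the target word from its context words.
cbow = Word2Vec(sentences, vector_size=300, window=2,
                min_count=1, sg=0, negative=5)

print(skipgram.wv["loves"].shape)     # (300,) embedding vector
print(cbow.wv.most_similar("loves"))  # nearest neighbors by cosine similarity
```

And rather than training your own, you can fetch pre-trained vectors directly, for example with gensim.downloader.load("word2vec-google-news-300"), which is one way to do the downloading mentioned above.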