Welcome to week 10, Natural Language Processing, or NLP, as we call it. This week we're going to do a quick overview of many of the different kinds of natural language processing tasks, and we'll come back and revisit distributional similarity, which we covered last week with Word2Vec. In addition to the context-oblivious Word2Vec embeddings of words, we'll look at context-sensitive embeddings and multilingual embeddings, which are pretty cool. Then we'll turn to attention, one of the key components used for building fancier models such as context-sensitive embeddings, and in particular we'll build up to BERT, which is currently the answer to, if not all of the world's problems, at least most of natural language processing's problems, and is widely used throughout the commercial world. We'll look at a few variations on BERT, at how to fine-tune it, how to adjust the embeddings (using gradient descent, of course) for the application at hand, and then we'll briefly touch on some of the huge language models, like GPT-3, that are now popular.

So what is natural language processing? A set of techniques for dealing with language: ways to let us communicate with computers, to do translation, eventually to hold conversations with them. Okay, they're currently not super smart, but we're getting there, and we're getting there quite quickly. There are lots of different tasks. I'll go through each of these rather quickly, but many of them take a form that can be passed to a sequence-to-sequence model, a deep learner, a supervised learner: something comes in and something goes out. Typically we first learn a language model, an enormous self-supervised model trained on raw text, and then take that trained model and adapt it to do some sort of supervised learning.

Information retrieval: take in a query like "do I need to learn deep learning?" and map it to the document I would want, or the relevant segment of a document. So it retrieves, it pulls back some document: query in, document or text fragment out.

Contrast that with information extraction. Here we again take in a query and a large number of documents, but instead of returning a document, we return a fact. What is the population of Philadelphia? It is roughly 1.579 million, according to Google, which knows everything. So we have extracted information from a large collection of documents. We'll see how to do this.

Natural language generation: classic sequence to sequence. In comes a prompt or a question, "What fundamental economic and political change, if any, is needed for an effective response to climate change?", and out of the computer comes a response generated with a language model: "Do we want to go through the same process we have been through for decades with no changes? Is there a way to build a sustainable...", and so on. Text in, text out, a sequence-to-sequence model. It works remarkably well. Cool.

Classic natural language processing broke things down using linguistics: it broke text, a sequence of characters, into tokens, labeled the tokens with parts of speech (nouns, verbs, pronouns), recognized named entities (people, places, things), and did co-reference and parsing, all of which we'll see in a second. Many of these are not used in deep learning. The one we will use is tokenization. Into the computer comes a sequence of characters; out comes a sequence of tokens. Note that tokens include things like punctuation. We don't throw those away.
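To make that concrete, here's a minimal sketch of tokenization. It's not from the lecture materials, and it assumes the Hugging Face transformers library with the pretrained "bert-base-uncased" tokenizer standing in for whatever tokenizer you actually use, but it shows the idea: characters in, tokens out, punctuation kept.

```python
# Minimal tokenization sketch (illustrative, not from the lecture).
# Assumes the Hugging Face `transformers` library is installed.
from transformers import AutoTokenizer

# A pretrained WordPiece tokenizer; this particular checkpoint is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The cat sat on the mat, and the kitten was sitting."
tokens = tokenizer.tokenize(text)
print(tokens)
# Punctuation such as ',' and '.' comes out as its own token, and each
# occurrence of "the" is a separate token of the same word.
```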
Note also that, by and large, we don't do stemming. If I have cat, kitten, or kitty, or sat and sitting, we don't truncate sitting to sit, because they mean something different. We keep the tokens as they are. And note the distinction between a word like "the" and a token: the first occurrence of "the" is one token, and the second occurrence is another token. Cool.

So we'll often break the sequence of characters into tokens. People in NLP often then label them with parts of speech, which we mostly won't do. People also recognize named entities. You'll have Peter Strzok, who's a person; Trump, whom he criticized, is also a person; the FBI gets labeled a geopolitical entity, a GPE; and The New York Times is an organization. A system might label things automatically with these entity types.

Co-reference, which we'll see later this week. "Tom was happy that he got a present": the "he" refers back to Tom, not so hard. "Tom gave Bill a present, but he didn't like it": not so easy. Is the "he" Tom or is the "he" Bill? Probably Bill, right? But much harder. A nice task for deep learning.

Parsing, which we also won't really do in this course, because most deep learning systems don't do parsing: taking a sentence and diagramming it in terms of what depends upon what. Okay, classic techniques largely replaced by deep learning, which goes straight sequence to sequence.

What we will mostly do, as we saw last week, is learn language models. Given a sequence of words in a sentence, word one through word t, predict the next word (or the next token) in the sentence, and that next word can be any word in the vocabulary. At Penn, we use small vocabularies, around 40,000 words. At Google, they use larger vocabularies, around a million words. But it's the same idea.

You can also build language models, as we've seen, over characters. If you go to Google and type "s", "a", "n", a space, and then "f", it has a language model that says what the most probable completions are, and given where Google is, the top one is San Francisco. By the way, what the automatic completion is here actually depends upon where you're typing it. Very clever: they use other information in the neural net. But note that the language model is predicting what characters or what words will come next.

We can also take those exact same language models and ask for the probability of a whole sequence of capital-T words, or capital-T embeddings of words. That's just the probability of the first word, times the probability of the second word given the first word, and so on, up to the probability of the T-th word given all the words before it. So given a language model, you can also compute probabilities of sentences. We will find that although this is not directly useful, it is useful to take embeddings of the individual tokens and build embeddings of sentences, which we can then use for classifying sentences. Cool.

So we're going to cover a bunch of the techniques behind this, including how to run the tokenization, how to represent the components that go with the sentence, and how to build these language models, working up this week to context-sensitive embeddings like BERT, which give every token a different embedding depending upon the words before and after it.
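As an illustration of that chain-rule computation (not part of the lecture), here's a minimal sketch that scores a sentence with a small pretrained causal language model. It assumes PyTorch and the Hugging Face transformers library, with the "gpt2" checkpoint standing in for whatever language model you actually use: the same per-position next-token probabilities that drive autocomplete-style predictions are summed (in log space) to give the probability of the whole sentence.

```python
# Sentence log-probability via the chain rule (illustrative sketch).
# Assumes `torch` and Hugging Face `transformers` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

sentence = "Do I need to learn deep learning?"
ids = tokenizer(sentence, return_tensors="pt").input_ids   # token ids, shape (1, T)

with torch.no_grad():
    logits = model(ids).logits                              # shape (1, T, vocab_size)

# The logits at position t-1 give P(w_t | w_1 ... w_{t-1}), so shift by one
# and read off the log-probability of each token that actually came next.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)       # (T-1, vocab_size)
next_ids = ids[0, 1:]                                       # (T-1,)
token_log_probs = log_probs[torch.arange(next_ids.numel()), next_ids]

# Chain rule: log P(w_1 ... w_T) = sum_t log P(w_t | w_1 ... w_{t-1})
# (the first token is conditioned on nothing here, so its term is omitted).
print(token_log_probs.sum().item())
```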