Before we turn to deep learning, let's look in a little more detail at what goes into the actual language processing. The key here is representation. The simplest version is to take words, like avocado, and do a context-free embedding, like word2vec, or to take tokens in context, like "I ate the avocado," and do a context-sensitive embedding. People often do that. But it's becoming increasingly common to use subword encodings, which embed pieces of a word, the most common of these being the byte pair encoding, or BPE. In a subword encoding, the computer automatically finds frequently recurring character subsequences, and these are treated as the fundamental units that get embedded. This helps a lot, because lots of words are misspelled or have never been seen before, and you can always find embeddings for their pieces. So for example, if one saw the word B-I-B-T-E-C-H, BIBTECH, it might be broken into subwords: BIB, which would be embedded based on its frequent contexts, like a bibliography, or maybe a child's bib, it's ambiguous; and TECH, which is great because that shows up in technology and technophobia and technical. By embedding each of these subwords, these substrings, one can now produce an embedding for any word one sees. There is a small segmentation sketch after this section.

Finally, lots of the representations used in advanced systems like BERT, which we'll see later this week, add a number of special tokens to the text. A separation token, SEP, goes between each sentence or each string; if you're concatenating a bunch of tweets, you put a SEP token between each of them. An unknown token, UNK: if you're not using a subword encoding, you need some way to encode a word like BIBTECH that you've never seen, so you replace it with the UNK token. If you're doing padding (truncate or pad, remember), don't put in zeros, because zero means something; put in a special padding token, a PAD token. And finally, when we do BERT, we will have a special token at the beginning of each sequence, the beginning of each tweet, the beginning of each sentence: a CLS token that will represent, in some sense, the entire sequence. So in come characters, out come words, or tokens, or subwords, which are then encoded into vectors we can use for deep learning.

So the typical NLP pipeline: first of all, you tokenize, or extract the byte pair encodings; I'll use those terms more or less interchangeably. Then you map these tokens to some sort of an embedding, either context-oblivious, context-free, like word2vec, or context-sensitive, like BERT; these are trained on huge corpora of billions of words. Then, given that map of embeddings, we train a neural net with those embeddings as input. And finally, for many of the applications, the mapping from token to embedding is part of the neural net, and we'll use gradient descent to fine-tune (technical term) these embeddings so that they do a better job of predicting the actual outputs one wants. So: initially train vector embeddings on huge data sets, then use gradient descent to adjust them so they're a little bit better on the problem that you want to solve.
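To make the subword idea concrete, here is a minimal sketch of greedy longest-match segmentation against a fixed subword vocabulary. The toy vocabulary, the `segment` helper, and the example words are all hypothetical; real BPE learns its subword inventory from corpus frequency statistics rather than from a hand-written list, but the output shape is the same: a word split into embeddable pieces.

```python
# Minimal sketch of greedy longest-match subword segmentation.
# The vocabulary here is a hand-picked toy; real BPE learns its merges.
def segment(word, vocab):
    """Split `word` into the longest subwords found in `vocab`,
    falling back to single characters when nothing matches."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

vocab = {"bib", "tech", "avo", "cado", "the", "ate"}
print(segment("bibtech", vocab))   # ['bib', 'tech']
print(segment("avocado", vocab))   # ['avo', 'cado']
print(segment("xyzzy", vocab))     # unknown word falls back to characters
```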
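As a rough illustration of the special tokens, here is a hedged sketch of how a pair of tokenized segments might be assembled into one fixed-length input: CLS at the front, SEP after each segment, UNK for out-of-vocabulary tokens, and PAD filling out the rest. The token spellings and the `build_input` helper are illustrative stand-ins, not any particular library's API.

```python
CLS, SEP, PAD, UNK = "[CLS]", "[SEP]", "[PAD]", "[UNK]"

def build_input(segments, vocab, max_len=12):
    """Join tokenized segments into one sequence:
    [CLS] seg1 [SEP] seg2 [SEP] ... padded or truncated to max_len."""
    tokens = [CLS]
    for seg in segments:
        tokens += [tok if tok in vocab else UNK for tok in seg]
        tokens.append(SEP)
    tokens = tokens[:max_len]                      # truncate if too long
    tokens += [PAD] * (max_len - len(tokens))      # pad if too short
    return tokens

vocab = {"i", "ate", "the", "avocado"}
print(build_input([["i", "ate", "the", "avocado"],
                   ["bibtech", "ate", "me"]], vocab))
# ['[CLS]', 'i', 'ate', 'the', 'avocado', '[SEP]', '[UNK]', 'ate',
#  '[UNK]', '[SEP]', '[PAD]', '[PAD]']
```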
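And for the last step of the pipeline, here is a minimal PyTorch-style sketch of fine-tuning, under the assumption that pretrained vectors are loaded into an `nn.Embedding` layer and then adjusted by gradient descent along with a small task head. The pretrained matrix, the two-class task, and the batch are all stand-ins; the point is only that the embedding weights receive gradients like any other parameters.

```python
import torch
import torch.nn as nn

# Pretend these came from a word2vec-style pretraining run (stand-in values).
vocab_size, dim = 1000, 50
pretrained = torch.randn(vocab_size, dim)

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize from the pretrained vectors; freeze=False means
        # gradient descent will fine-tune the embeddings for this task.
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.head = nn.Linear(dim, 2)   # e.g., positive / negative

    def forward(self, token_ids):
        vecs = self.emb(token_ids)          # (batch, seq, dim)
        return self.head(vecs.mean(dim=1))  # average-pool, then classify

model = TinyClassifier()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
token_ids = torch.randint(0, vocab_size, (4, 6))   # fake batch of 4 sequences
labels = torch.randint(0, 2, (4,))

loss = nn.functional.cross_entropy(model(token_ids), labels)
loss.backward()
opt.step()   # embeddings move a little toward better task predictions
```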