Okay. So: time series this week. We'll talk a bit about recurrent nets; coming next week are transformers, and then GPT-3, which we'll do last. First, you should all have gotten feedback on your projects. If you didn't, Slack us or email us. Second, we're trying to get AWS credits for your projects.

So this week is time series, and there are a bunch of different ways to look at it. I want to quickly review them and ask you to think about them. The idea of a time series is that we have a sequence of observations. I think of the observation as being a word — I'm a language guy — but it could be a sequence of images, like a video, or a sequence of measurements of a power plant, or anything. The general model is to take the observation, like the word, and map it to an embedding. The fancy term is a context-oblivious embedding: it doesn't depend on the words before or after. Or for an image, take the pixels and map them to an embedding — or use a standard pretrained one. Then you learn something on top of that, like, for this week, a recurrent neural net.

The idea is that at each point you have a neural net which takes in two inputs: the embedding of the current word or image, and the previous output of the neural net — the state. So it takes in the embedding of the word "to", and it takes in the state from before, runs them through however many layers you want, and gives an output. And at each time step you train it by predicting what the next word will be. Supervised or unsupervised — do we have labels? Maybe it's unsupervised: there are no labels. Or maybe it's supervised: you consider the next word to be the label. There's a fancy word for this kind of supervision: self-supervised. So the standard in time series is to use self-supervised learning. It's not unsupervised — we have lots of self-supervised data.

Cool. Questions? This model is called a language model: a mapping from an arbitrary sequence of words to a probability distribution over the next word. Good.
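(As a minimal sketch of that recurrence — all the sizes below are placeholders I've made up, not numbers from the lecture:)

```python
import torch
import torch.nn as nn

# Illustrative sizes: vocabulary, word-embedding width, state width.
VOCAB, EMB, STATE = 10_000, 300, 512

embed = nn.Embedding(VOCAB, EMB)      # context-oblivious word embedding
cell = nn.RNNCell(EMB, STATE)         # takes current embedding + previous state
readout = nn.Linear(STATE, VOCAB)     # state -> scores over the next word

def lm_step(word_ids, state):
    """One time step: embed the current word, update the state,
    and score every candidate next word."""
    x = embed(word_ids)               # (batch, EMB)
    state = cell(x, state)            # (batch, STATE)
    return readout(state), state      # logits over next word, new state
```

The "label" at each step is just the next word in the data — that is all self-supervision means here.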
Okay, so we've got basically three uses, which I'm going to go through — I've taken the slides from the worksheet: word tagging, sentence labeling, and generative modeling.

In word tagging, we take each word and embed it with a context-oblivious embedding — a word2vec-style mapping — and feed it to the neural net along with the previous state. We get a new output, which is what we'd call, inside a hidden Markov model or a Kalman filter, a state estimate. The output of the neural net after I've seen "I went to" is a vector that sums up everything we know about the past that's likely to be useful for predicting the future. The assumption that makes that work is the one from 520 — who's the Russian dude? Markov, yes. The Markov assumption: we assume the world is Markovian. In a recurrent net, given the state vector here, the rest of the past tells me nothing more about the future.

This gives us a different kind of embedding. Note there are two different embeddings here. One is the embedding of a word like "went" or "to" or "I": it maps from a word to a vector. The other maps from all of the past to something that summarizes the past — an embedding not just of "to", but of the whole history. That embedding is called a state, or a context-sensitive embedding, because it depends on the context — in this case, the left context. Make sense?

Now we can take that state vector, run it through a neural net, and predict a label. Note this is semi-supervised learning: lots of self-supervision to learn the mapping from a sequence of words to a state, plus a smaller set of labels for the mapping from state to tag — is it a preposition? A verb? A pronoun? A conjunction? Good.

Before I leave that: this is all left-to-right. If you're doing something like speech recognition, you mostly want to go left-to-right — you take the past up to the current moment, take the speech, produce the words, and you're done; you can't wait for the future. But for, say, labeling parts of speech, you might want to use the right context as well as the left. How would you do that? Bidirectional, right? In addition to running an RNN from left to right, past to future, you run a second RNN from future to past. Mathematically it's exactly the same thing — assuming, of course, that you know the future, which works great offline.

You can also take a whole sequence, take the state at the end of it, and use that to label the whole sentence: what's the topic of this sentence? And you can take that whole-sentence embedding and feed it into a second neural net. The first neural net, called the encoder, takes in a sequence of words and produces the vector that best summarizes that sequence; we feed that into a second neural net, the decoder, which is trained to produce words — and it then, for example, translates. That's not how current translation works — we'll come to why in a second — but for a while these were used, and they may be used again. There's a lot of argument in NLP right now about whether transformers will be the right thing forever or whether recurrent nets will come back.
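(Going back to the tagger for a second — a sketch of the bidirectional version in PyTorch, with made-up layer sizes and tag-set size:)

```python
import torch.nn as nn

EMB, STATE, N_TAGS = 300, 256, 17     # illustrative; 17 ~ a part-of-speech tag set

class BiRNNTagger(nn.Module):
    def __init__(self):
        super().__init__()
        # One RNN runs past -> future, a second runs future -> past.
        self.rnn = nn.LSTM(EMB, STATE, bidirectional=True, batch_first=True)
        # The context-sensitive embedding (both directions) feeds a tag classifier.
        self.tag = nn.Linear(2 * STATE, N_TAGS)

    def forward(self, word_embeddings):         # (batch, seq_len, EMB)
        states, _ = self.rnn(word_embeddings)   # (batch, seq_len, 2*STATE)
        return self.tag(states)                 # one tag score vector per word
```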
Questions? I'm going to start asking questions, then. Okay, so let's think about a bunch of problems. The simplest one: we've got a big image, and we want to find the dog, or a computer, or something in it. Ask the question: is there a dog in this image, and if so, where? How do you code that up? A CNN is the obvious way. So how does the CNN search through an image? "It goes sequentially." Does the CNN really go sequentially? This really matters to you as computer scientists, because these things are computationally expensive. It's sequential in the sense that you have a whole bunch of filters at different locations. But let me ask again: does the computer compute them sequentially? It's parallel. The whole thing can be executed in parallel: you take all of those little filters at all those locations and slam them through at once. We did some calculations earlier, when we did CNNs, which said: you want enough memory on your GPU to do them all at once. So "sequential" is probably the wrong word. The same filters are applied over lots of locations. Conceptually, you can imagine moving the filter across the image — here, here, here, here — one location after another. In practice, modern deep learning does all of them at the same time. Why is it so important to do them at the same time rather than sequentially? Faster, right? If you're going to check 100 different locations, it's ballpark 100 times faster to do them all at once than one after another, and that factor of 100 really matters — these things are slow. So mostly, CNNs run in parallel over a bunch of windows.

Now, one object could be big and one could be small — a small dog or a big dog. How do you find a small dog or a big dog in an image? What assumption do you make when you design the neural net? It's not size-invariant: you have to put in filters of different sizes. Could be a five-by-five filter, could be a nine-by-nine filter. So you're sort of assuming what sizes they might be. (There's a sketch of this after the video discussion below.)

Now say you wanted to find a dog in a video. I'm going to stream a video, and we're going to have lots of frames — thousands, millions; videos can be long. How are you going to find all the dogs in a video? Look at each frame one by one, maybe. Maybe. Is there a way to get more information than looking at each frame separately? "Sample periodically instead of continuously." Yeah — and maybe the motion matters; maybe it's not just the image. Can you recognize a person from behind by their gait? Mostly yes, actually. And dogs, compared to people, have fairly characteristic gaits and motions. So maybe for video you don't want to look at a single frame; you want to look at a sequence of frames and at the changes in them. Maybe the dog is partly there, then partly vanishes, then shows up again — you never see a whole dog in one frame, but you see a little slice of a dog as it goes past a window. So there are lots of reasons you might want a longer stretch.

Can I put the whole video in at once, the same way I put a whole image in at once? Not going to work, right? Videos are too long. And what was the suggestion? Identify the frames with lots of changes and look at those. Makes sense, as long as the dogs are moving — it means you won't see a motionless dog.
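(Here's the filters-of-different-sizes idea mentioned above, sketched as two parallel convolution branches — the channel counts are invented:)

```python
import torch
import torch.nn as nn

class MultiScaleDetector(nn.Module):
    """Small and large filters over the same image, so both small dogs
    and big dogs can light up somewhere in the feature map."""
    def __init__(self):
        super().__init__()
        self.small = nn.Conv2d(3, 16, kernel_size=5, padding=2)  # 5x5 filters
        self.large = nn.Conv2d(3, 16, kernel_size=9, padding=4)  # 9x9 filters

    def forward(self, img):                     # (batch, 3, H, W)
        # Both branches run over every location in one shot; "sliding the
        # filter across the image" is only the conceptual picture.
        return torch.cat([self.small(img), self.large(img)], dim=1)
```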
Are people better at recognizing moving dogs or still dogs? If I show you a still image — Where's Waldo — versus an image with Waldo or the dog moving in it, it's really easy to identify things that are changing; your eye is drawn to things that change. So it's often a good idea — though you'll miss that sleeping dog in the corner.

So if you've got a sequence, you're not going to be able to do the whole thing at once. What are the two alternative architectures? I want to recognize a moving dog in a video, in a deep learning style — I don't want too much preprocessing and feature engineering to find where things are moving; I just want to put the video into my deep learning thing. What kind of architecture could I use? The input is a sequence of images — image at time one, image, image, image (I know the remote people can't see this) — and the output is dog / no dog. An RNN is one way to do it, right? The recurrent neural net says: for each of these images, I send in an embedding of that image, and I run a recurrent net over the sequence. Makes sense. So the RNN is the obvious thing to do. What else might I do instead? What would you have done before RNNs? Yep: a CNN. How do I do a CNN here? I could make the input to my CNN a tensor: X by Y by RGB by the number of frames in my window.

What's better, an RNN or a CNN? Anybody who's taken 520 with me knows the answer: it depends, right? So let's look at the costs — the cost of the computations. One question is: what's your sequence like? How many images do you have? You've got N images, and in a recurrent neural net you have to do N things sequentially: first image, then second, then third. Just to predict at the Nth image, you have to have processed the (N-1)st, the (N-2)nd, the (N-3)rd, and so on. That's order-N sequential operations, one after the other. This is the reason that people in my world, which is NLP, don't use recurrent neural nets much anymore: they don't want to do N things one after another; they want to do it all at once. The convolutional neural net? The whole thing goes in at once. No recurrence: one sequential operation.

Cool, but we're in computer science — no free lunch. What's the complexity per layer? If I have a representation of size D — my embedding of my image or my word, maybe a 300- or 600-dimensional embedding — then a recurrent neural net takes roughly N times D-squared operations per layer. The convolutional network depends on the kernel size K — how big the window is — and takes more like K times N times D-squared. So the convolutional net requires more computation and memory per layer, but no unrolling over time. Does that make sense? You can be parallel, or you can be sequential: no free lunch.

In the current computing world, what do people mostly do — parallel or sequential? Parallel is winning at the moment, right? If you have enough memory on your GPU, you run something more like a convolutional neural net or a transformer — which we'll cover next week, and which is in fact trained like an autoencoder and runs in parallel, not sequentially.
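(To make the sequential-versus-parallel point concrete, a toy comparison — the RNN has to loop N times, while the convolution consumes the whole sequence in one call; the shapes are invented:)

```python
import torch
import torch.nn as nn

N, D = 1000, 300                      # sequence length, embedding size
frames = torch.randn(1, N, D)         # one embedded video/text sequence

# RNN: O(N) dependent steps, roughly N * D^2 work per layer.
cell = nn.RNNCell(D, D)
state = torch.zeros(1, D)
for t in range(N):                    # each step needs the previous step's output
    state = cell(frames[:, t], state)

# CNN: one parallel call, roughly K * N * D^2 work per layer, no unrolling.
K = 9
conv = nn.Conv1d(D, D, kernel_size=K, padding=K // 2)
out = conv(frames.transpose(1, 2))    # (1, D, N) in, (1, D, N) out, all at once
```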
So transformers are currently replacing recurrence at places like Google, Baidu, and Facebook — first in NLP and now also in vision — because they're faster: you can do it in parallel. The difference? With a recurrent neural net, what do you have to pick in terms of how much history you keep? Nothing — recurrence goes back forever. Whereas for a CNN, you have to pick a window: how big a window do I care about?

Okay. Here's a different time series — this is an EEG. As input for the CNN, what we put in is all of the word embeddings, or all of the image embeddings, within the window, as one big input, and you learn from the whole length of the video. For each window, you either predict what comes next — that's your language model: the next word, the next embedding, given the preceding ones — or you predict some sort of label. So with a CNN for a time series, at each time point you have a window's worth of history, the K preceding observations, and you predict, say, the next word. Over and over again — but at each point it's one shot. Boom: in go the last K words, out comes the next word. Repeat at each time step. There's nothing sequential in the sense of needing one output before you can compute the next; each prediction stands on its own. In a truly sequential model, like a hidden Markov model or a Kalman filter, there's no way to predict in one shot: it requires sequentially updating at each time step. Does that make sense? Very important point. It's worth covering both recurrent neural nets and transformers, because depending on the decade and the architecture, one is better or worse. There is no one that uniformly dominates. All of that said, transformers are increasingly winning right now.

Good. Okay: EEG. You want to find a seizure. This is what a seizure looks like — a time series, a bunch of channels. What sort of model do I want, CNN or RNN? "CNN, because you're looking for a spot where it happens; the long history doesn't matter." We've got two votes for CNN, and in fact the winner is CNN — what a surprise. But it might slightly depend on how you ask the question, because some people believe that before you have a seizure, things change very subtly — some people can tell hours ahead of time that it's not a good day, that they should do something differently today. If we take that view, then a plain CNN sounds like the wrong idea: we want to see that something subtly changed ahead of time. I might want to do what I've been playing with: a CNN with a short time window, plus something that summarizes and averages the data to give features over a much longer horizon. I agree.
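(A sketch of that hybrid — a 1-D CNN over a short EEG window, concatenated with slow, hand-built summary features over a much longer horizon. The electrode count, window length, and feature dimension are all assumptions:)

```python
import torch
import torch.nn as nn

CHANNELS, WINDOW, SLOW_FEATS = 19, 512, 8   # e.g. 19 electrodes; sizes invented

class SeizureNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Short-window CNN: looks for the local signature in the signal.
        self.cnn = nn.Sequential(
            nn.Conv1d(CHANNELS, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # The classifier also sees slow features (e.g. band power averaged
        # over the preceding hours) supplied by domain knowledge.
        self.head = nn.Linear(32 + SLOW_FEATS, 2)      # seizure / no seizure

    def forward(self, window, slow_feats):  # (B, CHANNELS, WINDOW), (B, SLOW_FEATS)
        z = self.cnn(window).squeeze(-1)    # (B, 32)
        return self.head(torch.cat([z, slow_feats], dim=1))
```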
Often you want to know what's happening in the preceding two days; the real goal is not to tell when you're having a seizure — that's obvious — I'd like to know the day before. "And that strategy Lyle just sketched is very important. In the context of a CNN, if we want to build this long time scale back in, the way we do it is by bringing domain knowledge about what other features to use. So that highlights when we'd want that strategy versus a plain CNN strategy. If we're totally clueless about the right features, what's our choice? We don't know how to summarize the past — and a CNN can basically learn how to summarize the past."

In lots of the projects, people ask: how do I do something innovative? I think innovative almost always means understanding something about your world. It turns out that if you want to do a project on EEGs and you have no idea about them, it's really hard to be innovative, because you don't understand the time scales on which things happen or the different questions you might ask. If you do a Kaggle competition to recognize from the EEG when a seizure is happening, you're missing the stuff that's actually interesting — those systems are 98-point-something percent accurate; the problem is boring. But "will I have a seizure tomorrow?" — that one is still open, and it's highly relevant. Brian Litt's lab is doing exactly that: they want to know whether we can stimulate the brain to stop the episode before it happens, so they would very much like some code that tells them what's going to happen. Cool. Very good.

Okay, looking forward to next week: you want to answer a question from a Wikipedia page — "What year was Benjamin Franklin born?" Long page. We need some sort of attention, right? That's going to be the big piece here. So let's review what attention looks like, which I think is the biggest, hardest concept in all of NLP.

We have a sequence of words — here it's French, the classic example "il m'a entarté": "he hit me with a pie." You embed each word with a vector and give it to an RNN: that's the encoder. For sequence-to-sequence, you then have a decoder that takes over and predicts the output one word at a time: "he hit me with a..." — predict the next word. But for a long input, you'd like not to have to remember a whole book by Camus in one state vector; you'd like to translate as you go along. Makes sense. So when I predict the next word, I want to take in both the current embedding of where I am so far in my output, and all of the source words, to see which of them are most relevant. There are a bunch of ways of computing similarity, or attention — they're in the worksheet — but you can think of taking a dot product: the embedding of "entarté" against the embedding of the sentence generated up to here. How similar is that word embedding to the embedding of the sentence so far? And what you'll find is that English and French are fairly similar in their alignment — often the most recent source word is the most important. And I just want to emphasize the cool property of French that "hit with a pie" is a single word. There's not a one-to-one mapping between languages — most people in this audience are multilingual; they know that one word of English may be two or three words in Hindi or Chinese, or vice versa.
That French has a single word for "hit with a pie" says something about their priorities. We're missing it in English. I know, it's terrible. Why is there no English word for pieing? "Pie-facing." Someone asked that once; I don't know.

Question. So note the key point: every word in the input, every vector here, gets an attention score, which is all learned — everything is neural nets. It says how similar, in some sense, each of these embeddings is to the one I'm looking at currently, and then combines them together. And in a way it mirrors the way you would probably do it: if you were answering the question, you'd find where exactly the answer is. The attention score says how similar the embedding of this word is to the embedding of that word in context — the state — and that tells you how relevant it is. As you're translating, looking for the next word, you can imagine a long text: the stuff you heard many, many sentences ago is probably not relevant, the most recent word is probably highly relevant, and the similarity in the embedding space tells you what's relevant.

Note that sometimes things from a while ago are relevant. "Last night I watched Clinton. She gave a great speech." The "she" refers to Clinton, and you need to know that. Languages differ: in some languages, like English, "Clinton, she", "Clinton, he", "Clinton, they" are all different; in lots of other languages they aren't. So if you're translating into English from a gender-free language, where there is no she or he, you need to figure out the gender of the pronoun. Was it Clinton? Was it someone else? Oh, crap — maybe that Clinton isn't enough, because there are two Clintons, a he-Clinton and a she-Clinton. Maybe you have to go farther back, to where it says "Secretary of State Clinton" — which is definitely a she. So often you have to pay attention somewhere fairly far back to disambiguate, although mostly you pay attention to the most recent words.

"There's another side to attention scores as well. Functionally, what is the attention score doing? It's basically opening the floodgates — deciding which things we should let through." And both interpretations are there at the same time; both are meaningful.

That makes sense. The magic is the similarity between the state embedding of this and the state embedding of that. There are a couple of different kinds of similarity — cosine is the easiest — and it tells you something about relevance. And note that everything is neural nets all the way through, still trained by gradient descent. The encoders are trained so that that similarity metric in encoding space approximates usefulness for answering the questions we want to answer. Sure — because we're doing gradient descent on predicting "pie", and for "pie", if you don't look at "entarté", you're not going to get it. Highly relevant.

And in fact, in a modern version — we'll see next time — instead of words, people use subword embeddings. So "entarté" might actually be split into "en", "tart" — this English "tart", pie — and "é", the past tense. So in a modern version, instead of words going in, you'd have subword units or byte-pair codes, chunks of words, and you'd learn that the subword "tart" is in fact tied to "pie".
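(The dot-product attention from a moment ago, in a few lines — one decoder state scoring every encoder state; the dimensions are illustrative:)

```python
import torch
import torch.nn.functional as F

D = 512
enc_states = torch.randn(4, D)   # one contextual vector per source word: "il", "m'", "a", "entarté"
dec_state = torch.randn(D)       # summary of the translation produced so far

scores = enc_states @ dec_state        # dot-product similarity: one score per source word
weights = F.softmax(scores, dim=0)     # attention weights: which source words matter now?
context = weights @ enc_states         # weighted sum, used when predicting the next word
```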
This makes sense. This is the key thing that gets away from the really local RNNs that don't remember much about history — they can't remember a whole long document. Cool.

Now, instead of sequence-to-sequence, we could take a document and put in a query — ask a question like "Who threw the pie?" or "What was thrown?" Now you take an embedding of the question — we call it the query — and ask how similar the embedding of my query is to each of the contextualized word embeddings in a whole long document. In general, "who threw the pie" is going to be most similar to — well, I have to be careful — the "pie" part is similar to "entarté", but the "who" is similar to the "he". And a learned system now learns: given a query embedding, find the most similar part of a long document — a book — pull that out, and learn, ah, this embedding is the one that matters. Send that through a decoder and say who it was — which of course now requires more context: who is the "he"? I don't get that from the short sentence, but presumably there was some Joseph or somebody earlier. Makes sense. Same idea, but instead of sequence-to-sequence, it's query-to-document. Good.

Now one more concept — oh no, costs first. Let's see, do I have the other one? How much these things cost. Actually, one more concept before that: self-attention, which is the cool one. When you're actually training a transformer, or a modern neural net, you can not only pay attention to an outside query: for each word in the document, within a window, you ask how similar it is to all the other words. So instead of paying attention with an outside query, for each word in the sentence: how similar is it to each other word in the sentence? How much attention should it pay? How would you resolve "When Clinton was visiting Iraq, she did X"? The "she" will tie closely to "Clinton", because those embeddings will be similar enough. And in terms of predicting future words — it's still self-trained — that's going to help. Does that make sense? Instead of a query and a document, you take each word in the sentence and ask how similar it is to the other words.

By the way, how does that scale with the length of the sentence? This is a really obvious but important point. I've got a sequence of N words: how many pairwise self-similarities are there? Ballpark N squared. It's quadratic in the sequence length. So what does that mean if you're building a transformer? How big can you reasonably go? 10 squared? No problem. 100 squared? Fine. 1,000 squared? Okay. 10,000 squared? Ouch. So how long can a transformer reasonably be, order of magnitude? About a thousand. I mostly use 512 as my standard BERT size, but now we're quibbling over details. The point is: self-attention is quadratic.

Okay. Complexity of self-attention: it's simultaneous — there's no history; it's not a recurrent neural net that we unroll piece by piece. You look at everything at once, all pairwise connections. How much does it cost? N squared for the sequence length, times D, the embedding dimension.
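(And the self-attention version — every word attends to every other word, which is exactly where the N² comes from. A minimal single-head sketch, without the learned query/key/value projections, just to show the shape of the computation:)

```python
import torch
import torch.nn.functional as F

N, D = 1000, 512
X = torch.randn(N, D)                  # one vector per word in the window

scores = X @ X.T / D ** 0.5            # (N, N): all pairwise similarities -- the N^2 cost
weights = F.softmax(scores, dim=-1)    # each word's attention over all the others
Y = weights @ X                        # (N, D): each word re-summarized by its neighbors
# Total work ~ N^2 * D, versus ~ N * D^2 per layer for the recurrent net.
```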
Or, if you want, you can restrict it: instead of attending over the whole sequence of length N, you say, hey, I'll take a window of 512 words, or a thousand words, and then the cost is the window size — typically around a thousand — times N times D. So you pay some cost again. But compare it to the RNN: self-attention is quadratic in the sequence length and linear in the embedding dimension, where the recurrent net is linear in the sequence length and quadratic in the embedding dimension.

"So I'm always confused — how would such a transformer deal with triple negation? The classic sentence: 'It's a lie if I tell you that it wouldn't be true that Conrad is not the professor.' Given that attention is basically quadratic — pairwise — rather than cubic, it feels like there must be limitations. And of course there are maybe ways of avoiding them." I think the answer is, first of all, that humans are quite bad at center embedding: "the cat the dog the rat bit ate ran..." — I can't even say it right, because I can't actually do the embedding. People aren't really good at these either, even though linguists say they're grammatically possible. The other piece — and I haven't tried the triple negations; note that many languages actually use multiple negation, and standard English technically doesn't — is that you have to think of each piece being embedded. The attention tells you these negations are all related pairwise to each other, but we have a multilayer neural net here. What you'll see, in general, is something that feels like a CNN, although the attention is global across the window. You might have a 10- or 12-layer-deep network — pretty standard — and the first layer embeds "I didn't go"; "it's not true that I didn't go" gets embedded a layer or two up; and "I lied when I said that it's not true that I didn't go" gets embedded around layer eight or nine. So you start to see the pieces composing. If this were a linear model, you'd be guaranteed to be screwed — it wouldn't work. But remember, these are moderately deep networks, typically six to twelve layers, and that allows the triple negation: each layer can absorb one more level of negation. Cool. Fun.

Questions? I mean, self-attention was such a breakthrough, and I think it's not obvious — it wasn't obvious to me; I found it very confusing when I first read it. "I did too."

Okay. So again: why are people mostly using transformers rather than recurrent neural nets for language today? The answer is on the slide here somewhere. Parallelism, right. The recurrent neural net means you have to do N sequential operations across the sequence; the transformer is done in parallel.

One other thing I did not say: recurrent neural nets are self-supervised by predicting the next word. Transformers are mostly trained like an autoencoder: take 10% of the words, blank them out, and predict them. So instead of going sequentially, it says: take your 600-word sequence, drop out 10% of the words — 60 of them — and predict the missing words from the other ones. And the self-attention helps do that.
Because it says: each of my removed words is best predicted by some of the other 600 words, and in general the ones that are physically close will matter most — but not always. How long can word dependencies be? "I went to a coffee shop last Tuesday with Conrad. We talked a bit. We discussed both coding languages and what we like to drink. And his favorite Java is actually..." — oh dear, that was a bad one. His favorite coding language is not Java; I can promise that. So note that it could be a long context: is it Java the language, or is "his favorite beverage" the reading? And who is "he" — the I or the Conrad? Some of the signal goes back hundreds of words. "And in your example, yeah — both what you mean by Java and what you mean by he." Long-range dependencies. So the transformer really does help: it gives you long-range dependencies. It also means that when you go to generative models, like sequence-to-sequence, although you're predicting one word ahead, it predicts things that give you long sequences of reasonable predictions.

What's surprising now is that computer vision all of a sudden has transformers too. Vision was never supposed to need these things — we told you already, CNNs are the right thing for vision — but I watch the vision people getting more and more transformer-y. It's coming. Transformers eat everything this decade.

"Okay, but look — there's a way you can view transformers as a generalization of convnets, namely: with convnets we decide to use the same filter everywhere. But why? It kind of makes no sense to burn our compute on local operations that are pointless." The reason you make that assumption is that you don't have enough training data. Is the top of an image typically the same as the bottom? Take Flickr or YouTube: no. "My PhD thesis was on this — we put a camera on the head of a cat that runs through the forest, and we did a statistical analysis of what's above the horizon versus below. It's totally different. So making the whole image convolutional embodies an assumption that's wrong, namely that the world is the same all over the place. It was super useful when we didn't have the compute to do anything else. Now we have more compute, so we can relax that assumption a bunch." That's right. And in that sense, as we go toward larger models and more compute, transformers might end up eating the convnets.

What's the big difference between AlexNet, the classic CNN, and a transformer in terms of the type of learning? What kind of learning is AlexNet, the original CNN? Supervised, with labels — and we can never get enough labels; even Google can't. What does the transformer take? It's an autoencoder: it's self-supervised; it doesn't need the labels. The vision people now, unlike ten years ago, have the GPU power to actually train big models on unlabeled data sets. And what do they do? They take an image, they punch holes in it, and they ask their systems: tell me what's in the hole. And that's a very well-defined problem.
In fact, it's exactly equivalent to the way we train transformers, and that's why vision is moving in the direction of transformers.

"Are RNNs also used to predict text today?" Most of the big commercial products, like Google's, are transformers; there are specialized products still using RNNs.

There are two different questions here. One is the type of self-supervision: you can be self-supervised by predicting the next word in a sequence, like an RNN, or self-supervised like an autoencoder, like a transformer, by predicting missing words. That's one question — where the label comes from: predict the next word, or predict the missing word. The other question is self-attention, which is a different question. Self-attention says: I'm going to measure the similarity between each word and each other word in my N-word sequence — N squared of them; every word gets a similarity to all N-1 others. Makes sense? And it works well with the missing-word objective, because now I can ask: how much attention should each other word get when predicting the missing one?

"If self-attention and next-word self-supervision have equal performance predicting the next word..." — what did I say about the accuracy of transformers versus RNNs today? Did I ever mention accuracy? "You did not." I did not. I said transformers were better. How? There are only two things that matter in computer science: accuracy and speed. Transformers are better because they're faster. Why does that make them more accurate? It doesn't, by itself. So why do I care about faster — how does that lead to more accuracy? You can give them more data. If I've got half a trillion words of English, I need something reasonably fast. You can easily burn through $100,000 worth of GPU time training a model — that's going to be hard for you to do at home during COVID — and in 2019 you could burn through $10 million training one model. Well, the $10 million is the list price. If you look at what Google actually pays, it's not $10 million — that's what I would pay if I were doing that much computing; they get a quantity discount. So those are slightly bogus numbers. But we're quibbling.

"Hold on — one more thing about the question. One part of it, at some level, is about the cost functions we want to use. Our goal is: we have the data sets we have, and we want to get as many bits of training signal out of them as possible. And we only get so many labels — on the grand scale of things, the world just doesn't give us enough labels; they're always expensive. So as we go into the big-data regime, all the models switch to predicting the data instead, because you just don't have enough labels. The other part is an architectural feature, which is self-attention. Those two are kind of orthogonal to one another: we want good architectures, including self-attention, and we also want good cost functions." And that's true — though they're synergistic in this case. It's at least slightly the case that the masking — dropping out 10% of the words — plays well with the architecture of self-attention, because the self-attention says: hey, great, predict this word.
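(The masked objective, sketched: blank out roughly 10% of the tokens and score the model only on the blanks. The mask-token id and rate here are assumptions in the spirit of BERT-style training, not the lecture's exact recipe:)

```python
import torch

MASK_ID, MASK_RATE = 0, 0.10     # assumed special token id and masking rate

def mask_tokens(token_ids):
    """Return (corrupted input, targets): targets are -100 (ignored) everywhere
    except the masked positions, which hold the original word."""
    mask = torch.rand(token_ids.shape) < MASK_RATE
    targets = torch.where(mask, token_ids, torch.full_like(token_ids, -100))
    corrupted = torch.where(mask, torch.full_like(token_ids, MASK_ID), token_ids)
    return corrupted, targets

# F.cross_entropy(logits.view(-1, VOCAB), targets.view(-1), ignore_index=-100)
# then trains the model to fill in each blank from all the surrounding words.
```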
Note also that this is unlike an RNN, which is left-to-right and gives priority to local things. The transformer says: here's a whole block; everything in the window is equally good. And that then requires self-attention to work really well — otherwise you'd have to use a smaller CNN. So it's a question of how much locality of effect you assume. The RNN says things in the past are exponentially unimportant. The transformer says everything in the window is equally important — or learnable — and anything outside the window, I'm lost; I don't know about it. So for the transformer it's actually hard to learn to be local, and for the RNN it's hard to learn to be non-local. Right. Makes sense.

Now, the nice thing is that the attention does eventually learn to be local. And one thing they do in transformers is put an index in: each word in your thousand-word sequence gets a number, one to a thousand, or some encoding of it. So they tell the model something about sequence position, and thereby it can learn locality.

"Do transformers only work when you have both the past and the future?" You can still use them when you only have the past. In fact, things like GPT-3 predict the future based only on the past — that's how they generate new text. Usually you have past and future at training time; at run time you often only have the past. So it's very common to have a sequence of words and use a transformer to predict the next word. Or for speech recognition: the transformer embeds a bunch of frequency spectra, and you predict — with very little supervision — what letters show up for those sounds. Makes sense? Good question. Thank you.

I did mention that attention does not have as much locality as an RNN. With an RNN, when you're making a prediction, all that matters is the current measurement and the output of the preceding step — it's got a Markov assumption. With attention, everything in the window contributes to everything in the window, so the locality for the transformer is the window size. Ten years ago that would have been hopeless, because memory was too small. Now you can afford a reasonable window size: you can put in 600 or a thousand words, each with a 300- or 600-dimensional embedding, and it all fits in memory, and you do the whole thing in parallel. So again — and I'm not an architecture guy — it astounds me how much of my machine learning is driven by computer architecture. Students say: I'm an NLP guy; why do I have to learn architecture? I say: you don't, unless you actually want to run on any real data. "Part of the reason AlexNet was so popular is that they started taking hardware seriously — that's where their big jump in performance came from. They optimized everything, including the fact that they had two graphics cards."
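(Going back to the position index mentioned a moment ago: one simple version is a learned embedding per position, added to the word embedding, so the otherwise order-blind attention window knows where each word sits. Sizes are placeholders:)

```python
import torch
import torch.nn as nn

VOCAB, EMB, MAX_LEN = 10_000, 512, 1000

word_emb = nn.Embedding(VOCAB, EMB)
pos_emb = nn.Embedding(MAX_LEN, EMB)    # one learned vector per position 0..999

def embed(token_ids):                   # (batch, seq_len)
    positions = torch.arange(token_ids.size(1), device=token_ids.device)
    # Without the position term, self-attention treats the window as an
    # unordered bag: "dog bites man" and "man bites dog" look identical.
    return word_emb(token_ids) + pos_emb(positions)
```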
"Yeah — I wanted to say one last thing about this. If we have an RNN that steps along, then at any point in time we have a summary of the past, and in principle that summary could be anything — it could capture everything up to that time. But even if there were an infinite amount in there, and it could in principle do that, we still have an inductive bias: the gradients basically make everything that is really far away see really, really weak gradient signal. So the training procedure very strongly biases RNNs toward telling you about the recent past. And the reason we used LSTMs for a long time — and we did use LSTMs here — is that they somewhat limit that effect. Not as bad as a plain RNN, but still not as good as transformers, because in a transformer everything is directly connected all the way through: it's much easier to learn; the gradients look much better. The transformer can learn that far-away stuff is less important, but it doesn't have to fight the architecture to use it." Yeah — there's always an inductive bias; the two biases are just different. Cool.

Questions on this? This is the core of contemporary NLP. The other thing to note is that you can actually do both at once: you can have a recurrent net underneath the hood for the representation, and then a transformer on top of it — you can mix and match these things. I've tried to keep them conceptually clean, but in practice people often merge them: a transformer-style connection with an underlying recurrent neural net model. You don't have to pick one or the other. Cool.

Okay, one more. Another thing that's popular these days is to ask a question about an image. Here's an image as input, and the question is: "What is sitting in the basket on a bicycle?" That's a weird question, but it's the one I stole. What is that, anyway? I thought it might be a goat or something fun, but no — the answer seems to be two dogs. Conrad has got the answer. But the question is not what the answer is; the question is: what architecture do you set up in your deep learning model to answer questions about images like this? What do you think you do when you solve it? I want something on the image, which will be CNN-like in the current architecture — though maybe not next year — and an RNN or a transformer on the language.

"And I want to push you on this, because I think it's very important: what did you first do — look at the picture or read the text?" Read the text. "So you started with the text; you embedded the text. Okay — then I kind of know there must be a bicycle with a basket on top of it and something sitting in it, which has my big question mark on it: an embedding of, broadly, that visual question. And then what did you do?" I need to connect the text to the image. "But a picture of a dog is not spelled d-o-g." So half of that is correct: you want an embedding of the language to match an embedding of the image.
But do I want to embed all of the language and all of the image? I could take the whole image, use the penultimate layer of an autoencoder or something, and get a thousand-dimensional image embedding. "Yeah, but there's also a red bike." There is a red bike. How would we embed the red bike? That's a trick question — there's a hint. What have we even been talking about for the last half hour? What's the one-word summary? It wasn't convnets; it wasn't even transformers. Attention. Right.

If you do eye tracking while people look at this — it's great; you can actually monitor it — when you read this question, you see people first look at the bicycle, then look into the basket, then look above it, and then their eyes bounce back and forth at the dogs. People attend. You can't actually take in a large image all at once — I can't see this whole room without scanning across it.

So a typical model: take in the question and run some sort of CNN or LSTM that produces embeddings of the query. Take in the image with some sort of CNN — but note that within the image there are different pieces, depending on where you're looking. Now you want to match different pieces of the query — the "what", the "sitting", the "basket", the "bicycle" — to different pieces of the image. The "bicycle" embedding should match well here and here; "basket" should match that one; "basket on bicycle" is sort of this one; "sitting" isn't quite there — it's going to be contextual. The point is that for a long query and even a moderately sized image, you want attention over the words and attention over the image. Not all of the image is relevant to the question, and trying to capture a whole image in a single vector is too much information — there's too much going on. If I asked you what's to the left of the basket on the bicycle, then I'd care about the red bicycle. What's behind the bicycle? Hard to tell — looks almost like a chalkboard; it's a window.

But note that this is attention. In the old days — like, the stuff we covered three weeks ago — there was this really funny notion that every picture is of one thing: here's a picture of a dog, or a picture of Conrad, or a picture of a piece of chalk, and you say what it is. That's nice, but that's not reality. Reality is there's Conrad and a piece of chalk and lots of other things. We'll talk about this more next week: lots of problems are multimodal — you have images and text, or sound and images — and you typically need attention on the images and on the questions. And if you look at how people or computers look: people's eyes bounce around sequentially; you look back and forth from one thing to Conrad, you look over there, focusing on different pieces depending on the prompt, the query, the question. When I say "chalk", I'm going to focus over here more. Makes sense. And computers are subject to the same sort of computational limits we are — abstractly; we can quibble over the details, and Conrad is not going to argue too much. Therefore, to answer a query, the model needs to look across the image, find the little pieces, and ask: where in the image is what I'm looking for?
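(A sketch of that matching step — question-word embeddings attending over a grid of image-region features from a CNN; every dimension here is invented:)

```python
import torch
import torch.nn.functional as F

D = 512
regions = torch.randn(49, D)     # e.g. a 7x7 CNN feature map: one vector per image region
question = torch.randn(6, D)     # one contextual vector per question word

# Each question word ("what", "sitting", "basket", ...) scores every region.
scores = question @ regions.T / D ** 0.5     # (6, 49)
weights = F.softmax(scores, dim=-1)          # each word's attention over the image
attended = weights @ regions                 # (6, D): the image as seen by each word
answer_features = attended.mean(dim=0)       # pooled, then decoded into "two dogs"
```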
And how to do this is very hot right now — lots of people trying different models. "I also want to say: there's something very abstract in that top layer — feature vectors of different parts of the image. There's a lot of magic happening in this little arrow coming out of the CNN, and I wonder what it means. Ideally we'd like object-based attention, which people are working on. I can look at you, or look at you, and in a way my world is only covered by you at that point in time. We still do that in a very ugly way. And in a way it should go back — it should go all the way back to the pixels, so you can say: show me the basket; what are the parts of the basket? — to answer good questions about the basket." Yeah. I picked this because it's a state-of-the-art paper by a team I respect, and the current state of the art is: we're still figuring out how to do this structured attention. What Conrad wants is what will be showing up in this community over the next couple of years. It's still fairly primitive, in the sense of not really having hierarchical structure, not having prior knowledge of objects, and not being bidirectional. You should be able to go back and forth between the language and the image and reinterpret. There are great psychology experiments where you say, "Look at the frog on the napkin," and as you see the picture, you reinterpret: is there one frog? Are there two frogs? Oh — the frog that's on the napkin, not the frog that's on the plate. We don't quite have that flexibility in the current architectures; they're pretty primitive, I think. "I should mention that my lab works on some of this — object-based attention. That's one of the reasons I find it interesting." Cool. And that ends the attention piece.

I thought I might talk a bit about a societal piece — or do you want to do something else? "No, the societal piece — this is very important." Okay. I am, unfortunately, a meta-reviewer for ethics for ACL. "Oh, wonderful. Condolences, as we say." So they sent me a couple of papers, and I was reading one last night; I figured you would help me with it. I get a paper, and two of the reviewers flagged it. The paper takes a bunch of questionnaires for measuring personality — extraversion, introversion — and then takes a bunch of people's Twitter data, where those people have answered the questionnaires, and predicts their personality from the text. Sounds sort of cool — the sort of thing I do a lot. A couple of the reviewers said: this seems ethically dubious. So what should I be checking for? Not you first — you can tell me after. What should I be checking for in terms of ethics, if this is going to a conference publication? What would we be worried about? Why are people flagging it? "It's going to be on GitHub — is that going to be a problem?" Maybe. "What are the labels?" Well, there are five labels — the five factors of personality.
And it gives a number for each one. And the labels are designed, in theory by psychologists, not to be insulting — extrovert and introvert — although Americans tend to be pro-extrovert; in general the belief is that it's not really better to be extroverted than introverted. So that might be one concern. And agreeable versus disagreeable — would you rather be agreeable or disagreeable? "I'd take the disagreeable slot." Well, in fact, there's something to be said for being disagreeable: it means you're not as subject to societal pressure, so you can stand up better against peer pressure.

"They're not going to release the names of the people?" That's one concern, but no, they're not releasing names. "So who knows where the data came from — it's certainly not representative." I don't know where the data came from, so you might worry it's misleading in the sense that it only covers some populations. Not enough to reject it, though. "Are the labels accurate?" Who knows whether the labels are accurate — but that's a quality concern rather than an ethics concern. "It could be used for evil, in some sense." There was a great case of this — not evil, maybe; loose ethics, and I can say it did happen. There's a company called Admiral, an insurance company in England — and cultures are very different: China, England, the US; even the US and England are quite different. They introduced a great product. They said: we sell car insurance; share your Facebook with us, if you wish, and we will, entirely privately, analyze it, and based on that we'll decide whether you're eligible for a good-driver discount. They probably checked for conscientiousness. And they launched this product, purely opt-in: all of your posts run through an NLP algorithm like the one in this paper, and if you're eligible for a discount, they give it to you. Good product, bad product? "Evil product?" Evil is a different question — I'm asking a pure business question at the moment. Would it work? Oh, it's accurate — accuracy wasn't the problem; the F1 was great. So how long did the product last? "You don't even need the Facebook profile..." It helps, to know how conscientious you are. Do you spell things correctly? Are you texting as if you were drunk-posting? If you're a careful person who tends not to drunk-post at three in the morning, you're a careful person who's less likely to drive drunk. This is empirically true. "Yeah, I believe it." So there was no problem with the machine learning — the machine learning was fine. The problem, you might guess, is discrimination — which I don't think would have been the issue in the US, but let me ask: could there be a group that the system appears to discriminate against, or actually discriminates against? Could be. That's not actually why it died, but it's probably the case — and they didn't even worry about it — that England, like the US, has a lot of people who speak English as a second language, former empire and all, and their language will look really different.
It might have more spelling errors, even though they're not less conscientious. So it would have discriminated against me, for example. I once ran a correlation of IQ tests with Facebook language, and I found a whole bunch of people in California who scored low on the IQ test. I wondered what was going on. I thought — okay, I'm from California; I know there are lots of Mexican Americans — but no, I was wrong. It turns out there are a lot of Filipinos, and they speak Tagalog. Are Filipinos dumber than Americans? That sounds extremely unlikely. But are they scoring less well on English-language IQ tests than native speakers of English? Yes — it turns out that Tagalog speakers in California, in my non-random sample, score much worse on English-language IQ tests than people who speak English as a first language. Oops.

Anyway, popping back to England. "Okay, but hold on — I want to know the story. They rolled this thing out, and there were groups like this?" It lasted three days, before the flap they got was big enough that they said: on second thought, we messed up; we apologize; we're not going to offer this product anymore. People viewed it as too creepy — even though it was opt-in, private, safe, and nobody in England three years ago complained about discrimination. "There must have been discrimination." There may well have been, but that wasn't why they were forced to take it back.

"So what's next?" No, no — we still have to answer the question, because there is in fact one other big reason, which is the real ethical issue that most reviewers are concerned with. What's the first thing I should check for? I did check for it on this paper. If you publish a paper in, say, natural language processing about people, or about images of people, what do I look to see whether they have? "Inclusiveness?" Inclusiveness is nice, but that's not the first thing on the checklist. "Data privacy?" That would be one issue — getting closer. "Consent?" Consent from the users would be a good thing. It turns out the users did consent — I checked the original source — but the people who wrote this paper weren't the ones who collected the data; they got someone else's data. So what's the process? This was actually a German group, but in the US and most of the world, if you write a paper, whose job is it, as a front line, to check whether it's ethical? The authors of this paper did not do it — which is unfortunately very common for computer scientists. In psychology, everyone does this; in computer science, people don't think about it. What is it? The institutional review board. All major universities, and increasingly major companies, have an organization called the institutional review board, the IRB. Their job is to check ethics, data privacy, data security, and particularly informed consent: did you actually explain to people what you're going to do with their data? Did you tell them whether you'd keep it forever, share it with everybody, share it only with researchers, sell it or not sell it? So: informed consent, and an institutional review board. Had this paper actually been checked and signed off by their institutional review board, I would immediately say: of course, publish it — some professional has reviewed it. But they didn't.
They said: we don't need to, because we're only using someone else's data. Is it always okay to use someone else's data? They didn't collect any data themselves. A lot of computer scientists do this. "Of course — I use ImageNet." Right — ImageNet contains people's faces. Was there informed consent? Of course not; there's never been informed consent. You mean these people are using a bunch of images of people without asking them? Your face could be in there, for all you know. "My face is on the web without anyone asking." That's right — and they scraped your data; that's how ImageNet was made. And there are a couple of companies scraping data in violation of Facebook's and Twitter's terms of service and using it to sell products to the police.

So I think the big takeaway here is: if you're doing anything that involves collecting data about people — personality from their Twitter or Facebook, or their images — you might want to check with someone who does this for a living and ask: hey, is there any problem here? In this particular case, I think the IRB, as we call it, would have approved it. But since they didn't check, I am now, unfortunately, serving as their IRB. It used to be that computer scientists would never do this — never. Now we're starting to be more aware, because it's people's data, and they have rights to it. This is the first year that we actually have a full, formal ethics review process in place for the natural language conference. Welcome to the new world.

Cool. What time is it? We can end now — this is really the piece I wanted to point out: when you collect data, think about whose data it is and what they gave you permission to do. If you download a bunch of scraped faces from the web, or someone else shares data with you, it doesn't necessarily mean people would be chill about, say, their sexual preferences being made public. "And there's also a risk of harm here: if you can know my character from a photograph, you can exploit me even more." Right. Now, if they had been re-sharing the Twitter data, that would be a different issue. Here's one I've hit: I scrape lots of web data, and the IRB here says — Lyle, you scraped these public fora, where all the data is publicly available, and then you collapse it, analyze it, and publish the results; you can't share the data, because someone could go back and re-identify people. Even summarized public data has information in it. In this particular case, the data was not going to be released at all — it's under lock and key, and the results they publish are aggregate, so there's no data leakage. That part was actually clean. But you do worry about it, right? Can I collect your tweets and republish them, even though they're publicly available? Well, first of all, Twitter's terms of service has limitations on re-sharing tweets — so there's a legal question. But note that even publicly available data, re-shared in the right condensed format, can be damaging. This particular paper wasn't going to re-share the data, so I wasn't worried about that.

Awesome — we're at time. We're still talking to people about projects.
We're still happy to meet with you — full speed ahead — and we'll do more NLP next week. It'll be fun. Cool.